G-Graph Tutorial

Copyright 2017-2018 by Peter Andrews at Cold Spring Harbor Laboratory.

What is G-Graph?

G-Graph is a free-software graphical desktop application that allows you to perform efficient exploratory analysis of genomic copy number and other numeric datasets. G-Graph can be installed and run under the Linux, Mac, and Windows operating systems. G-Graph's features include the ability to easily zoom and scroll the view, alter the visual display properties, display gene annotations, link genes to the UCSC genome browser, and save application views to image and pdf format.

G-Graph's target user is a researcher or clinician who needs to easily explore the details of many copy number profiles in order to identify potentially deleterious DNA gain or loss mutations.

The G-Graph paper is currently being prepared for journal submission. An early preprint of the paper is available in pdf format: ggraph.pdf.

Sample G-Graph application view

Below is a G-Graph zoomed-in screenshot showing binned copy number ratios (points) and called segments (lines) for a family of four over a megabase portion of the X chromosome. The binned ratios for the mother and daughter are displayed using red and green points, while the father and son are shown using blue and yellow points. Contextual help text is displayed on the upper status line, hyperlinked gene names and gene structure are shown at the top of the graph region, simulated cytoband staining is shown at the bottom of the graph region, and radio-buttons with various functionality are present along the left, bottom and right borders of the application.

G-Graph screenshot, from the paper

Sample G-Graph session movie

Below is a narrated movie of a short user session of G-Graph, to demonstrate some of its functionality. Press anywhere on the movie image to play it.

Note that the movie is slightly out of date with respect to the current application. It will be updated soon.

Installing G-Graph

The easy way to download, compile and setup

Try automated G-Graph downloading, compilation and setup! Mac OSX requires no prerequisites (though you may want to run "xcode-select --install" first to avoid a complete Xcode install). On Windows, you will first need to install Cygwin as an administrator and install at least the git and wget packages (on the package selection screen, search for them using the search box; git is in devel and wget is in web). On Linux, you will first need to install X development packages, a C++ compiler and git, plus (optionally) ImageMagick and firefox.

Then execute the following two commands in a terminal (a Cygwin terminal on Windows) as an administrator:

git clone https://github.com/docpaa/mumdex/
./mumdex/cn/download_and_setup_ggraph.sh hg19

The script mumdex/cn/download_and_setup_ggraph.sh should install everything you need, including prerequisites on Mac and Windows (you must be an administrator on Windows and Mac; you may be prompted for permissions or asked to go through some install GUIs), and will give you instructions for running G-Graph using the provided sample data. The script accepts one or two command line arguments: "hg19", "hg38", or both ("hg19 hg38").

The hard way to download, compile and setup

G-Graph requires certain prerequisites to be installed before compilation and running of the application. In order to compile G-Graph, you will only need to have Xlib (X11 or Xorg development package) and a C++11 compiler (gcc 4.9 or later, or clang). To run G-Graph, you only need an active X11 display for most functionality. To save images as png and pdf, the ImageMagick package convert program needs to be installed. To link out to the UCSC browser, you need to have a particular web browser installed on Linux (see below).

G-Graph is distributed as part of the MUMdex genome analysis software package. To install G-Graph, first install all prerequisites, then acquire MUMdex source code by either downloading and unzipping the latest MUMdex package zip file, or by locally cloning the MUMdex git repository. The next step is to cd into the resulting "mumdex" directory and type "make ggraph" or "make -j N ggraph" (for a parallel make using N threads) at the command line prompt. A recent Linux distribution is likely to have all prerequisites (just gcc, gsl and Xlib) for MUMdex already installed or easily available. Note that gsl is not required for G-Graph, but is required for some other MUMdex programs. On Mac systems, you will first need to install Xcode and XQuartz (X11). On Windows systems you will first need a linux-like environment such as Cygwin available, making sure to install the g++ compiler and X11 development libraries. On all systems, ImageMagick needs to be installed for the png and pdf saving abilities of G-Graph to be functional.
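Before compiling, it can help to confirm that the command line tools mentioned above are on your PATH. The following is a minimal sketch and is not part of MUMdex. The function name missing_tools is hypothetical, and it only looks for executables, so it cannot tell you whether the Xlib development headers are installed.

```shell
# Hypothetical helper (not part of MUMdex): print the name of every
# command given as an argument that cannot be found on the PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" > /dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Tools named in this section: git to clone the repository, make and a
# C++ compiler to build, and ImageMagick's convert for png and pdf output.
# Substitute clang++ for g++ if you compile with clang.
missing_tools git make g++ convert
```

If the last command prints nothing, all four tools were found. Any name it prints is a tool to install before running "make ggraph".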

Sample data and genome files

The following genome and configuration files are all downloaded and put in the appropriate places in your current directory when you follow "The easy way" method above.

The sample data files used in the paper are found in the /ggraph/data/ directory on this website.

The genome and gene / cytoband definition files are found in the /ggraph/config/hg19/ and /ggraph/config/hg38/ directories on this website.

The G-Graph command line

G-Graph is intended to be launched from a shell prompt using a specific set of command line arguments, which may be customized to suit your individual circumstances (genome, data files, data columns).

Simple usage explained

You can run G-Graph on the sample data files using the following command, assuming you used "The easy way" setup method for hg19 above:

./mumdex/ggraph cn hg19.fa abspos,ratio,seg {m,f,d,s}.txt

The above command line launches the ggraph application. Below, I explain each component of the command line in detail:

The ggraph program is assumed to be located in the mumdex subdirectory of your current working directory. You may alternatively specify an absolute path to ggraph if desired, or place the mumdex directory in your PATH environment variable (so you only need to say "ggraph" with no path).
The "cn" view is selected, which activates genome and copy number features. An alternative is "genome" which suppresses copy number line display. Leaving out this argument and the following (reference) one disables all genome related features, turning G-Graph into a generic graphing program.
The reference fasta file "hg19.fa" is assumed to be located in your current working directory. An absolute path to the reference genome is also allowed, but the directory must be writable by you since G-Graph will create a binary cache of the reference at that location. Any reference may be used, provided the four annotation files are available. G-Graph is hardwired to display only chromosomes above 40 Mbase in size in the order provided in the fasta file, and defines abspos accordingly.
Next, "abspos,ratio,seg" are specified as the column names to display as the X axis and two Y axis values. The column names are arbitrary, but must match actual column names in your data files. If your data files do not have a header line you may select columns by number, starting with 1 for the first column. By default, each Y series is displayed with pointlike markers, unless there are exactly two Y values, where the second Y value (seg above) is displayed as lines. Point or line display for Y values can be explicitly specified by placing a ":p" or ":l" after a column name. For example, the string "time,height:p,weight:l,mass" would plot height and mass as points but weight as lines (points are the default) versus time. Without the ":l", all three Y values would instead be displayed as points. Explicit point and line specification does not currently work with numbered columns.
The data file specification "{m,f,d,s}.txt" is actually shell shorthand for "m.txt f.txt d.txt s.txt", the four tabular data files each containing data for one sample. Absolute paths are also permitted for each data file, and are actually required if the files are not in your current working directory.
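The reference file item above says that G-Graph keeps only chromosomes above 40 Mbase, in fasta order, and defines abspos accordingly. One plausible reading, which is an assumption here rather than something this tutorial states, is that abspos equals the position within a chromosome plus the combined length of all retained chromosomes that come before it. The sketch below computes such offsets from "name length" pairs. The function name abspos_offsets and the toy lengths are hypothetical.

```shell
# Hypothetical helper: read "name length" pairs (one per line, in fasta
# order) and print "name offset" for each sequence longer than 40 Mbase.
# ASSUMPTION: abspos = offset of the chromosome + position within it.
abspos_offsets() {
  awk 'BEGIN { offset = 0 }
       $2 > 40000000 { printf "%s %.0f\n", $1, offset; offset += $2 }'
}

# Toy lengths, not a real genome: the short sequence "b" is skipped.
printf '%s\n' 'a 50000000' 'b 30000000' 'c 60000000' | abspos_offsets
# prints:
# a 0
# c 50000000
```

Under this assumption, a position p on sequence "c" would have abspos 50000000 + p. Check the output of "ggraph --help" or your own data files before relying on this convention.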

Below is a screenshot of the G-Graph application view produced by executing the exact command described above. It displays several differences from the screenshot shown at the beginning of this tutorial:

First, no radio-controls or status line text is visible, because that information is temporarily cleared when the pointer exits the application window, as was the case here.
Second, the entire dataset is displayed because no zoom into chromosome X was done.
Third, the window aspect ratio and size are different (html display may hide size difference - click or download to see images at actual size) because this is the default application size.
Fourth, no cytobands are displayed because that is the default behavior.
Fifth, the point size used here is the smaller default size.
Sixth, no gene structures are displayed, because the X axis range is over 100 Mbase in length.
Seventh, no gene names are displayed, because the X axis range is over 10 Mbase in length.
Eighth, chromosome boundaries are now displayed because this view includes multiple chromosomes.
Ninth, chromosome names have a different font size, because fonts scale with window size and available space between chromosome boundaries.
Tenth, this screenshot is from G-Graph running under Windows with Cygwin, while the first one was from a Mac.

G-Graph default screenshot

This particular view, more than the first view, shows the utility of being able to change the stacking order of the data series and show series in a tiled view. This fully zoomed out display with the default marker size makes the mother's and father's red and blue series practically invisible since they are almost completely covered by the kids' series. Note that the segmentation lines for all four members remain visible above the ratio markers. This is because all corresponding Y values are placed together in the stacking order. Since ratios were specified before segs in the command line, ratios will always be displayed below segs.

Optional command line arguments

A few optional command line arguments are available. If used, they must come immediately after the ggraph command (before cn):

--geometry WIDTHxHEIGHT[+-]XOFF[+-]YOFF
The two "--geometry" command line arguments start G-Graph using a specified window size WIDTHxHEIGHT, and places it relative to your screen borders using XOFF and YOFF, if allowed by your window manager. A plus before either XOFF or YOFF will attempt to position the window relative to the top left corner of your screen, while a minus sign will attempt to position it relative to the bottom right corner.
The five "--initial" command line arguments set the initial view of the application so you can zoom into specific regions of the genome in an automated fashion.
--threads NTHREADS
The two "--threads" command line arguments allow you to set a specific number of threads to use while using G-Graph. Multiple threads are used during data file loading, to speed gene info loading, and to prepare new views of the data in parallel. By default, G-Graph uses up to as many threads as your operating system has available.
--rows NROWS
The two "--rows" command line arguments tell G-Graph to expect exactly the number of data rows per file specified, to save space and prevent progressive memory allocation and copying during loading. If NROWS does not match the actual number of rows nothing bad will happen, but there may be some space or time inefficiency.
The single "--fullscreen" command line argument starts G-Graph at the maximum size for your screen. It overrides the --geometry arguments.
The single "--jitter" command line argument adds randomness to loaded X axis values to prevent all points from being too closely plotted on top of each other. X axis ordering of points is not affected.
--percent PERCENT
The two "--percent" command line arguments tell G-Graph to only load a specified percentage of the data points, which may be useful when loading many large datasets. Usually it would be better to re-analyze the data with larger bins than to use this option.
The single "--help" command line argument displays the complete command usage with some finer usage details.

For each of the optional arguments fully spelled out, there is also a shorter version with one dash and the first character of the full argument. This means that -g, -i, -t, -r, -f, -h will also work when substituted for the full names above.
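The WIDTHxHEIGHT[+-]XOFF[+-]YOFF form of the geometry value can be easier to read once it is split into its parts. The following sketch only illustrates the format and the corner rule described above. It is not how G-Graph itself parses the option, and the function name describe_geometry is hypothetical.

```shell
# Hypothetical helper: explain a geometry string such as 2000x1120+0+0.
# Prints nothing if the string does not match the expected form.
describe_geometry() {
  printf '%s\n' "$1" |
    sed -n -E 's/^([0-9]+)x([0-9]+)([+-])([0-9]+)([+-])([0-9]+)$/\1 \2 \3 \4 \5 \6/p' |
    while read -r width height xsign xoff ysign yoff; do
      # A plus sign measures from the top left corner, a minus sign
      # from the bottom right corner.
      case $xsign in +) xedge=left ;; *) xedge=right ;; esac
      case $ysign in +) yedge=top ;; *) yedge=bottom ;; esac
      echo "window ${width}x${height}, ${xoff} from ${xedge} edge, ${yoff} from ${yedge} edge"
    done
}

describe_geometry 2000x1120+0+0
# prints: window 2000x1120, 0 from left edge, 0 from top edge
```

As noted above, whether the requested position is honored still depends on your window manager.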

Complete command line usage example

For an example which uses many of the optional command line arguments, to exactly reproduce the figure from the paper while using just two threads, execute the command:

./mumdex/ggraph --threads 2 --geometry 2000x1120+0+0 \
       --initial 2983775000 2984775000 0 2.5 --rows 500000 \
       cn hg19.fa abspos,ratio:p,seg:l {m,f,d,s}.txt

and once the application loads, turn on cytoband display and increase the point size by two units for an exact reproduction.

Note that the ":p" and ":l" point and line specification above after the column names ratio and seg are unnecessary in this case, since by default points and lines would have been used when there are exactly two Y values to display.
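The point and line rules for the column specification can be summarized in a few lines of shell. This is a sketch of the rules as described in this tutorial, not G-Graph's own parsing code, and the function name describe_columns is hypothetical. It assumes that an explicit ":p" or ":l" always wins, and that the "second Y value as lines" default applies only to a second Y column that has no explicit suffix.

```shell
# Hypothetical helper: print how each column in a ggraph column
# specification would be drawn, following the rules in this tutorial.
describe_columns() {
  old_ifs=$IFS
  IFS=,
  set -- $1            # split the specification on commas
  IFS=$old_ifs
  echo "$1 x"          # the first column is the X axis
  shift
  ny=$#                # number of Y columns
  i=0
  for col in "$@"; do
    i=$((i + 1))
    case $col in
      *:p) style=points ;;
      *:l) style=lines ;;
      *)   # default: points, except the second of exactly two Y columns
           if [ "$ny" -eq 2 ] && [ "$i" -eq 2 ]; then
             style=lines
           else
             style=points
           fi ;;
    esac
    echo "${col%:[pl]} $style"
  done
}

describe_columns abspos,ratio,seg
# prints:
# abspos x
# ratio points
# seg lines
```

Calling describe_columns with "abspos,ratio:p,seg:l" gives the same result, which is why the explicit suffixes in the example command are optional.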

Other details, mostly repeated

In the examples above, binned ratio and segmentation results were plotted versus absolute genome position for four samples. You can choose to plot more or less data per sample, and more or fewer samples, simply by changing the command line. Your data only need to be text based and tabular in format, with labeled column headings. If you do not have labeled column headings in your data files, just specify column numbers instead of column names on the command line.
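When selecting columns by number, it helps to see which number goes with which field. The sketch below numbers the fields on the first line of a file, starting with 1 as G-Graph does. The function name number_columns is hypothetical, the toy input stands in for a real data file, and the sketch assumes fields are separated by tabs or spaces.

```shell
# Hypothetical helper: number the whitespace-separated fields on the
# first line of a tabular file read from standard input.
number_columns() {
  awk 'NR == 1 { for (i = 1; i <= NF; i++) print i, $i; exit }'
}

# Toy first line only; use "number_columns < m.txt" on a real file.
printf 'abspos\tratio\tseg\n' | number_columns
# prints:
# 1 abspos
# 2 ratio
# 3 seg
```

For a file with this layout but no header line, the column argument would then be "1,2,3" rather than "abspos,ratio,seg".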

Instead of using "cn" on the command line, you can specify "genome" to exclude display of ratio lines. If you leave out the first two arguments, all genome-specific features of G-Graph will be absent so you can plot arbitrary x-y scatterplots using a simplified G-Graph interface.

The G-Graph GUI interface

G-Graph is written to use the low-level Xlib API directly, rather than a standard widget toolkit as most other applications do. This means the G-Graph display can be very fast, the interface is unlike that of other applications since it is entirely custom-designed, and installation is much easier on all desktop platforms since you do not need a specific windowing toolkit. It also means that the G-Graph interface is identical on all platforms. G-Graph does not use typical widgets like the menus you are accustomed to - instead, you interact directly with the application window and its contained radio-controls. This is a direct, effective and very quick method of control which would become unwieldy or impossible if G-Graph had many more features.

The application window

The G-Graph application window can be moved and resized at will using the standard mechanisms allowed by your window manager. This usually means you can move the application by dragging the title bar, and resize the application by dragging the bottom right corner of the application. On Linux and Windows under Cygwin you can also resize by dragging any edge of the application window. G-Graph can usually be closed by typing 'q', clicking on the X at the top right corner in Linux and Windows, the red circle at upper left on Mac, or by hitting control-c on the command line. Window minimization and maximization are also usually available. All these features ultimately depend on the window manager used. If the G-Graph window is resized or moved it will be redrawn, possibly using a different font and radio button size.

The G-Graph status line

The top status line of the G-Graph interface is by default devoted to displaying tooltip messages. Everywhere the pointer focus goes in the application window, the status line will display an appropriate message as to what pointer actions will accomplish at that location. This feature can be turned off using the appropriate radio button control near the top left of the application. You can also replace the tooltip messages with a coordinate display which shows either abspos or chrpos, depending upon the status of the chrpos and gene display radio button control. When a gene name is hovered over, the status line will show annotation for the gene. When the cytoband region is hovered over with the pointer, the standard name for that cytoband will be displayed in the status line.

Non-radio-control pointer actions

In the graphing region and along the borders (with the exception of areas within the radio-controls and gene names), pointer clicks will center, zoom in or zoom out depending on whether the pointer button used is 1, 2 (or shift) or 3 (or control). Similarly, pointer drags will either select, scroll or zoom continuously. Pointer actions in the graphing region will affect both axes while those along the borders will affect only the closest axis. Clicking on gene names (when displayed) will open your web browser to the UCSC genome browser for the region covering the gene.

Not to gloss over an important detail, G-Graph was designed for use with a three-button mouse. If you do not have three buttons on your mouse, or if you use a trackpad or similar device, you can emulate a three button mouse by using shift or control while clicking with your primary button, whatever that is. If you press shift, then click with your primary button, that is the same as a second mouse button click. Likewise, pressing control then clicking will act like a third mouse button click. If you press shift or control after a primary button click but before the click release, the click will still function as a primary button click.

Radio-control pointer actions

The little circles around the left, bottom, and right borders of the G-Graph application are called radio-controls. The radio-controls are used to change the appearance of the graph and control the application. The radio-controls come in two varieties: togglable and non-togglable.

The togglable radio-controls are either in an on state (indicated by a filled black circle in the center) or an off state (empty center) at any time. An example of a togglable control is the control that switches between showing and not showing cytoband information.

The non-togglable controls perform an immediate and repeatable action when pressed (and momentarily appear to be "on"), but do not flip states in an alternating manner like the togglable controls. An example of a non-togglable control is the control that makes the markers bigger, which can be repeated until absurdity arrives.

Some controls may be inactive at times, depending upon certain conditions, and this is indicated by the circular control border being a light gray instead of a black color. For example, if the markers are made smaller repeatedly, when the minimum marker size has been reached the control will become inactive.

The colored radio-controls at the right are intended to allow you to toggle display of individual series, and control the stacking order of the data series in the graphing region. The colors of each control correspond to the color of the markers and lines used to display the series. You can actually display hundreds of series with G-Graph, but after a point the series radio-controls start to overlap and it becomes difficult to resolve and use them. G-Graph does not assume any particular family structure, but the first two series point colors are by default red and blue which may be convenient to assign to mother and father, if applicable for your dataset.

Radio-control help text

Below is the screen shown when the mouse hovers over the help radio button in G-Graph. The screen shows the function of every radio button in the interface.

The exact help text strings (or slightly longer versions) shown in the figures below are displayed by the G-Graph application in the top status line during usage if the pointer enters any radio button region (other than the help text radio). This tooltip displaying behavior is disabled if the help text radio button is turned off.

G-Graph radio control tooltip help screen

Stacked vs tiled view

By default, G-Graph displays all series stacked one on top of another in input file order. If multiple columns are displayed from each file, each column from all files is displayed stacked on top of previous columns from all files. This means that if you display ratio points and segmentation lines from each file, all of the segmentation lines will appear on top of all of the ratio points.
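The stacking rule above amounts to a nested loop with columns on the outside and files on the inside. The sketch below lists series from bottom to top under that reading. It is an illustration only, not G-Graph code, and the function name stacking_order is hypothetical.

```shell
# Hypothetical helper: list series in bottom-to-top drawing order.
# First argument: comma-separated Y columns; remaining arguments: files.
stacking_order() {
  columns=$1
  shift
  old_ifs=$IFS
  IFS=,
  for col in $columns; do      # outer loop: Y columns, in command line order
    for file in "$@"; do       # inner loop: files, in command line order
      echo "$file $col"
    done
  done
  IFS=$old_ifs
}

stacking_order ratio,seg m.txt f.txt d.txt s.txt
```

For the sample command this prints the four ratio series first (m, f, d, s) and then the four seg series, so every segmentation line is drawn above every ratio point.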

An alternate tiled view is available by toggling the tiled view radio button. In the tiled view mode, data from each file is displayed in a pane of its own, so that differences between the data in the files can be seen more easily. Several features are disabled while in tiled view mode, such as display of ratio lines, graph region drag zoom selection, Y axis drag selection and graph region Y click centering and zoom.

A screenshot of the tiled view for the sample dataset is displayed below. This screenshot was taken for G-Graph running under Linux.

G-Graph tiled view screenshot

Series color selection

If you click any series selector radio-control with the middle or right pointer button, a color selection dialog (shown in the figure below) will pop up which allows you to select a color to display the series with by clicking on a color box. Clicking the series radio control with the middle button launches a selector which exits immediately after color selection. Clicking the series radio control with the right button launches a selector that stays open until it is explicitly closed, or until a selection is made with the middle or right pointer button. This behavior allows you to either make an immediate color selection or first check multiple colors for the series until the selection is confirmed. Clicking the lower right restore default series properties radio-control in the main G-Graph window will also reset series colors to the default set.

G-Graph color chooser screenshot

Keyboard input

G-Graph accepts keyboard input in a variety of situations to replicate or extend already existing mouse-based functionality, or to provide new features. Below is a list of the current options:

Known bugs

The --fullscreen option on a dual-screen machine may end up showing half a window on one screen only.

At some scales it looks like the gene display is messed up - gene lines may cover the entire display from left to right when that is not the correct view.

UCSC gene name browser links will only work if your reference name matches the genome name at UCSC.

X11plot graphing application

G-Graph is based on a custom generic graphing module. I have also created another program called x11plot that helps you to more easily explore multiple columns from any tabular dataset. It is included in the MUMdex distribution just like ggraph is. The program x11plot accepts one command line argument - the name of a tabular data file with column headings. The program reads in the file and presents an interface for you to choose an independent (X axis) variable, and one or more dependent (Y axis) variables to display.

X11plot series chooser interface

Below is a screenshot of the x11plot series chooser interface shown when the command "./mumdex/x11plot m.txt" is executed for the sample data file m.txt. It allows you to select one X axis variable, and as many Y axis variables as you want by clicking the pointer in the blank boxes. Once you have chosen one X axis variable and one or more Y axis variables, the top right radio-control becomes active. If you press the radio-control, a graph interface will be displayed for the selected series and the series selections will be cleared. The top left control also allows manual clearing of all your selections at once, or selections may be cleared individually by pressing the appropriate box again. Multiple graphs can be launched using this interface, even if previous graphs were not closed. This program is useful for quickly exploring relationships between arbitrary columns in a dataset without having to type anything (after program launch). In the screenshot, abspos is selected as the X axis variable, while GC content proportion and the copy number ratio are selected as the Y axis variables from the sample data file m.txt, so the graph is ready to launch when the top right radio-control is clicked.

X11plot series chooser screenshot

X11plot graphing interface

Below is the graph displayed by the previous section's launch. It uses the same basic interface as G-Graph, with some small differences. First, all genome-specific interface elements are missing. Second, the default point size display is much smaller, which is more appropriate for displaying arbitrary very large datasets. Third, minor and major grid lines are displayed by default in x11plot graphs. Fourth, all series are displayed using just points, though line-based display for all series can be enabled using the controls at the bottom right. If you need to display points for some series and lines for others in the same graph, use ggraph instead.

X11plot graph screenshot

Customizing G-Graph and x11plot

G-Graph is written in C++, uses the low-level X11 libraries for display, and is based on an extensible custom generic graphing module defined in the mumdex/utility/x11plot.h header file. You can edit that file and recompile to change things like the default point size or colors. All of the genome-specific functionality of G-Graph is implemented using call-back functions presented to the lower-level generic graphing module in the mumdex/cn/ggraph.cpp source file. You can edit that file to introduce new functionality - for example see how solid horizontal ratio lines were added to the display using the add_ratio_lines function.