MUMdex Genome Alignment and Analysis Software

Copyright 2015-2017 Peter Andrews @ CSHL Wigler lab

What is MUMdex?

The name MUMdex refers to the alignment format for genomic sequencing data, but can also refer to the overall project or to the genome analysis software package.

The MUMdex alignment format is a compact binary representation of a mapped sequencing dataset, which is useful when analyzing one or more samples to find interesting mutations. The MUMdex format transforms the input sequence into the set of maximal unique matches (MUMs) to the reference genome found on each read pair, and encodes any non-mapped bases so that the full input sequence can be reconstructed easily.
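To make the idea of a MUM concrete, here is a minimal sketch in Python. It assumes that a MUM is an exact match between a read and the reference that cannot be extended at either end and whose sequence occurs exactly once in the reference. The function names, the min_length parameter and the brute-force search are illustrative assumptions only; the MUMdex mapper itself uses a suffix array and is not shown here.

```python
def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, including overlapping ones."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def find_mums(reference, read, min_length=5):
    """Return (read_offset, reference_position, length) for each MUM.

    Brute-force illustration only (hypothetical helper, not MUMdex code).
    """
    mums = []
    for i in range(len(read)):
        for j in range(len(reference)):
            if read[i] != reference[j]:
                continue
            # Skip matches that could be extended to the left.
            if i > 0 and j > 0 and read[i - 1] == reference[j - 1]:
                continue
            # Extend to the right for as long as the bases agree.
            length = 0
            while (i + length < len(read)
                   and j + length < len(reference)
                   and read[i + length] == reference[j + length]):
                length += 1
            if length < min_length:
                continue
            # Keep the match only if its sequence is unique in the reference.
            if count_occurrences(reference, read[i:i + length]) == 1:
                mums.append((i, j, length))
    return mums


# A read that skips three reference bases (a small deletion) yields two MUMs.
print(find_mums("ACGTGACCTTAGGCA", "ACGTGATAGGCA", min_length=5))
# prints [(0, 0, 6), (6, 9, 6)]
```

The two MUMs sit on different diagonals (reference position minus read offset is 0 for the first and 3 for the second), which is the kind of signal that lets an indel be read off from the MUMs of a single read.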

Having all the MUMs from a read pair available allows the detection of SNPs, indels, inversions, translocations and other complex and non-local rearrangements. Reference allele counts are determined under the same conditions as alternate allele counts, nearly eliminating mapping-related bias.

MUMs can be accessed in a contiguous block for each read pair in turn, or in genome order using a sorted index. The MUMdex format is random-access and memory-mapped, does not use time-intensive compression or decompression, and typically takes less space than the input sequence. The MUMdex index is constructed using a suffix array, much more quickly than with programs such as Bowtie or BWA, and can be queried or read in its entirety in much less time than a BAM file. Information such as quality scores and SAM-like optional fields can be retained and accessed as well.
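As a rough illustration of why memory-mapped, fixed-width records give cheap random access, the Python sketch below writes a few records to a temporary file and then reads one back by index without loading the whole file. The record layout (reference position, read offset, match length) and all names are hypothetical and are not the actual MUMdex on-disk format.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical fixed-width record: reference position (uint32),
# read offset (uint16), match length (uint16). Not the MUMdex layout.
RECORD = struct.Struct("<IHH")


def write_records(path, records):
    """Write each record as a fixed-width binary entry."""
    with open(path, "wb") as handle:
        for record in records:
            handle.write(RECORD.pack(*record))


def read_record(mapped, index):
    """Decode record number `index` directly from the mapped bytes."""
    return RECORD.unpack_from(mapped, index * RECORD.size)


def demo():
    """Round-trip three records and fetch the last one by index."""
    records = [(100, 0, 40), (250, 40, 60), (9000, 0, 100)]
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "records.bin")
        write_records(path, records)
        with open(path, "rb") as handle:
            with mmap.mmap(handle.fileno(), 0,
                           access=mmap.ACCESS_READ) as mapped:
                count = len(mapped) // RECORD.size
                return read_record(mapped, 2), count


print(demo())  # prints ((9000, 0, 100), 3)
```

Because every record has the same size, locating record k is a single multiplication, and the operating system pages in only the parts of the file that are actually touched.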

The MUMdex package comes with many programs for genome analysis. A typical analysis workflow first performs alignment using the MUMdex mummer program, then merges the alignment results into a MUMdex database using merge_mumdex. Bridges can then be extracted using the bridges program. Once a family or a population is processed, population_bridges can be used to find novel events such as de novo mutations.

There is also a copy number component to the MUMdex package. mumdex_cn will extract copy number results from a MUMdex file, while ggraph interactively displays copy number profiles. Programs like mumdex_finebin and empirical_bins allow you to create suitable bin boundaries from data over a population.
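One common way to derive bin boundaries from data is to choose them so that every bin holds roughly the same number of observations. The Python sketch below shows that idea on a list of mapped positions. It is an assumption made for illustration: this page does not describe how mumdex_finebin or empirical_bins choose their boundaries, and the function name and parameters are hypothetical.

```python
def equal_count_bins(positions, n_bins):
    """Return n_bins + 1 boundaries so each bin holds about the same count.

    Bins are half-open intervals [boundaries[k], boundaries[k + 1]).
    Hypothetical helper for illustration, not part of MUMdex.
    """
    ordered = sorted(positions)
    if n_bins < 1 or not ordered:
        raise ValueError("need at least one bin and one position")
    boundaries = [ordered[0]]
    for k in range(1, n_bins):
        # Start bin k at the observation that is k / n_bins of the way through.
        boundaries.append(ordered[k * len(ordered) // n_bins])
    # Close the last bin just past the largest position.
    boundaries.append(ordered[-1] + 1)
    return boundaries


print(equal_count_bins([1, 2, 3, 4, 10, 20, 30, 40], 4))
# prints [1, 3, 10, 30, 41]
```

In this example each of the four bins contains two positions even though the bin widths differ widely. Heavily repeated positions can produce duplicate boundaries, which a real tool would need to handle.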

The MUMdex C++ Software Package

Latest / Recent Version

The latest released version of MUMdex can always be found on the MUMdex GitHub page at https://github.com/docpaa/mumdex/.

You can download MUMdex using git by executing the command:

git clone https://github.com/docpaa/mumdex/

A recent version of MUMdex is frequently uploaded to this site at mumdex.zip. This zip version was uploaded on 2018-07-13 at 12:11:11 -0400.

All MUMdex programs can be compiled by executing "make" from within the mumdex directory after cloning from git or unzipping the zip file. See the MUMdex requirements section below.

G-Graph

The ggraph program distributed with MUMdex software is an X11-based copy number viewer GUI which we believe is a great way to interactively explore copy number and other scatter plot data. The ggraph application runs under Linux, Mac with XQuartz, and Windows with Cygwin-X.

We are getting ready to publish a paper on the G-Graph software, and you can preview it here: ggraph.pdf.

A G-Graph tutorial is currently under construction at ggraph/.

MUMdex Requirements

MUMdex requires at least a C++11 compiler.

Some individual programs in the MUMdex package require X11 or GSL support.

Python is optional during compilation, where it is used to lint the code (check for warnings).

Python 2.6 or 2.7 and numpy are required for the Python package.

Creating the suffix array index for the human genome requires ~120 GB of memory. Mapping expects ~120 GB but will run, slowly, with much less. MUMdex file access is memory mapped, so it runs with minimal memory.

Tested on Linux (CentOS 5 and 6) using GCC 4.9.2, 5.5.0, 6.4.0 and 7.2.0, and on Mac (Darwin) using clang.

MUMdex Tutorial

See TUTORIAL.txt in the MUMdex distribution.

MUMdex Propaganda

MUMdex presentation

Please see a MUMdex presentation at talk.pdf.

bioRxiv paper

The MUMdex paper is on the bioRxiv site.

The software (an older version from the time of publication): mumdex.zip

The paper served from this site: mumdex_paper.pdf

Supplementary figures: mumdex_figures.pdf

Supplementary table: mumdex_table.xls

Population database: txt.gz files, genome used

The MUMdex Python Package

Most of the MUMdex package is written in C++, but an interface is also written in Python for ease of use and for distribution purposes.

All Releases (just the Python parts and the selected C++ code they use)

MUMdex Python Package Contents

The MUMdex Python package provides various scripts and classes:

Scripts

Classes

Please see the extensive Python help documentation and the example scripts, which are included in the package and are also installed wherever Python puts executables on your system.

To Get Started:

Linux$ tar -xvzf mumdex-0.XX.tar.gz
Linux$ cd mumdex-0.XX/
Linux$ python setup.py install --user
Linux$ python
>>> import mumdex
>>> help(mumdex)

You may need to edit the RECENT_GCC_BASE line in setup.py so that it points to the base install directory of a recent GCC version.

Further Information

Please contact the author of this software, Peter Andrews, via email at paa@drpa.us if you would like assistance.