MaryGold Code

Variation analysis of metagenomic samples

Status: Alpha

Brought to you by: jfnijkamp

File	Date	Author	Commit
scripts	2012-12-04	Jurgen Nijkamp	[c249f3] fully working on single core
src	2013-05-10	Jurgen Nijkamp	[a25952] changes for release 0.2
AUTHORS	2012-08-02	Jurgen Nijkamp	[6dfc42] Initial commit
COPYING	2013-04-24	Jurgen Nijkamp	[f9c3f6] code for release 0.1
ChangeLog	2012-08-02	Jurgen Nijkamp	[6dfc42] Initial commit
Makefile.am	2012-10-05	Jurgen Nijkamp	[a79dbe] motifDiff added
NEWS	2012-08-02	Jurgen Nijkamp	[6dfc42] Initial commit
README	2013-05-10	Jurgen Nijkamp	[a25952] changes for release 0.2
bootstrap	2012-08-07	Jurgen Nijkamp	[5b350a] onnly searcg through biconnected component
configure.ac	2014-01-13	Jurgen Nijkamp	[9c1930] Removed getPathweights from configure

Read Me

--*** MaryGold README ***--

DESCRIPTION

MaryGold is an open-source software package for the analysis
of contig graphs generated from next-generation sequencing data.
By decomposing contig graphs into bi- and triconnected components,
MaryGold generates potential source-sink pairs of bubbles
that represent sequence variation.

MaryGold was designed for variation detection in metagenomics samples.
Both variation within and between metagenomics datasets can be explored.

MaryGold can also be used for sequence variation detection between
two single genomes by co-assembling them.

If you use MaryGold, please cite:

J.F. Nijkamp, M. Pop, M.J.T. Reinders and D. de Ridder
Exploring variation aware contig graphs for (comparative) metagenomics using MaryGold.
submitted

Contact information: http://bioinformatics.tudelft.nl

INSTALLATION

Binaries are provided on MaryGold's SourceForge page. These binaries
were compiled on, and have only been tested on, 64-bit Linux machines.
The AMOS and OGDF libraries are statically linked; Boost is
dynamically linked (required for 'printCounts').

Some postprocessing steps of MaryGold require Python, such as generating
the input files for Circos and finding linear paths in the compressed
contig graph.

If you wish to use the software on another platform you will probably
have to compile MaryGold yourself.

Requirements for compilation of the C++ code:
- The AMOS library
- The Open Graph Drawing Framework (OGDF)
- The Boost library

Instructions:
./bootstrap
./configure
make
make install

If the required libraries are not in the standard path, pass their locations to configure:
./configure \
--with-AMOS-include-path=../amos/include/AMOS \
--with-AMOS-lib-path=../amos/lib/AMOS/ \
--with-BOOST-include-path=/usr/include/boost \
--with-BOOST-lib-path=/usr/lib64/ \
--with-OGDF-include-path=../OGDF/ \
--with-OGDF-lib-path=/data../OGDF/_release/

PYTHON REQUIREMENTS

The Python parts of MaryGold require the modules numpy, scipy, editdist and biopython.
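
A quick way to check these requirements before running the Python scripts
is sketched below. It is not part of MaryGold. The function name
missing_modules is hypothetical, and the import names are assumptions:
biopython is imported as "Bio", and editdist is assumed to be importable
as "editdist".

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names assumed for the modules listed above (biopython -> "Bio").
REQUIRED = ["numpy", "scipy", "editdist", "Bio"]
```

Calling missing_modules(REQUIRED) returns an empty list when all four
modules are installed, and otherwise the names that still need installing.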

USAGE

A. Finding multi-allelic sites with MaryGold

MaryGold consists of three main steps:

1. Converting the AMOS graph information (CTE and CTG banks) to GML:
bnk2gml -b my.bnk > graph.gml

2. Finding separation pairs by decomposing the graph into bi- and triconnected components:
getSeppairs -i graph.gml > seppairs.txt

3. Running the bubble search algorithm using the separation pairs as seeds:
buildMotifs [-troks] -b my.bnk -q seppairs.txt
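
To illustrate what step 2 looks for, the sketch below finds separation
pairs by brute force: a pair of nodes is reported when removing both
disconnects the graph. This is only a toy illustration on a small
undirected graph, not the decomposition that getSeppairs performs. The
function names and the example graph are hypothetical.

```python
from itertools import combinations


def is_connected(adj, removed):
    """Check whether the graph stays connected after deleting `removed` nodes."""
    nodes = [n for n in adj if n not in removed]
    if not nodes:
        return True
    seen = {nodes[0]}
    stack = [nodes[0]]
    while stack:
        node = stack.pop()
        for neighbour in adj[node]:
            if neighbour not in removed and neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return len(seen) == len(nodes)


def separation_pairs(edges):
    """Return all node pairs whose removal disconnects the graph (brute force)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return [
        (u, v)
        for u, v in combinations(sorted(adj), 2)
        if not is_connected(adj, {u, v})
    ]


# A small bubble: two alternative paths (via a and via b) between s and t,
# plus a third path via x that keeps the graph biconnected.
bubble_edges = [("s", "a"), ("a", "t"), ("s", "b"), ("b", "t"), ("t", "x"), ("x", "s")]
pairs = separation_pairs(bubble_edges)
```

In this toy graph the only separation pair is (s, t), which corresponds to
the source and sink of the bubble. The brute-force check tests every pair
and is only practical for very small graphs.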

The examples that follow use three E. coli samples whose reads have been labelled
with the strain name at the end of each read name. For example, read names in the
FASTA file could be:
>r77136_1_O157
>r15478_1_HS
>r126947_1_K12
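
If your reads are not labelled yet, a suffix can be added to the FASTA
headers before assembly. The sketch below is one possible way to do this
and is not part of MaryGold. The function name label_fasta_headers is
hypothetical, and it assumes that the read name is the first
space-delimited token of the header line.

```python
def label_fasta_headers(lines, strain):
    """Yield FASTA lines with '_<strain>' appended to each read name."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Split the header into the read name and an optional description.
            read_id, _, description = line[1:].partition(" ")
            line = ">" + read_id + "_" + strain
            if description:
                line += " " + description
        yield line


labelled = list(label_fasta_headers([">r77136_1\n", "ACGT\n"], "O157"))
```

Here labelled holds the header ">r77136_1_O157" followed by the unchanged
sequence line.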

B. Generating some informative files

1. The read depths per contig per sample

printCounts calculates the read depths and read counts for each sample, using one regular expression per sample to assign reads:

printCounts -x ".*O157;.*HS;.*K12" -b ../proba.bnk/
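
The idea behind the -x argument can be sketched in Python as follows. This
is a simplified illustration, not the printCounts implementation. It
assumes that the expressions are separated by semicolons, as in the
command above, and that each expression must match the whole read name.
The function name is hypothetical.

```python
import re


def count_reads_per_sample(read_names, pattern_spec):
    """Count, per sample, the read names matching that sample's expression."""
    patterns = [re.compile(p) for p in pattern_spec.split(";")]
    counts = [0] * len(patterns)
    for name in read_names:
        for index, pattern in enumerate(patterns):
            # Assumption: a read belongs to a sample on a full-name match.
            if pattern.fullmatch(name):
                counts[index] += 1
    return counts


reads = ["r77136_1_O157", "r15478_1_HS", "r126947_1_K12", "r20001_1_HS"]
counts = count_reads_per_sample(reads, ".*O157;.*HS;.*K12")
```

The counts come back in the same order as the expressions, so here the
list holds the counts for O157, HS and K12.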

2. Set a threshold on the read depth to indicate whether a contig has enough reads
to belong to a sample. Either set one threshold per sample, or use a single
threshold for all samples.

readDepth2member -d readdepths.txt -t '1.2;1.5;2' > membership.txt
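
The thresholding step can be pictured with the sketch below. It is an
illustration only, not the readDepth2member implementation. The function
names are hypothetical, the layout of readdepths.txt is not assumed, and
treating a depth equal to the threshold as a member is also an assumption.

```python
def parse_thresholds(spec, n_samples):
    """Parse '1.2;1.5;2' into one threshold per sample; a single value is reused."""
    values = [float(v) for v in spec.split(";")]
    if len(values) == 1:
        values = values * n_samples
    if len(values) != n_samples:
        raise ValueError("expected 1 or %d thresholds, got %d" % (n_samples, len(values)))
    return values


def membership(depths, thresholds):
    """Return, per sample, whether a contig's read depth reaches the threshold."""
    # Assumption: a depth exactly equal to the threshold counts as a member.
    return [depth >= threshold for depth, threshold in zip(depths, thresholds)]


thresholds = parse_thresholds("1.2;1.5;2", 3)
member = membership([3.0, 0.4, 2.0], thresholds)
```

With these example depths the contig is assigned to the first and third
sample, but not to the second.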

3. Generate ID map

iid2eid -b ../proba.bnk/ > iid2eid.txt

C. Generating linear sequences

This will generate linearscaf.fasta and linearscaf.txt:

python motiftigger.py -f motifs.txt \
-d readdepths.txt \
-m membership.txt \
-i iid2eid.txt \
-g compressed.gml \
-b my.bnk/ \
-o linearscaf.txt

D. CIRCOS: Generating the Circos figure with multi-allelic sites

1. Generating the Circos source files

python toCircos.py -b my.bnk -r readdepths.txt -z membership.txt -m motifs.txt -c compressed.gml -i iid2eid.txt

This will generate a number of files, which are required for Circos:
bands.conf
distfile.txt (Average edit distances between paths in bubble)
hist.stacked.0.txt (Inferred read depths for paths through the bubble)
hist.stacked.1.txt
hist.stacked.2.txt
ideogram.conf
ideogram.label.conf
ideogram.position.conf
karyotype.txt
marygold.conf
ticks.conf

The number of hist.stacked.*.txt files depends on the number of samples.
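
Before running Circos it can help to check that all source files are
present. The sketch below builds the expected file list for a given number
of samples. It is not part of MaryGold and the function names are
hypothetical. It assumes one hist.stacked file per sample, numbered from 0
as in the list above.

```python
import os

FIXED_FILES = [
    "bands.conf",
    "distfile.txt",
    "ideogram.conf",
    "ideogram.label.conf",
    "ideogram.position.conf",
    "karyotype.txt",
    "marygold.conf",
    "ticks.conf",
]


def expected_circos_files(n_samples):
    """List the files toCircos.py is expected to write for n_samples samples."""
    # Assumption: one histogram file per sample, zero-indexed.
    histograms = ["hist.stacked.%d.txt" % i for i in range(n_samples)]
    return sorted(FIXED_FILES + histograms)


def missing_circos_files(directory, n_samples):
    """Return the expected files that do not exist in `directory`."""
    return [
        name
        for name in expected_circos_files(n_samples)
        if not os.path.isfile(os.path.join(directory, name))
    ]
```

For the three-sample example, missing_circos_files(".", 3) returns an
empty list when the current directory is ready for Circos.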

2. Now you are ready to generate the Circos figure:

circos -conf marygold.conf
