File          Date        Author          Commit
scripts       2012-12-04  Jurgen Nijkamp  [c249f3] fully working on single core
src           2013-05-10  Jurgen Nijkamp  [a25952] changes for release 0.2
AUTHORS       2012-08-02  Jurgen Nijkamp  [6dfc42] Initial commit
COPYING       2013-04-24  Jurgen Nijkamp  [f9c3f6] code for release 0.1
ChangeLog     2012-08-02  Jurgen Nijkamp  [6dfc42] Initial commit
Makefile.am   2012-10-05  Jurgen Nijkamp  [a79dbe] motifDiff added
NEWS          2012-08-02  Jurgen Nijkamp  [6dfc42] Initial commit
README        2013-05-10  Jurgen Nijkamp  [a25952] changes for release 0.2
bootstrap     2012-08-07  Jurgen Nijkamp  [5b350a] onnly searcg through biconnected component
configure.ac  2014-01-13  Jurgen Nijkamp  [9c1930] Removed getPathweights from configure

Read Me

--*** MaryGold README ***--


DESCRIPTION

MaryGold is an open-source software package for the analysis
of contig graphs generated from next-generation sequencing data.
By decomposing contig graphs into bi- and triconnected components,
MaryGold generates potential source-sink pairs of bubbles
that represent sequence variation.
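
To illustrate what a source-sink candidate looks like, the sketch below
finds separation pairs in a tiny undirected graph by brute force: a pair of
vertices is reported when removing both disconnects the rest of the graph.
This is an illustration only and not MaryGold's implementation; the graph,
the contig names and the function names are all hypothetical, and the
quadratic pair enumeration is only practical for toy inputs.

```python
from itertools import combinations


def is_connected(adj, removed):
    """Return True if the graph stays connected after deleting `removed`."""
    nodes = [n for n in adj if n not in removed]
    if not nodes:
        return True
    seen = {nodes[0]}
    stack = [nodes[0]]
    while stack:
        cur = stack.pop()
        for nb in adj[cur]:
            if nb not in removed and nb not in seen:
                seen.add(nb)
                stack.append(nb)
    return len(seen) == len(nodes)


def separation_pairs(adj):
    """Brute force: every vertex pair whose removal disconnects the graph."""
    return [pair for pair in combinations(sorted(adj), 2)
            if not is_connected(adj, set(pair))]


# Hypothetical toy contig graph: contigs "a" and "b" are alternative paths
# between "s" and "t"; the direct s-t edge models a third path.
toy_graph = {
    "s": {"a", "b", "t"},
    "a": {"s", "t"},
    "b": {"s", "t"},
    "t": {"a", "b", "s"},
}

print(separation_pairs(toy_graph))
```

On this toy graph the only pair reported is ('s', 't'), the two contigs
that flank the bubble.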

MaryGold was designed for variation detection in metagenomics samples.
Both variation within and between metagenomics datasets can be explored.

MaryGold can also be used for sequence variation detection between
two single genomes by co-assembling them.


If you use MaryGold, please cite:

J.F. Nijkamp, M. Pop, M.J.T. Reinders and D. de Ridder 
Exploring variation aware contig graphs for (comparative) metagenomics using MaryGold.
submitted

Contact information: http://bioinformatics.tudelft.nl


INSTALLATION

Binaries are provided on MaryGold's SourceForge page. These binaries
were compiled on, and have only been tested on, 64-bit Linux machines.
The AMOS and OGDF libraries are statically linked. Boost, which is
required for 'printCounts', is dynamically linked.

Some postprocessing parts of MaryGold require Python, such as generation
of the input files for Circos and finding linear paths in the compressed 
contig graph. 

If you wish to use the software on another platform, you will probably
have to compile MaryGold yourself.

Requirements for compilation of the C++ code:
- The AMOS library
- The Open Graph Drawing Framework (OGDF)
- The Boost library

Instructions:
./bootstrap
./configure
make
make install


If the required libraries are not in the standard path, pass their locations to configure:
./configure \
--with-AMOS-include-path=../amos/include/AMOS \
--with-AMOS-lib-path=../amos/lib/AMOS/  \
--with-BOOST-include-path=/usr/include/boost  \
--with-BOOST-lib-path=/usr/lib64/  \
--with-OGDF-include-path=../OGDF/  \
--with-OGDF-lib-path=/data../OGDF/_release/


PYTHON REQUIREMENTS

The Python parts of MaryGold require the modules numpy, scipy, editdist and biopython.
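
A quick way to see which of these modules are missing is sketched below.
It assumes a Python 3 interpreter and the usual import names: 'Bio' for
biopython and 'editdist' for the editdist module. The function name is
hypothetical and not part of MaryGold.

```python
import importlib.util

# Import names for the required modules; biopython is imported as "Bio".
REQUIRED_MODULES = ["numpy", "scipy", "editdist", "Bio"]


def missing_modules(names=REQUIRED_MODULES):
    """Return the names that cannot be found by the current interpreter."""
    return [name for name in names
            if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing Python modules: " + ", ".join(missing))
    else:
        print("All required Python modules were found.")
```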



USAGE

A. Finding multi-allelic sites with MaryGold
  
  MaryGold consists of three main steps:
  
  1. Convert the AMOS graph information (CTE and CTG banks) to GML:
    bnk2gml -b my.bnk > graph.gml
  
  2. Find separation pairs by decomposing the graph into bi- and triconnected components:
    getSeppairs -i graph.gml > seppairs.txt
  
  3. Run the bubble search algorithm using the separation pairs as seeds
    buildMotifs [-troks] -b my.bnk -q seppairs.txt 
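
The three steps above can be chained from a small Python helper. The
sketch below only builds the command lines, using the file names from the
steps above as defaults; the function name is hypothetical and the optional
buildMotifs flags are left out because their meaning is not described here.

```python
import shlex


def pipeline_commands(bank="my.bnk", gml="graph.gml", seppairs="seppairs.txt"):
    """Return the three shell command lines from the steps above, in order."""
    q = shlex.quote  # protect paths that contain spaces or shell characters
    return [
        f"bnk2gml -b {q(bank)} > {q(gml)}",
        f"getSeppairs -i {q(gml)} > {q(seppairs)}",
        f"buildMotifs -b {q(bank)} -q {q(seppairs)}",
    ]


for command in pipeline_commands():
    print(command)
```

Each printed line can be run in a shell, or passed to
subprocess.run(command, shell=True, check=True) so that the chain stops at
the first failing step.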

  The examples below use three E. coli samples. The strain name has been appended
  to the name of each read; read names in the FASTA file could, for example, be:
    >r77136_1_O157
    >r15478_1_HS
    >r126947_1_K12
  

B. Generating some informative files
  
  1. The read depths per contig per sample
    
     printCounts calculates the read depths and read counts for each sample, using regular expressions to assign reads to samples:
    
       printCounts -x ".*O157;.*HS;.*K12" -b ../proba.bnk/
  
  2. Set a threshold on the read depth to indicate whether the contig has enough reads
     to belong to the sample. The threshold can either be set per sample, or one
     threshold for all samples.
     
       readDepth2member -d readdepths.txt -t '1.2;1.5;2' > membership.txt
       
  3. Generate the ID map:
     
      iid2eid -b ../proba.bnk/ > iid2eid.txt 
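
The idea behind steps 1 and 2 can be sketched in a few lines of Python.
This is not the code of printCounts or readDepth2member. It assumes that
each ';'-separated regular expression must match the whole read name, that
a read goes to the first matching sample, that a single threshold applies
to all samples, and that a contig belongs to a sample when its depth is at
least the threshold. All function names and the depth values are
hypothetical.

```python
import re


def count_reads(read_names, pattern_string):
    """Count reads per sample; one ';'-separated regex per sample."""
    patterns = [re.compile(p) for p in pattern_string.split(";")]
    counts = [0] * len(patterns)
    for name in read_names:
        for index, pattern in enumerate(patterns):
            if pattern.fullmatch(name):  # assumption: whole-name match
                counts[index] += 1
                break                    # assumption: first match wins
    return counts


def parse_thresholds(spec, n_samples):
    """Parse '1.2;1.5;2' (per sample) or a single value for all samples."""
    values = [float(v) for v in spec.split(";")]
    if len(values) == 1:
        values = values * n_samples
    if len(values) != n_samples:
        raise ValueError("expected 1 or %d thresholds, got %d"
                         % (n_samples, len(values)))
    return values


def membership(depths, thresholds):
    """True for each sample whose read depth reaches its threshold."""
    return [depth >= t for depth, t in zip(depths, thresholds)]


# The last read name is made up to give one sample two reads.
reads = ["r77136_1_O157", "r15478_1_HS", "r126947_1_K12", "r20001_2_HS"]
print(count_reads(reads, ".*O157;.*HS;.*K12"))

thresholds = parse_thresholds("1.2;1.5;2", 3)
print(membership([3.0, 0.4, 2.5], thresholds))  # made-up depths for one contig
```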
             


C. Generating linear sequences

  This will generate linearscaf.fasta and linearscaf.txt:
  
  python motiftigger.py -f motifs.txt \
                        -d readdepths.txt \
                        -m membership.txt \
                        -i iid2eid.txt \
                        -g compressed.gml \
                        -b my.bnk/ \
                        -o linearscaf.txt



D. CIRCOS: Generating the circos figure with multi-allelic sites
  
  
  1. Generating the Circos source files

    python toCircos.py -b my.bnk -r readdepths.txt -z membership.txt -m motifs.txt -c compressed.gml -i iid2eid.txt
    
    This will generate a number of files, which are required for Circos:
      bands.conf
      distfile.txt              (Average edit distances between paths in bubble)
      hist.stacked.0.txt        (Inferred read depths for paths through the bubble)
      hist.stacked.1.txt        
      hist.stacked.2.txt
      ideogram.conf
      ideogram.label.conf
      ideogram.position.conf
      karyotype.txt
      marygold.conf
      ticks.conf
      
    The number of hist.stacked.*.txt files depends on the number of samples.

  2. Now you are ready to generate the Circos figure:
    
    circos -conf marygold.conf
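
Before invoking circos, it can be useful to check that all files from
step 1 are present. The sketch below assumes one hist.stacked.N.txt file
per sample, numbered from 0, as the listing for three samples suggests.
The function names are hypothetical.

```python
from pathlib import Path

# Files listed in step 1 that do not depend on the number of samples.
FIXED_FILES = [
    "bands.conf",
    "distfile.txt",
    "ideogram.conf",
    "ideogram.label.conf",
    "ideogram.position.conf",
    "karyotype.txt",
    "marygold.conf",
    "ticks.conf",
]


def expected_circos_files(n_samples):
    """File names expected for a run with `n_samples` samples."""
    hist = ["hist.stacked.%d.txt" % i for i in range(n_samples)]
    return sorted(FIXED_FILES + hist)


def missing_circos_files(directory, n_samples):
    """Expected files that are not present in `directory`."""
    directory = Path(directory)
    return [name for name in expected_circos_files(n_samples)
            if not (directory / name).is_file()]


print(expected_circos_files(3))
```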
    
    