DistMap provides a scalable and portable framework to map short reads on a Hadoop cluster. It is a complete pipeline especially designed for non-experts. Currently, DistMap supports 9 mappers:
BWA (http://soap.genomics.org.cn/soapaligner.html)
GSNAP (http://soap.genomics.org.cn/soapaligner.html)
TopHat (http://tophat.cbcb.umd.edu/)
Bowtie (http://bowtie-bio.sourceforge.net/index.shtml)
Bowtie2 (http://sourceforge.net/projects/bowtie-bio/files/bowtie2/2.0.6/)
SOAP (http://soap.genomics.org.cn/soapaligner.html)
STAR (http://gingeraslab.cshl.edu/STAR/)
Bismask (http://www.bioinformatics.babraham.ac.uk/projects/bismark/)
BSMAP (http://code.google.com/p/bsmap/)
DistMap supports paired-end and single-end read mapping in FASTQ file format. It returns the final output as a SAM or BAM file.
DistMap componentsThe DistMap workflow consists of seven main modules which can be executed either end-to-end by a single command, or each module can be executed by giving appropriate command line flags..
The first module of DistMap creates genome indices and uploads these into the HDFS file system. Genome indices from previous DistMap runs are maintained as tgz files such that the user can re-use these via the option –reference-index-archive on the DistMap command line. The user can execute Genome indexing as a stand-alone module by giving the flag –only-index in the DistMap command line.
The second module of DistMap takes all input parameters, FASTQ formatted reads and mapper executables from the local computer and creates a final archive. This archive is then uploaded to the cluster. The user can execute Data processing as a stand-alone module by giving the flag –only-process in the DistMap command line.
This module loads the processed reads and the archive created in module 1 into the HDFS file system. The user can execute Data upload as a stand-alone module by giving the flag –only-hdfs-upload in the DistMap command line.
This module facilitates the mapping of the reads to the reference genome. The user can execute Data mapping as a stand-alone module by giving the flag –only-map in the DistMap command line.
This module implements data transfer from the HDFS on the local computer after the completion of mapping. The user can execute Data download as a stand-alone module by giving the flag –only-hdfs-download in the DistMap command line.
This module results in the combination of all outputs into a single SAM or BAM file in the local output directory. The user can execute Post processing output as a stand-alone module by giving the flag –only-merge in the DistMap command line.
The final module deletes all intermediate input and output files stored within the HDFS and on the local computer. The Data cleanup module is optional and it can be executed as a stand alone module with the flag –only-delete-temp in the DistMap command line.
DistMap is implemented in Perl and runs on all Unix operating systems. It requires Perl 5.8 or higher, MergeSamFiles.jar and SortSam.jar from PICARD (http://picard.sourceforge.net).
DistMap?Download source code of DistMap from http://distmap.googlecode.com/files/DistMap_v1.0.tar.gz and run following command on terminal to uncompress the source code.
tar -xzvf DistMap_v1.0.tar.gz
Run following command to get DistMap command line parameters.
perl DistMap_v1.0/distmap --help
DistMap command line parametersUsage: perl DistMap_v1.0/distmap
--reference-fasta Reference fasta file full path. MANDATORY parameter.
--input Provide input fastq files, either as a pair or single Fastq file.
For Paired-end data --input 'read1.fastq,read2.fastq'
For Single-end data --input 'read.fastq'
Multiple inputs are possible by repeating --input parameter
--input 'sample1_1.fastq,sample1_2.fastq' --input 'sample2_1.fastq,sample2_2.fastq'
It is necessary to give input file(s) within
single or double quotes. Paired files must be comma
separated. MANDATORY parameter.
--output Full path of output folder where final output will be kept. MANDATORY parameter.
--only-process Step1: This option will 1) convert FASTQ files into 1 line format, 2) create a genome index
and 3) create a job archive and exit. OPTIONAL parameter
--only-hdfs-upload Step2: This option assumes that the data processing is complete. It uploads the reads and the archive created in the
data process step into HDFS file system. OPTIONAL parameter
--only-map Step3: This option assumes that data is already loaded into the HDFS and will
run map on the cluster.
--only-hdfs-download Step4: This option assumes that mapping has been completed on the cluster. Files will be downloaded
from the cluster to a local output directory. OPTIONAL parameter
--only-merge Step5: This option assumes that the data is already in the local directory.
Data will be merged to create a single SAM or BAM output file. OPTIONAL parameter
--only-delete-temp Step6: This is the last step of the pipeline. This option deletes the
mapping data from HDFS file system as well as from local temp directory. Useful to clean up after large mapping jobs!
OPTIONAL parameter
--hadoop-home Gives the full path of the hadoop folder. MANDATORY parameter.
The hadoop home path should be identical in master, secondary namenode and
all slaves (nodes).
Example: --hadoop-home /usr/local/hadoop
--mapper Mapper name [bwa,tophat,gsnap,bowtie,soap]. MANDATORY parameter.
Example: --mapper bwa
--mapper-path Mapper executable full path. MANDATORY parameter.
Example: --mapper-path /usr/local/hadoop/bwa
--gsnap-output-split GSNAP has a feature to split different types of mapping output into different
SAM files.
For detail input: gsnap --help, read --split--output.
--split-output=STRING Basename for multiple-file output, separately for nomapping,
halfmapping_uniq, halfmapping_mult, unpaired_uniq, unpaired_mult,
paired_uniq, paired_mult, concordant_uniq, and concordant_mult results (up to 9 files,
or 10 if --fails-as-input is selected, or 3 for single-end reads)
--picard-mergesamfiles-jar PICARD MergeSamFiles.jar full path. It will be used to merge all
SAM or BAM files. MANDATORY parameter.
Example: --picard-mergesamfiles-jar /usr/local/hadoop/picard-tools-1.56/MergeSamFiles.jar
--picard-sortsam-jar PICARD SortSam.jar full path. It will be used for SAM BAM conversion. MANDATORY parameter.
Example: --picard-sortsam-jar /usr/local/hadoop/picard-tools-1.56/SortSam.jar
--mapper-args Arguments for mapping:
BWA mapping for aln command:
Example --mapper-args "-o 1 -n 0.01 -l 200 -e 12 -d 12"
Note: Define BWA parameters correctly according to the version used here.
TopHat:
Example: --mapper-args "--mate-inner-dist 200 --max-multihits 40 --phred64-quals"
Note: Define TopHat parameters correctly according to the version used here.
GSNAP mapping:
Example: --mapper-args "--pairexpect 200 --quality-protocol illumina"
Note: Define gsnap parameters correctly according to the version used here.
For detail about parameters visit [http://research-pub.gene.com/gmap/]
bowtie mapping:
Example: --mapper-args "--sam"
Note: Define gsnap parameters correctly according to the version used here.
For detail about parameters visit [http://bowtie-bio.sourceforge.net/index.shtml]
SOAPAlinger:
Example: --mapper-args "-m 400 -x 600"
Note: Define SOAPaligner parameters correctly according to the version used here.
For detail about parameters visit [http://soap.genomics.org.cn/soapaligner.html]
Please note that processor parameters are not required in arguments. For example BWA mapping does not require -t parameter.
This parameter is given by DistMap internally.
--bwa-sampe-args Arguments for BWA sampe or samse module.
bwa sampe for paired-end reads
bwa samse for single-end reads
Example --bwa-sampe-args "-a 500 -s"
--output-format Output file format either SAM or BAM.
Default: BAM
--job-desc Give a job description which will be displayed in JobTracker webpage.
Default: <mapper name> mapping.
Hadoop streaming Parameters:
--hadoop-scheduler Give the scheduler name:
1) Fair Scheduler (need to configure. http://hadoop.apache.org/docs/stable/fair_scheduler.html)
2) FIFO (it is default comes with Hadoop)
3) Capacity Scheduler (need to configure. http://hadoop.apache.org/docs/stable/capacity_scheduler.html)
If your hadoop has Fair Scheduler then provide a pool name with the parameter --queue-name
If your hadoop has Capacity Scheduler then provide queue name with the parameter --queue-name.
Example: --hadoop-scheduler Fair
--hadoop-scheduler Capacity
--hadoop-scheduler FIFO
--queue-name If your hadoop has Capacity Scheduler then provide --queue-name.
Example: --queue-name pg1
--job-priority Give the job priority to run your job.
Possible options: [ VERY_LOW | LOW | NORMAL | HIGH | VERY_HIGH ]
Example: --job-priority VERY_HIGH
--verbose/--help Print inputs on screen.
Burrows-Wheeler Aligner (BWA) is an efficient program that aligns relatively short nucleotide sequences against a long reference sequence such as the available human reference sequence.
Step1: Download latest BWA source code from http://sourceforge.net/projects/bio-bwa/files/
Step2: Uncompress the downloaded file bwa-0.6.2.tar.bz2 with following command:
tar -xvjf bwa-0.6.2.tar.bz2
This command will return the folder/directory bwa-0.6.2
Step3: Enter into the uncompressed folder bwa-0.6.2 with following command:
cd /home/user/bwa-0.6.2
Step4: To make BWA executable run the sudo make command:
sudo make
After running the make command an executable called bwa will be created in /home/user/bwa-0.6.2 folder.
Give the full path for DistMap command parameter
perl DistMap_v_1.0/distmap --mapper-path /home/user/bwa-0.6.2/bwa
GSNAP (Genomic Short-read Nucleotide Alignment Program) is a mapper for RNA-seq data. It has the advantages that it can detect splicing events and is capable of SNP tolerant alignments (Wu & Nacu 2010).
Step1: Download the GSNAP source code from http://research-pub.gene.com/gmap/src/gmap-gsnap-2012-07-20.tar.gz
Step2: Uncompress the downloaded file “gmap-gsnap-2012-07-20.tar.gz” with following command:
tar -xzvf gmap-gsnap-2012-07-20.tar.gz
Step3: Enter into the uncompressed folder gmap-2012-07-20 with following command:
cd /home/user/gmap-2012-07-20
Step4: To install the gmap package run these four commands:
./configure --prefix=/home/user/gmap-2012-07-20
make
make check (optional)
make install
The above four commands will build GSNAP and other executables in /home/user/gmap-2012-07-20/bin
Step5: To check the GSNAP installations run this command:
/home/user/gmap-2012-07-20/bin/gsnap –help
This command will return the detailed help manual for various GSNAP parameters if the installation was successful.
Note: For more detailed information how to install GSNAP please read the gmap-2012-07-20/README file.
In /home/user/gmap-2012-07-20 folder you should get following executables:
/home/user/gmap-2012-07-20/bin/gsnap
/home/user/gmap-2012-07-20/bin/gmap_build
Give the full path for DistMap command parameter
perl DistMap_v_1.0/distmap --mapper-path /home/user/gmap-2012-07-20/bin/gsnap
TopHat mappingDownload TopHat binary from http://tophat.cbcb.umd.edu/downloads/tophat-2.0.6.OSX_x86_64.tar.gz for Macintosh or from http://tophat.cbcb.umd.edu/downloads/tophat-2.0.6.Linux_x86_64.tar.gz for Linux operating system.
Bowtie for TopHat mappingAs TopHat requires bowtie for mapping internally. Bowtie binaries can be found at http://sourceforge.net/projects/bowtie-bio/files/bowtie/0.12.9/ and copy following bowtie executables from /home/user/bowtie-0.12.9 folder into /home/user/tophat-2.0.6 folder.
cp /home/user/bowtie-0.12.9/bowtie /home/user/tophat-2.0.6/
cp /home/user/bowtie-0.12.9/bowtie-build /home/user/tophat-2.0.6/
cp /home/user/bowtie-0.12.9/bowtie-build-debug /home/user/tophat-2.0.6/
cp /home/user/bowtie-0.12.9/bowtie-debug /home/user/tophat-2.0.6/
cp /home/user/bowtie-0.12.9/bowtie-inspect /home/user/tophat-2.0.6/
cp /home/user/bowtie-0.12.9/bowtie-inspect-debug /home/user/tophat-2.0.6/
In /home/user/tophat-2.0.6 folder you should get following executables:
/home/user/tophat-2.0.6/prep_reads
/home/user/tophat-2.0.6/tophat
/home/user/tophat-2.0.6/bowtie
/home/user/tophat-2.0.6/bowtie-build
/home/user/tophat-2.0.6/bowtie-build-debug
/home/user/tophat-2.0.6/bowtie-debug
/home/user/tophat-2.0.6/bowtie-inspect
/home/user/tophat-2.0.6/bowtie-inspect-debug
Give the full path for DistMap command parameter
perl DistMap_v_1.0/distmap --mapper-path /home/user/tophat-2.0.6/tophat
Note: Since TopHat was developed on Python 2.6, all slaves should have Python 2.6. This uses the python package called getopt which is called in /home/user/tophat-2.0.6/tophat
Download Bowtie binaries from http://sourceforge.net/projects/bowtie-bio/files/bowtie/0.12.9/
In /home/user/bowtie-0.12.9 folder you should get following executables:
/home/user/bowtie-0.12.9/bowtie
/home/user/bowtie-0.12.9/bowtie-build
/home/user/bowtie-0.12.9/bowtie-build-debug
/home/user/bowtie-0.12.9/bowtie-debug
/home/user/bowtie-0.12.9/bowtie-inspect
/home/user/bowtie-0.12.9/bowtie-inspect-debug
Give the full path for DistMap command parameter
perl DistMap_v_1.0/distmap --mapper-path /home/user/bowtie-0.12.9/bowtie
Download Bowtie binaries from http://sourceforge.net/projects/bowtie-bio/files/bowtie2/2.0.6/
In /home/user/bowtie2-2.0.6 folder you should get following executables:
/home/user/bowtie2-2.0.6/bowtie2
/home/user/bowtie2-2.0.6/bowtie2-build
/home/user/bowtie2-2.0.6/bowtie2-build-debug
/home/user/bowtie2-2.0.6/bowtie2-debug
/home/user/bowtie2-2.0.6/bowtie2-inspect
/home/user/bowtie2-2.0.6/bowtie2-inspect-debug
Give the full path for DistMap command parameter
perl DistMap_v_1.0/distmap --mapper-path /home/user/bowtie2-2.0.6/bowtie2
SOAPaligner/soap2 is a short read mapper that allows gapped and un-gapped mapping of short reads.
Step1: Download latest SOAPaligner/soap2 source code from http:// soap.genomics.org.cn/down/SOAPaligner-v2.20-src.tar.gz
Step2: Uncompress the downloaded file SOAPaligner-v2.20-src.tar.gz with following command:
tar -xzvf SOAPaligner-v2.20-src.tar.gz
This command will return the folder/directory soap2.20 Step3: Enter into the uncompressed folder soap2.20 with following command:
cd /home/user/soap2.20
Step4: To make soap executable run the make command:
make
After running the make command, an executable called soap will be created in /home/user/soap2.20 folder.
Running SOAPaligner requires index files for the reference genome, reads can then be searched against the formatted index files.
Step1: Download latest SOAPaligner_builder source code from http://soap.genomics.org.cn/down/SOAPaligner-v2.20-src_builder.tar.gz
Step2: Uncompress the downloaded file SOAPaligner-v2.20-src_builder.tar.gz with following command:
tar –xzvf SOAPaligner-v2.20-src_builder.tar.gz
This command will return the folder/directory soap_builder
Step3: Enter in the uncompressed folder soap_builder with following command:
cd /home/user/soap_builder
Step4: To make soap executable run the make command:
make
After running the make command an executable called 2bwt-builder will be created into folder /home/user/soap_builder.
Step5: Copy 2bwt-builder file into /home/user/soap2.20 folder with following command:
cp /home/user/soap_builder/2bwt-builder /home/user/soap2.20/
This Perl script is required to convert soap output format into common SAM format.
Step1: Download latest soap2sam.pl source code from http:// soap.genomics.org.cn/down/soap2sam.tar.gz
Step2: Uncompress the downloaded file soap2sam.tar.gz with following command:
tar –xzvf soap2sam.tar.gz
This command will return the perl script soap2sam.pl
Step3: Copy soap2sam.pl file into /home/user/soap2.20 folder with following command:
cp soap2sam.pl /home/user/soap2.20/
In /home/user/soap2.20 folder you should get following executables:
/home/user/soap2.20/soap2
/home/user/soap2.20/2bwt-builder
/home/user/soap2.20/soap2sam.pl
Give the full path for DistMap command parameter
perl DistMap_v_1.0/distmap --mapper-path /home/user/soap2.20/soap2
Download binary of STAR from ftp://ftp2.cshl.edu/gingeraslab/tracks/STARrelease/2.2.0/
Give the full path for DistMap command parameter
perl DistMap_v_1.0/distmap --mapper-path /home/user/STAR_2.2.0c.Linux_x86_64/STAR
Bismark A bisulfite read mapper and methylation caller www.bioinformatics.babraham.ac.uk/projects/bismark/bismark_v0.7.7.tar.gz
Step1: Download the Bismark source code from www.bioinformatics.babraham.ac.uk/projects/bismark/bismark_v0.7.7.tar.gz
Step2: Uncompress the downloaded file “bismark_v0.7.7.tar.gz” with following command:
tar -xzvf bismark_v0.7.7.tar.gz
Step3: Enter the uncompressed folder “bismark_v0.7.7” with following command:
cd bismark_v0.7.7
As Bismark uses bowtie for mapping internally. Bowtie can be downloaded from http://sourceforge.net/projects/bowtie-bio/files/bowtie/0.12.9/, copy following bowtie executables from /home/user/bowtie-0.12.9 folder into /home/user/bismark_v0.7.7 folder.
cp /home/user/bowtie-0.12.9/bowtie /home/user/bismark_v0.7.7/
cp /home/user/bowtie-0.12.9/bowtie-build /home/user/bismark_v0.7.7/
cp /home/user/bowtie-0.12.9/bowtie-build-debug /home/user/bismark_v0.7.7/
cp /home/user/bowtie-0.12.9/bowtie-debug /home/user/bismark_v0.7.7/
cp /home/user/bowtie-0.12.9/bowtie-inspect /home/user/bismark_v0.7.7/
cp /home/user/bowtie-0.12.9/bowtie-inspect-debug /home/user/bismark_v0.7.7/
In /home/user/bismark_v0.7.7 folder you should get following executables:
/home/user/bismark_v0.7.7/bismark
/home/user/bismark_v0.7.7/bismark_genome_preparation
/home/user/bismark_v0.7.7/bismark_methylation_extractor
/home/user/bismark_v0.7.7/bowtie
/home/user/bismark_v0.7.7/bowtie-build
/home/user/bismark_v0.7.7/bowtie-build-debug
/home/user/bismark_v0.7.7/bowtie-debug
/home/user/bismark_v0.7.7/bowtie-inspect
/home/user/bismark_v0.7.7/bowtie-inspect-debug
Give the full path for DistMap command parameter
perl DistMap_v_1.0/distmap --mapper-path /home/user/bismark_v0.7.7/bismark
BSMAP is a short reads mapping software for bisulfite sequencing reads. Bisulfite treatment converts unmethylated Cytosines into Uracils (sequenced as Thymine) and leave methylated Cytosines unchanged, hence provides a way to study DNA cytosine methylation at single nucleotide resolution. BSMAP aligns the Ts in the reads to both Cs and Ts in the reference.
Step1: Download the BSMAP source code from http://bsmap.googlecode.com/files/bsmap-2.73.tgz
Step2: Uncompress the downloaded file “bsmap-2.73.tgz” with following command:
tar -xzvf bsmap-2.73.tgz
Step3: Enter the uncompressed folder “bsmap-2.73” with following command:
cd bsmap-2.73
Step4: To install the BSMAP package run these two commands:
sudo make
sudo make install
The above four commands will build BSMAP executable in “/usr/bin”
Step5: To check the BSMAP installations run this command:
bsmap –h
This command will return the detailed help manual for various BSMAP parameters if the installation was successful.
Note: For more detailed information how to run BSMAP, please read the bsmap-2.73/README.txt file.
Give the full path for DistMap command parameter
perl DistMap_v_1.0/distmap --mapper-path /home/user/bsmap-2.73/bsmap
View and moderate all "wiki Discussion" comments posted by this user
Mark all as spam, and block user from posting to "Wiki"
Originally posted by: mbel... (code.google.com)@gmail.com
It would be great to add sample data/commands here. The download links for sample data in the PDF manual are broken. Thanks!
View and moderate all "wiki Discussion" comments posted by this user
Mark all as spam, and block user from posting to "Wiki"
Originally posted by: nid... (code.google.com)@gmail.com
Fantastic documentation. thanks.
Nidhan