distmap Wiki

A toolkit for distributed short read mapping

Brought to you by: ramvinay

Manual

Authors: Anonymous

1. Introduction
- DistMap components
- Module1: Genome indexing:
- Module2: Data processing:
- Module3: Data upload:
- Module4: Data mapping:
- Module5: Data download:
- Module6: Post processing output:
- Module7: Data cleanup:
2. System Requirements
3. How to run DistMap?
- DistMap command line parameters
4 BWA mapping
5 GSNAP mapping
6 TopHat mapping
- Get Bowtie for TopHat mapping
7 Bowtie mapping
8 Bowtie2 mapping
9 SOAP mapping
- SOAPaligner/soap2 installation
- SOAPaligner_builder installation
- soap2sam.pl installation:
10 STAR mapping
11 Bismark mapping
- Get Bowtie for Bismark mapping
12 BSMAP mapping

1. Introduction

DistMap provides a scalable and portable framework to map short reads on a Hadoop cluster. It is a complete pipeline especially designed for non-experts. Currently, DistMap supports 9 mappers:

BWA (http://sourceforge.net/projects/bio-bwa/files/)
GSNAP (http://research-pub.gene.com/gmap/)
TopHat (http://tophat.cbcb.umd.edu/)
Bowtie (http://bowtie-bio.sourceforge.net/index.shtml)
Bowtie2 (http://sourceforge.net/projects/bowtie-bio/files/bowtie2/2.0.6/)
SOAP (http://soap.genomics.org.cn/soapaligner.html)
STAR (http://gingeraslab.cshl.edu/STAR/)
Bismark (http://www.bioinformatics.babraham.ac.uk/projects/bismark/)
BSMAP (http://code.google.com/p/bsmap/)

DistMap supports paired-end and single-end read mapping in FASTQ file format. It returns the final output as a SAM or BAM file.

`DistMap` components

The DistMap workflow consists of seven main modules. They can be executed end-to-end with a single command, or each module can be executed on its own by giving the appropriate command line flag.

Module1: Genome indexing:

The first module of DistMap creates genome indices and uploads them into the HDFS file system. Genome indices from previous DistMap runs are kept as tgz files, so the user can re-use them via the option --reference-index-archive on the DistMap command line. The user can execute Genome indexing as a stand-alone module by giving the flag --only-index on the DistMap command line.

Module2: Data processing:

The second module of DistMap takes all input parameters, FASTQ formatted reads and mapper executables from the local computer and creates a final archive. This archive is then uploaded to the cluster. The user can execute Data processing as a stand-alone module by giving the flag --only-process on the DistMap command line.

Module3: Data upload:

This module loads the processed reads and the archive created in module 2 into the HDFS file system. The user can execute Data upload as a stand-alone module by giving the flag --only-hdfs-upload on the DistMap command line.

Module4: Data mapping:

This module maps the reads to the reference genome on the cluster. The user can execute Data mapping as a stand-alone module by giving the flag --only-map on the DistMap command line.

Module5: Data download:

This module transfers the data from HDFS to the local computer after the mapping has completed. The user can execute Data download as a stand-alone module by giving the flag --only-hdfs-download on the DistMap command line.

Module6: Post processing output:

This module merges all outputs into a single SAM or BAM file in the local output directory. The user can execute Post processing output as a stand-alone module by giving the flag --only-merge on the DistMap command line.

Module7: Data cleanup:

The final module deletes all intermediate input and output files stored within HDFS and on the local computer. The Data cleanup module is optional and can be executed as a stand-alone module with the flag --only-delete-temp on the DistMap command line.
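As a rough illustration of how the stand-alone flags line up with the seven modules, the following shell sketch prints one DistMap command per stage without running anything. It assumes that the same common options are passed to every stage, which this manual does not state explicitly. The `<common options>` placeholder and the function name `print_stage_commands` are illustrative only.

```shell
#!/bin/sh
# Stand-alone flags in the order of the seven modules described above.
STAGE_FLAGS="--only-index --only-process --only-hdfs-upload --only-map --only-hdfs-download --only-merge --only-delete-temp"

# Print (do not run) one DistMap command per stage.
# $1 is a placeholder for the options shared by all stages (assumption).
print_stage_commands() {
    common_opts=$1
    for flag in $STAGE_FLAGS; do
        printf 'perl DistMap_v1.0/distmap %s %s\n' "$common_opts" "$flag"
    done
}

print_stage_commands "<common options>"
```

Printing the commands first makes it easy to review them before replacing the placeholder with real options and running each stage by hand.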

2. System Requirements

DistMap is implemented in Perl and runs on all Unix operating systems. It requires Perl 5.8 or higher, as well as MergeSamFiles.jar and SortSam.jar from PICARD (http://picard.sourceforge.net).

3. How to run `DistMap`?

Download the DistMap source code from http://distmap.googlecode.com/files/DistMap_v1.0.tar.gz and run the following command in a terminal to uncompress it.

tar -xzvf DistMap_v1.0.tar.gz

Run the following command to list the DistMap command line parameters.

perl DistMap_v1.0/distmap --help

`DistMap` command line parameters

Usage: perl DistMap_v1.0/distmap

--reference-fasta           Reference fasta file full path. MANDATORY parameter.

--input                     Provide input fastq files, either as a pair or single Fastq file.
                For Paired-end data --input 'read1.fastq,read2.fastq'
                For Single-end data --input 'read.fastq'
                Multiple inputs are possible by repeating --input parameter

                --input 'sample1_1.fastq,sample1_2.fastq' --input 'sample2_1.fastq,sample2_2.fastq'
                It is necessary to give input file(s) within
                single or double quotes. Paired files must be comma
                separated. MANDATORY parameter.

--output                    Full path of output folder where final output will be kept. MANDATORY parameter.

--only-process          Step1: This option will 1) convert FASTQ files into 1 line format, 2) create a genome index
                    and 3) create a job archive and exit. OPTIONAL parameter

--only-hdfs-upload              Step2: This option assumes that the data processing is complete. It uploads the reads and the archive created in the
                data process step into HDFS file system. OPTIONAL parameter

--only-map                  Step3: This option assumes that data is already loaded into the HDFS and will
                run map on the cluster.

--only-hdfs-download            Step4: This option assumes that mapping has been completed on the cluster. Files will be downloaded
                from the cluster to a local output directory. OPTIONAL parameter

--only-merge                Step5: This option assumes that the data is already in the local directory.
                Data will be merged to create a single SAM or BAM output file. OPTIONAL parameter

--only-delete-temp              Step6: This is the last step of the pipeline. This option deletes the
                mapping data from HDFS file system as well as from local temp directory. Useful to clean up after large mapping jobs!
                OPTIONAL parameter

--hadoop-home               Gives the full path of the hadoop folder. MANDATORY parameter.
                The hadoop home path should be identical in master, secondary namenode and
                all slaves (nodes).
                Example: --hadoop-home /usr/local/hadoop

--mapper                    Mapper name [bwa,tophat,gsnap,bowtie,soap]. MANDATORY parameter.
                Example: --mapper bwa

--mapper-path               Mapper executable full path. MANDATORY parameter.
                Example: --mapper-path /usr/local/hadoop/bwa

--gsnap-output-split        GSNAP has a feature to split different types of mapping output into different
                                SAM files.

                For details, run gsnap --help and read the --split-output entry:
                  --split-output=STRING   Basename for multiple-file output, separately for nomapping,
                                   halfmapping_uniq, halfmapping_mult, unpaired_uniq, unpaired_mult,
                                   paired_uniq, paired_mult, concordant_uniq, and concordant_mult results (up to 9 files,
                                   or 10 if --fails-as-input is selected, or 3 for single-end reads)

--picard-mergesamfiles-jar     PICARD MergeSamFiles.jar full path. It will be used to merge all
                SAM or BAM files. MANDATORY parameter.
                Example: --picard-mergesamfiles-jar /usr/local/hadoop/picard-tools-1.56/MergeSamFiles.jar

--picard-sortsam-jar        PICARD SortSam.jar full path. It will be used for SAM/BAM conversion. MANDATORY parameter.
                Example: --picard-sortsam-jar /usr/local/hadoop/picard-tools-1.56/SortSam.jar

--mapper-args               Arguments for mapping:
                BWA mapping for aln command:
                    Example --mapper-args "-o 1 -n 0.01 -l 200 -e 12 -d 12"
                    Note: Define BWA parameters correctly according to the version used here.
                TopHat:
                    Example: --mapper-args "--mate-inner-dist 200 --max-multihits 40 --phred64-quals"
                    Note: Define TopHat parameters correctly according to the version used here.
                GSNAP mapping:
                    Example: --mapper-args "--pairexpect 200 --quality-protocol illumina"
                    Note: Define gsnap parameters correctly according to the version used here.
                    For detail about parameters visit [http://research-pub.gene.com/gmap/]
                Bowtie mapping:
                    Example: --mapper-args "--sam"
                    Note: Define Bowtie parameters correctly according to the version used here.
                    For detail about parameters visit [http://bowtie-bio.sourceforge.net/index.shtml]
                SOAPaligner:
                    Example: --mapper-args "-m 400 -x 600"
                    Note: Define SOAPaligner parameters correctly according to the version used here.
                    For detail about parameters visit [http://soap.genomics.org.cn/soapaligner.html]
Please note that processor (thread) parameters are not required in the arguments. For example, BWA mapping does not require the -t parameter.
DistMap sets this parameter internally.

--bwa-sampe-args        Arguments for BWA sampe or samse module.
            bwa sampe for paired-end reads
            bwa samse for single-end reads
            Example --bwa-sampe-args "-a 500 -s"

--output-format         Output file format either SAM or BAM.
                Default: BAM

--job-desc          Give a job description which will be displayed in JobTracker webpage.
                Default: <mapper name> mapping.

Hadoop streaming parameters:

--hadoop-scheduler      Give the scheduler name:
                1) Fair Scheduler (needs to be configured; see http://hadoop.apache.org/docs/stable/fair_scheduler.html)
                2) FIFO (the default that comes with Hadoop)
                3) Capacity Scheduler (needs to be configured; see http://hadoop.apache.org/docs/stable/capacity_scheduler.html)
                If your Hadoop uses the Fair Scheduler, provide a pool name with the parameter --queue-name.
                If your Hadoop uses the Capacity Scheduler, provide a queue name with the parameter --queue-name.
                Example: --hadoop-scheduler Fair
                     --hadoop-scheduler Capacity
                     --hadoop-scheduler FIFO

--queue-name            Pool name (Fair Scheduler) or queue name (Capacity Scheduler).
                Example: --queue-name pg1

--job-priority          Give the job priority to run your job.
                Possible options: [ VERY_LOW | LOW | NORMAL | HIGH | VERY_HIGH ]
                Example: --job-priority VERY_HIGH

--verbose/--help                    Print inputs on screen.
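To show how the mandatory parameters fit together, here is a small shell sketch that prints a paired-end BWA command built only from options documented above. It prints the command instead of running it. The reference, read and output paths are placeholders invented for illustration; the Hadoop, BWA and PICARD paths are the ones used in the examples of this manual.

```shell
#!/bin/sh
# Placeholder paths: adjust them to your own installation and data.
REF=/home/user/reference.fasta      # hypothetical reference genome
OUT=/home/user/distmap_output       # hypothetical output folder
PICARD=/usr/local/hadoop/picard-tools-1.56

# Print a paired-end BWA command built from the documented options.
# $1 and $2 are the two FASTQ files of one read pair.
print_distmap_cmd() {
    printf '%s \\\n' \
        "perl DistMap_v1.0/distmap" \
        "  --reference-fasta $REF" \
        "  --input '$1,$2'" \
        "  --output $OUT" \
        "  --hadoop-home /usr/local/hadoop" \
        "  --mapper bwa" \
        "  --mapper-path /usr/local/hadoop/bwa" \
        "  --picard-mergesamfiles-jar $PICARD/MergeSamFiles.jar" \
        "  --picard-sortsam-jar $PICARD/SortSam.jar"
    printf '%s\n' "  --output-format BAM"
}

print_distmap_cmd read1.fastq read2.fastq
```

The printed text can be pasted into a terminal once the placeholder paths point to real files. Optional parameters such as --mapper-args or --job-desc can be added in the same way.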

4 BWA mapping

Burrows-Wheeler Aligner (BWA) is an efficient program that aligns relatively short nucleotide sequences against a long reference sequence such as the available human reference sequence.

Step1: Download the latest BWA source code from http://sourceforge.net/projects/bio-bwa/files/

Step2: Uncompress the downloaded file bwa-0.6.2.tar.bz2 with the following command:

tar -xvjf bwa-0.6.2.tar.bz2

This command will create the folder/directory bwa-0.6.2

Step3: Enter the uncompressed folder bwa-0.6.2 with the following command:

cd /home/user/bwa-0.6.2

Step4: To build the BWA executable, run the make command:

sudo make

After running the make command, an executable called bwa will be created in the /home/user/bwa-0.6.2 folder.

Give the full path of this executable to the DistMap --mapper-path parameter:

perl DistMap_v1.0/distmap --mapper-path /home/user/bwa-0.6.2/bwa

5 GSNAP mapping

GSNAP (Genomic Short-read Nucleotide Alignment Program) is a mapper for RNA-seq data. It has the advantages that it can detect splicing events and is capable of SNP tolerant alignments (Wu & Nacu 2010).

Step1: Download the GSNAP source code from http://research-pub.gene.com/gmap/src/gmap-gsnap-2012-07-20.tar.gz

Step2: Uncompress the downloaded file “gmap-gsnap-2012-07-20.tar.gz” with the following command:

tar -xzvf gmap-gsnap-2012-07-20.tar.gz

Step3: Enter the uncompressed folder gmap-2012-07-20 with the following command:

  cd /home/user/gmap-2012-07-20

Step4: To install the gmap package run these four commands:

./configure --prefix=/home/user/gmap-2012-07-20
make
make check   (optional)
make install

The above four commands will build GSNAP and other executables in /home/user/gmap-2012-07-20/bin

Step5: To check the GSNAP installation, run this command:

/home/user/gmap-2012-07-20/bin/gsnap --help

This command will return the detailed help manual for various GSNAP parameters if the installation was successful.

Note: For more detailed information on how to install GSNAP, please read the gmap-2012-07-20/README file.

In the /home/user/gmap-2012-07-20/bin folder you should find the following executables:

/home/user/gmap-2012-07-20/bin/gsnap
/home/user/gmap-2012-07-20/bin/gmap_build

Give the full path of the gsnap executable to the DistMap --mapper-path parameter:

perl DistMap_v1.0/distmap --mapper-path /home/user/gmap-2012-07-20/bin/gsnap

6 `TopHat` mapping

Download the TopHat binary from http://tophat.cbcb.umd.edu/downloads/tophat-2.0.6.OSX_x86_64.tar.gz for Macintosh or from http://tophat.cbcb.umd.edu/downloads/tophat-2.0.6.Linux_x86_64.tar.gz for the Linux operating system.

Get `Bowtie for TopHat` mapping

TopHat requires Bowtie internally for mapping. Download the Bowtie binaries from http://sourceforge.net/projects/bowtie-bio/files/bowtie/0.12.9/ and copy the following Bowtie executables from the /home/user/bowtie-0.12.9 folder into the /home/user/tophat-2.0.6 folder:

cp /home/user/bowtie-0.12.9/bowtie /home/user/tophat-2.0.6/
cp /home/user/bowtie-0.12.9/bowtie-build /home/user/tophat-2.0.6/
cp /home/user/bowtie-0.12.9/bowtie-build-debug /home/user/tophat-2.0.6/
cp /home/user/bowtie-0.12.9/bowtie-debug /home/user/tophat-2.0.6/
cp /home/user/bowtie-0.12.9/bowtie-inspect /home/user/tophat-2.0.6/
cp /home/user/bowtie-0.12.9/bowtie-inspect-debug /home/user/tophat-2.0.6/
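The six copy commands above can also be written as a loop. The sketch below is one possible way to do it; the function name `copy_bowtie_tools` is made up for this example, and the function stops at the first executable that is missing from the source folder.

```shell
#!/bin/sh
# Names of the Bowtie executables listed above.
BOWTIE_TOOLS="bowtie bowtie-build bowtie-build-debug bowtie-debug bowtie-inspect bowtie-inspect-debug"

# Copy every Bowtie executable from folder $1 into folder $2.
# Returns 1 as soon as a file is missing or a copy fails.
copy_bowtie_tools() {
    src=$1
    dest=$2
    for tool in $BOWTIE_TOOLS; do
        if [ ! -f "$src/$tool" ]; then
            echo "missing: $src/$tool" >&2
            return 1
        fi
        cp "$src/$tool" "$dest/" || return 1
    done
}

# Example call with the folders used in this manual:
# copy_bowtie_tools /home/user/bowtie-0.12.9 /home/user/tophat-2.0.6
```

The same function can be reused with another destination folder, for example the Bismark folder, since the list of Bowtie executables is identical.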

In the /home/user/tophat-2.0.6 folder you should now find the following executables:

/home/user/tophat-2.0.6/prep_reads
/home/user/tophat-2.0.6/tophat

/home/user/tophat-2.0.6/bowtie
/home/user/tophat-2.0.6/bowtie-build
/home/user/tophat-2.0.6/bowtie-build-debug
/home/user/tophat-2.0.6/bowtie-debug
/home/user/tophat-2.0.6/bowtie-inspect
/home/user/tophat-2.0.6/bowtie-inspect-debug

Give the full path of the tophat executable to the DistMap --mapper-path parameter:

perl DistMap_v1.0/distmap --mapper-path /home/user/tophat-2.0.6/tophat

Note: TopHat was developed with Python 2.6, so all slaves should have Python 2.6 installed. The /home/user/tophat-2.0.6/tophat script uses the Python module getopt.

7 Bowtie mapping

Download the Bowtie binaries from http://sourceforge.net/projects/bowtie-bio/files/bowtie/0.12.9/

In the /home/user/bowtie-0.12.9 folder you should find the following executables:

/home/user/bowtie-0.12.9/bowtie
/home/user/bowtie-0.12.9/bowtie-build
/home/user/bowtie-0.12.9/bowtie-build-debug
/home/user/bowtie-0.12.9/bowtie-debug
/home/user/bowtie-0.12.9/bowtie-inspect
/home/user/bowtie-0.12.9/bowtie-inspect-debug

Give the full path of the bowtie executable to the DistMap --mapper-path parameter:

perl DistMap_v1.0/distmap --mapper-path /home/user/bowtie-0.12.9/bowtie

8 Bowtie2 mapping

Download the Bowtie2 binaries from http://sourceforge.net/projects/bowtie-bio/files/bowtie2/2.0.6/

In the /home/user/bowtie2-2.0.6 folder you should find the following executables:

/home/user/bowtie2-2.0.6/bowtie2
/home/user/bowtie2-2.0.6/bowtie2-build
/home/user/bowtie2-2.0.6/bowtie2-build-debug
/home/user/bowtie2-2.0.6/bowtie2-debug
/home/user/bowtie2-2.0.6/bowtie2-inspect
/home/user/bowtie2-2.0.6/bowtie2-inspect-debug

Give the full path of the bowtie2 executable to the DistMap --mapper-path parameter:

perl DistMap_v1.0/distmap --mapper-path /home/user/bowtie2-2.0.6/bowtie2

9 SOAP mapping

SOAPaligner/soap2 installation

SOAPaligner/soap2 is a short read mapper that allows gapped and un-gapped mapping of short reads.

Step1: Download the latest SOAPaligner/soap2 source code from http://soap.genomics.org.cn/down/SOAPaligner-v2.20-src.tar.gz

Step2: Uncompress the downloaded file SOAPaligner-v2.20-src.tar.gz with the following command:

tar -xzvf SOAPaligner-v2.20-src.tar.gz

This command will create the folder/directory soap2.20

Step3: Enter the uncompressed folder soap2.20 with the following command:

    cd /home/user/soap2.20

Step4: To build the soap executable, run the make command:

    make

After running the make command, an executable called soap will be created in /home/user/soap2.20 folder.

SOAPaligner_builder installation

Running SOAPaligner requires index files for the reference genome. Reads can then be searched against the formatted index files.

Step1: Download the latest SOAPaligner_builder source code from http://soap.genomics.org.cn/down/SOAPaligner-v2.20-src_builder.tar.gz

Step2: Uncompress the downloaded file SOAPaligner-v2.20-src_builder.tar.gz with the following command:

tar -xzvf SOAPaligner-v2.20-src_builder.tar.gz

This command will create the folder/directory soap_builder

Step3: Enter the uncompressed folder soap_builder with the following command:

    cd /home/user/soap_builder

Step4: To build the 2bwt-builder executable, run the make command:

 make

After running the make command, an executable called 2bwt-builder will be created in the folder /home/user/soap_builder.

Step5: Copy the 2bwt-builder file into the /home/user/soap2.20 folder with the following command:

cp /home/user/soap_builder/2bwt-builder /home/user/soap2.20/

soap2sam.pl installation:

This Perl script is required to convert the SOAP output format into the common SAM format.

Step1: Download the latest soap2sam.pl source code from http://soap.genomics.org.cn/down/soap2sam.tar.gz

Step2: Uncompress the downloaded file soap2sam.tar.gz with the following command:

tar -xzvf soap2sam.tar.gz

This command will extract the Perl script soap2sam.pl

Step3: Copy the soap2sam.pl file into the /home/user/soap2.20 folder with the following command:

cp soap2sam.pl /home/user/soap2.20/

In the /home/user/soap2.20 folder you should now find the following executables:

/home/user/soap2.20/soap2
/home/user/soap2.20/2bwt-builder
/home/user/soap2.20/soap2sam.pl
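Because the SOAP setup collects files from three separate downloads, it can help to check the folder before starting a job. The following shell sketch lists any expected file that is absent; the function name `list_missing_files` is hypothetical and not part of DistMap or SOAP.

```shell
#!/bin/sh
# Print each file name given after $1 that is not present in folder $1.
# Prints nothing when all files are found.
list_missing_files() {
    dir=$1
    shift
    for name in "$@"; do
        [ -e "$dir/$name" ] || echo "$name"
    done
}

# Example call with the file names listed above:
# list_missing_files /home/user/soap2.20 soap2 2bwt-builder soap2sam.pl
```

An empty result means that all three files are in place. The same check can be used for the other mapper folders by passing their file lists.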

Give the full path of the soap2 executable to the DistMap --mapper-path parameter:

perl DistMap_v1.0/distmap --mapper-path /home/user/soap2.20/soap2

10 STAR mapping

Download the STAR binary from ftp://ftp2.cshl.edu/gingeraslab/tracks/STARrelease/2.2.0/

Give the full path of the STAR executable to the DistMap --mapper-path parameter:

perl DistMap_v_1.0/distmap --mapper-path /home/user/STAR_2.2.0c.Linux_x86_64/STAR

11 Bismark mapping

Bismark is a bisulfite read mapper and methylation caller.

Step1: Download the Bismark source code from www.bioinformatics.babraham.ac.uk/projects/bismark/bismark_v0.7.7.tar.gz

Step2: Uncompress the downloaded file “bismark_v0.7.7.tar.gz” with the following command:

tar -xzvf bismark_v0.7.7.tar.gz

Step3: Enter the uncompressed folder “bismark_v0.7.7” with the following command:

cd bismark_v0.7.7

Get Bowtie for Bismark mapping

Bismark uses Bowtie internally for mapping. Download Bowtie from http://sourceforge.net/projects/bowtie-bio/files/bowtie/0.12.9/ and copy the following Bowtie executables from the /home/user/bowtie-0.12.9 folder into the /home/user/bismark_v0.7.7 folder:

cp /home/user/bowtie-0.12.9/bowtie /home/user/bismark_v0.7.7/
cp /home/user/bowtie-0.12.9/bowtie-build /home/user/bismark_v0.7.7/
cp /home/user/bowtie-0.12.9/bowtie-build-debug /home/user/bismark_v0.7.7/
cp /home/user/bowtie-0.12.9/bowtie-debug /home/user/bismark_v0.7.7/
cp /home/user/bowtie-0.12.9/bowtie-inspect /home/user/bismark_v0.7.7/
cp /home/user/bowtie-0.12.9/bowtie-inspect-debug /home/user/bismark_v0.7.7/

In the /home/user/bismark_v0.7.7 folder you should now find the following executables:

/home/user/bismark_v0.7.7/bismark
/home/user/bismark_v0.7.7/bismark_genome_preparation
/home/user/bismark_v0.7.7/bismark_methylation_extractor
/home/user/bismark_v0.7.7/bowtie
/home/user/bismark_v0.7.7/bowtie-build
/home/user/bismark_v0.7.7/bowtie-build-debug
/home/user/bismark_v0.7.7/bowtie-debug
/home/user/bismark_v0.7.7/bowtie-inspect
/home/user/bismark_v0.7.7/bowtie-inspect-debug

Give the full path of the bismark executable to the DistMap --mapper-path parameter:

perl DistMap_v1.0/distmap --mapper-path /home/user/bismark_v0.7.7/bismark

12 BSMAP mapping

BSMAP is short-read mapping software for bisulfite sequencing reads. Bisulfite treatment converts unmethylated cytosines into uracils (sequenced as thymine) and leaves methylated cytosines unchanged, and hence provides a way to study DNA cytosine methylation at single-nucleotide resolution. BSMAP aligns the Ts in the reads to both Cs and Ts in the reference.

Step1: Download the BSMAP source code from http://bsmap.googlecode.com/files/bsmap-2.73.tgz

Step2: Uncompress the downloaded file “bsmap-2.73.tgz” with the following command:

tar -xzvf bsmap-2.73.tgz

Step3: Enter the uncompressed folder “bsmap-2.73” with the following command:

cd bsmap-2.73

Step4: To install the BSMAP package run these two commands:

sudo make
sudo make install

The above two commands will build the BSMAP executable and install it in “/usr/bin”

Step5: To check the BSMAP installation, run this command:

bsmap -h

This command will return the detailed help manual for various BSMAP parameters if the installation was successful.

Note: For more detailed information on how to run BSMAP, please read the bsmap-2.73/README.txt file.

Give the full path of the bsmap executable to the DistMap --mapper-path parameter:

perl DistMap_v1.0/distmap --mapper-path /home/user/bsmap-2.73/bsmap
