VoltMR Wiki

Pure Java NGS mapping software that runs on Hadoop 2.0

Brought to you by: hirokiueda

Voltmanual

VoltMR ver 1.0 Manual

2016/10/25 for ver 1.0

Developer : Hiroki Ueda
Please contact ueda[at]genome.rcast.u-tokyo.ac.jp with any questions.
Volt is open source, released under the LGPLv3 license.

Preparation

1 Prepare jar file. If you are running VoltMR on Hadoop 2.4, download the jar with dependencies (VoltMR1.0.x.withD.jar).

If you are running VoltMR with another version of Hadoop, download the jar without dependencies (VoltMR1.0.x.withoutD.jar) and the pom.xml file. All dependencies are listed in pom.xml: edit the Hadoop version there and build a jar with the right dependencies. Alternatively, you can execute VoltMR1.0.x.withoutDependency.jar by setting the classpath to include all dependency jar files.

2 Create index

For GRCh38+virus, use the prebuilt index:

https://sourceforge.net/projects/voltmr/files/prebuild_referenceset/GRCh38withVirus/
(download all files under the directory)

For another genome build, prepare the following files.

2.1 Prepare a 2bit reference file

You can download one from UCSC (e.g., http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/) or make a 2bit file
from a fasta file using the faToTwoBit tool
(https://genome.ucsc.edu/goldenpath/help/twoBit.html).

2.2
Prepare a dbSNP vcf file, downloaded from the dbSNP site.
https://www.ncbi.nlm.nih.gov/SNP/

2.3
Prepare the fa.alt file generated by BWA-MEM if you are using ALT-aware mapping.

2.4
Prepare a tab-delimited transcriptome coordinate file.

GRCh38 versions of the files for 2.2-2.4 are provided at https://sourceforge.net/projects/voltmr/files/reference/
as common_all_20151104.vcf, GRCh38_Virus.fa.alt, and knownGene.tab, respectively.

3 Upload index files

hdfs dfs -put /path/to/GRCh38withVirus/ /hdfs/path/to/ref/directory/

The program accesses this reference directory often. Increase the replication factor of these files so that each node holds a local copy. Since we have 7 nodes in our test environment, we used:

hdfs dfs -setrep -R 7 /hdfs/path/to/ref/directory

4 Upload fastq files

Also upload the fastq files to HDFS.

hdfs dfs -put /path/to/fastq/directory /hdfs/path/to/some/fastq/directory

VoltMR assumes paired-end reads, so each input must consist of a read1 file and a read2 file. It also assumes that each fastq file holds 1 million to 4 million reads, which is the standard output of today’s Illumina pipeline. If a fastq file is not already split into chunks of 1 to 4 million reads, please split it.
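Splitting an oversized fastq file into chunks of this size can be sketched in Python as follows. This is a minimal illustration and not part of VoltMR: the function name `split_fastq`, the `.partNNNN.fastq` naming scheme, and the default of 4 million reads per chunk are assumptions, and the sketch only handles uncompressed fastq with four lines per read.

```python
from pathlib import Path


def split_fastq(src, out_dir, reads_per_chunk=4_000_000):
    """Split an uncompressed FASTQ file into chunks of at most reads_per_chunk reads.

    Assumes the common four-lines-per-read layout. Returns the chunk paths in order.
    """
    src = Path(src)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    lines_per_chunk = 4 * reads_per_chunk
    chunks = []
    out = None
    line_count = 0
    with src.open() as handle:
        for line_count, line in enumerate(handle, start=1):
            # Open a new chunk file at every chunk boundary.
            if (line_count - 1) % lines_per_chunk == 0:
                if out is not None:
                    out.close()
                chunk = out_dir / f"{src.stem}.part{len(chunks) + 1:04d}.fastq"
                chunks.append(chunk)
                out = chunk.open("w")
            out.write(line)
    if out is not None:
        out.close()
    if line_count % 4:
        raise ValueError(f"{src}: line count {line_count} is not a multiple of 4")
    return chunks
```

Run it with the same `reads_per_chunk` on the read1 and the read2 file, so that chunk N of read1 still pairs with chunk N of read2.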

5 Setting Yarn

Configure YARN so that each mapper and reducer can use 15-20 GB of memory and 5 to 8 CPU cores, by editing “mapred-site.xml” in the Hadoop directory. Below is an example of the “mapred-site.xml” settings:

mapreduce.map.memory.mb 18432
mapreduce.reduce.memory.mb 18432
mapreduce.job.maps 21 (increase to an adequate number of mappers)
mapreduce.job.reduces 21 (increase to an adequate number of reducers)
mapreduce.map.java.opts -Xms10240m -Xmx17408m (-Xmx should be less than mapreduce.map.memory.mb)
mapreduce.reduce.java.opts -Xms10240m -Xmx17408m
mapreduce.map.cpu.vcores 5 (5 to 8 according to environment)
mapreduce.reduce.cpu.vcores 5 (5 to 8 according to environment)
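In mapred-site.xml itself, each setting is written as a `<property>` element with a `<name>` and a `<value>` inside `<configuration>`. The Python sketch below renders the example values in that form and checks that each -Xmx heap size stays below the matching container size. The helper names `max_heap_mb` and `render_mapred_site` are hypothetical, and the reduce-side option is written under its standard Hadoop name, `mapreduce.reduce.java.opts`.

```python
import re
import xml.etree.ElementTree as ET

# Example values from this manual; adjust them to your own cluster.
MAPRED_SETTINGS = {
    "mapreduce.map.memory.mb": "18432",
    "mapreduce.reduce.memory.mb": "18432",
    "mapreduce.job.maps": "21",
    "mapreduce.job.reduces": "21",
    "mapreduce.map.java.opts": "-Xms10240m -Xmx17408m",
    "mapreduce.reduce.java.opts": "-Xms10240m -Xmx17408m",
    "mapreduce.map.cpu.vcores": "5",
    "mapreduce.reduce.cpu.vcores": "5",
}


def max_heap_mb(java_opts):
    """Return the -Xmx value in MB; only the '<number>m' form is handled."""
    match = re.search(r"-Xmx(\d+)m\b", java_opts)
    if match is None:
        raise ValueError(f"no -Xmx<number>m found in {java_opts!r}")
    return int(match.group(1))


def render_mapred_site(settings):
    """Render settings as <configuration>/<property> XML text."""
    for task in ("map", "reduce"):
        heap = max_heap_mb(settings[f"mapreduce.{task}.java.opts"])
        container = int(settings[f"mapreduce.{task}.memory.mb"])
        if heap >= container:
            raise ValueError(
                f"{task}: -Xmx {heap} MB must be below the container size {container} MB"
            )
    root = ET.Element("configuration")
    for name, value in settings.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    if hasattr(ET, "indent"):  # pretty-print where available (Python 3.9+)
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(render_mapred_site(MAPRED_SETTINGS))
```

Merge the printed properties into your existing mapred-site.xml rather than overwriting the file, since it may already hold other site-specific settings.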

Also edit yarn-site.xml so that the mapper and reducer containers can run on YARN.

Example:

yarn.nodemanager.resource.memory-mb 56320
yarn.scheduler.maximum-allocation-mb 56320
yarn.nodemanager.resource.cpu-vcores 15

These are the values we used on a cluster of 7 nodes, each with a 16-core CPU and 64 GB of RAM. The optimal settings may vary between environments, but please configure the mappers and reducers so that each can use 15-20 GB of memory and 5 to 8 CPU cores.
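The example numbers fit together arithmetically: a node offering 56320 MB and 15 vcores can host three containers of 18432 MB and 5 vcores each, and 7 such nodes give 21 containers, which is the value used for mapreduce.job.maps and mapreduce.job.reduces. The Python sketch below reproduces that calculation for other cluster sizes. The function names are hypothetical, and the estimate is deliberately rough: it ignores the ApplicationMaster container and any rounding applied by the YARN scheduler.

```python
def containers_per_node(node_memory_mb, node_vcores, task_memory_mb, task_vcores):
    """Number of equally sized task containers that fit on one node."""
    by_memory = node_memory_mb // task_memory_mb
    by_cores = node_vcores // task_vcores
    # The scarcer of the two resources limits the container count.
    return min(by_memory, by_cores)


def cluster_task_slots(nodes, node_memory_mb, node_vcores, task_memory_mb, task_vcores):
    """Rough upper bound on concurrently running tasks across the cluster."""
    return nodes * containers_per_node(
        node_memory_mb, node_vcores, task_memory_mb, task_vcores
    )


if __name__ == "__main__":
    # Values from the example settings: 7 nodes, 56320 MB and 15 vcores per node,
    # 18432 MB and 5 vcores per mapper or reducer.
    print(cluster_task_slots(7, 56320, 15, 18432, 5))  # prints 21
```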

Running VoltMR

Prepare input fqlist file

VoltMR needs to know the location of each fastq file, which fastq files form a pair, and the read group of each pair.

Create a tsv file as follows and upload it to HDFS.

Example:

numberOfrow fq1 fq2 trimLowQualRead sampleID readgroupid library sample instrument TorN

1 /hdfs/path/to/fq1 /hdfs/path/to/fq2 true sample01 rg1 lib1 sample1 illumina N
2 /hdfs/path/to/fq1 /hdfs/path/to/fq2 true sample01 rg1 lib1 sample1 illumina N
3 /hdfs/path/to/fq1 /hdfs/path/to/fq2 true sample01 rg1 lib1 sample1 illumina N
4 /hdfs/path/to/fq1 /hdfs/path/to/fq2 true sample01 rg1 lib1 sample1 illumina N
5 /hdfs/path/to/fq1 /hdfs/path/to/fq2 true sample01 rg1 lib1 sample1 illumina N
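For many fastq pairs, the fqlist can be generated rather than typed by hand. The Python sketch below writes one tab-separated row per pair, using the column order of the example above. The function name `build_fqlist` and its parameters are hypothetical. The example does not make clear whether the header row belongs in the file itself, so that is left as an explicit `include_header` switch (off by default, which is an assumption).

```python
import csv
import io

# Column order taken from the example header row above.
COLUMNS = [
    "numberOfrow", "fq1", "fq2", "trimLowQualRead", "sampleID",
    "readgroupid", "library", "sample", "instrument", "TorN",
]


def build_fqlist(pairs, sample_id, read_group, library, sample, instrument,
                 t_or_n, trim_low_qual=True, include_header=False):
    """Return fqlist text with one tab-separated row per (fq1, fq2) pair."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    if include_header:
        writer.writerow(COLUMNS)
    for row_number, (fq1, fq2) in enumerate(pairs, start=1):
        writer.writerow([
            row_number, fq1, fq2, str(trim_low_qual).lower(),
            sample_id, read_group, library, sample, instrument, t_or_n,
        ])
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical HDFS paths, for illustration only.
    pairs = [
        ("/hdfs/fq/chunk1_R1.fastq", "/hdfs/fq/chunk1_R2.fastq"),
        ("/hdfs/fq/chunk2_R1.fastq", "/hdfs/fq/chunk2_R2.fastq"),
    ]
    print(build_fqlist(pairs, "sample01", "rg1", "lib1", "sample1", "illumina", "N"), end="")
```

Write the returned text to a file and upload it with hdfs dfs -put, as for the other inputs.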

DNA mapping (mapping, realignment, recal, sort, remove duplicates)

hadoop jar /localpath/to/VoltMR.xx.jar map -in /hdfs/path/to/input/fqlist -index /hdfs/path/to/ref/directory -out /hdfs/out/dir -t numThread

For ADAM output, use the -format adam option.

ADAM output is currently an experimental option for DNA mapping only; it may be extended to RNA mapping upon request.

DNA mapping (mapping, realignment, recal, sort, remove duplicates) and pileup

hadoop jar /localpath/to/VoltMR.xx.jar mapToPileup -in /hdfs/path/to/input/fqlist -index /hdfs/path/to/ref/directory -out /hdfs/out/dir -t numThread

RNA mapping (map, realignment, sort)

hadoop jar /localpath/to/VoltMR.xx.jar mapRNA -in /hdfs/path/to/input/fqlist -index /hdfs/path/to/ref/directory -out /hdfs/out/dir -t numThread
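The three Hadoop invocations differ only in the mode name (map, mapToPileup, mapRNA) and the optional ADAM switch, so they are easy to assemble from a script. The Python sketch below builds the argument list for one run. The function name `voltmr_command` is hypothetical, and the options are written with a plain ASCII hyphen on the assumption that this is what the command line expects. The function only builds the list and does not start a job.

```python
VOLTMR_MODES = ("map", "mapToPileup", "mapRNA")


def voltmr_command(jar, mode, fqlist, index_dir, out_dir, threads, adam=False):
    """Build the argument list for one 'hadoop jar' VoltMR run."""
    if mode not in VOLTMR_MODES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {VOLTMR_MODES}")
    if adam and mode == "mapRNA":
        # The manual describes ADAM output for DNA mapping only.
        raise ValueError("ADAM output is not available for RNA mapping")
    args = [
        "hadoop", "jar", jar, mode,
        "-in", fqlist,
        "-index", index_dir,
        "-out", out_dir,
        "-t", str(threads),
    ]
    if adam:
        args += ["-format", "adam"]
    return args


if __name__ == "__main__":
    cmd = voltmr_command(
        "/localpath/to/VoltMR.xx.jar", "map",
        "/hdfs/path/to/input/fqlist", "/hdfs/path/to/ref/directory",
        "/hdfs/out/dir", threads=5,
    )
    print(" ".join(cmd))
```

To launch the job from Python, pass the returned list to `subprocess.run(cmd, check=True)` on a machine where the hadoop command is on the PATH.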

Run VoltMR locally

For a non-Hadoop environment, you can run VoltMR locally. Prepare an fqlist file as illustrated above, but with local paths, and run:

java -jar VoltMR0.9.x.jar map_localmode -in /local/path/to/input/fqlist -index /local/path/to/ref/directory -out /local/out/dir -t numThread

Note that local mode performs only the mapping job and produces an unsorted BAM file.
