VoltMR ver 1.0 Manual
2016/10/25 for ver 1.0
Developer : Hiroki Ueda
Please contact ueda[at]genome.rcast.u-tokyo.ac.jp with any questions.
VoltMR is open source, released under the LGPLv3 license.
Preparation
1 Prepare the jar file. If you are running VoltMR on Hadoop 2.4, download the jar with dependencies (VoltMR1.0.x.withD.jar).
If you are running VoltMR with another version of Hadoop, download the jar without dependencies (VoltMR1.0.x.withoutD.jar) and the pom.xml file. All dependencies are listed in pom.xml. Edit the Hadoop version in pom.xml and build a jar with the correct dependencies. Alternatively, you can execute the jar without dependencies directly by adding all dependency jar files to the classpath.
2 Create index
For GRCh38+virus, use the prebuilt index:
https://sourceforge.net/projects/voltmr/files/prebuild_referenceset/GRCh38withVirus/
(download all files under the directory)
For another genome build, prepare the following files.
2.1 Prepare a 2bit reference file
You can download one from UCSC (e.g., http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/) or make a 2bit file
from a FASTA file using the faToTwoBit tool
(https://genome.ucsc.edu/goldenpath/help/twoBit.html).
2.2 Prepare a dbSNP VCF file, downloaded from the dbSNP site:
https://www.ncbi.nlm.nih.gov/SNP/
2.3 Prepare the fa.alt file generated by BWA-MEM, if using ALT-aware mapping.
2.4 Prepare a tab-delimited transcriptome coordinate file.
GRCh38 versions of the files for 2.2-2.4 are provided at https://sourceforge.net/projects/voltmr/files/reference/
as common_all_20151104.vcf, GRCh38_Virus.fa.alt, and knownGene.tab, respectively.
3 Upload index files
hdfs dfs -put /path/to/GRCh38withVirus/ /hdfs/path/to/ref/directory/
The program accesses this reference directory frequently. Increase the replication factor of these files so that every node holds a local copy. For example, our test environment has 7 nodes, so we set:
hdfs dfs -setrep -R 7 /hdfs/path/to/ref/directory
4 Upload fastq files
Upload the fastq files to HDFS as well.
hdfs dfs -put /path/to/fastq/directory /hdfs/path/to/some/fastq/directory
VoltMR assumes paired-end reads, so each input must consist of a read1 file and a read2 file. It also assumes that each fastq file holds 1 million to 4 million reads, which is the standard output of current Illumina pipelines. If your fastq files are not already split this way, split them into files of 1 to 4 million reads each.
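If a fastq file is larger than 4 million reads, one possible way to split it is sketched below. This is a minimal sketch, not part of VoltMR: the function name split_fastq, the output naming scheme (.partNNNN.fastq), and the default of 2 million reads per chunk are assumptions made for illustration. It also assumes uncompressed fastq with exactly 4 lines per read.

```python
from pathlib import Path


def split_fastq(src, out_dir, reads_per_chunk=2_000_000):
    """Split an uncompressed FASTQ file into chunks of at most reads_per_chunk reads.

    Assumes every record is exactly 4 lines (no wrapped sequence lines).
    The output file names are a hypothetical convention, not a VoltMR requirement.
    """
    src, out_dir = Path(src), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    lines_per_chunk = 4 * reads_per_chunk
    chunks = []
    out = None
    with src.open() as fh:
        for line_no, line in enumerate(fh):
            # Start a new chunk file every lines_per_chunk lines.
            if line_no % lines_per_chunk == 0:
                if out is not None:
                    out.close()
                path = out_dir / f"{src.stem}.part{len(chunks) + 1:04d}.fastq"
                out = path.open("w")
                chunks.append(path)
            out.write(line)
    if out is not None:
        out.close()
    return chunks
```

Run the function with the same reads_per_chunk on the read1 file and on the read2 file. Because both files list the reads in the same order, chunk N of read1 then pairs with chunk N of read2.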
5 Configure YARN
Configure YARN so that each mapper and reducer can use 15-20 GB of memory and 5 to 8 CPU cores by editing "mapred-site.xml" in the Hadoop directory. Below is an example configuration for "mapred-site.xml":
mapreduce.map.memory.mb 18432
mapreduce.reduce.memory.mb 18432
mapreduce.job.maps 21 (increase to an adequate number of mappers)
mapreduce.job.reduces 21 (increase to an adequate number of reducers)
mapreduce.map.java.opts -Xms10240m -Xmx17408m (the heap must be smaller than mapreduce.map.memory.mb)
mapreduce.reduce.java.opts -Xms10240m -Xmx17408m
mapreduce.map.cpu.vcores 5 (5 to 8, depending on the environment)
mapreduce.reduce.cpu.vcores 5 (5 to 8, depending on the environment)
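Hadoop site files store each setting as a property element with a name and a value inside a configuration element. The following sketch renders the example values above in that form. The helper name build_configuration is hypothetical, and the reducer heap setting is written with the standard Hadoop property name mapreduce.reduce.java.opts. Treat the output as a starting point to merge into your existing file, not as a complete mapred-site.xml.

```python
import xml.etree.ElementTree as ET

# Example values from this manual; tune them for your own cluster.
MAPRED_SETTINGS = {
    "mapreduce.map.memory.mb": "18432",
    "mapreduce.reduce.memory.mb": "18432",
    "mapreduce.job.maps": "21",
    "mapreduce.job.reduces": "21",
    "mapreduce.map.java.opts": "-Xms10240m -Xmx17408m",
    "mapreduce.reduce.java.opts": "-Xms10240m -Xmx17408m",
    "mapreduce.map.cpu.vcores": "5",
    "mapreduce.reduce.cpu.vcores": "5",
}


def build_configuration(settings):
    """Render name/value pairs as a Hadoop <configuration> XML string."""
    root = ET.Element("configuration")
    for name, value in settings.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    if hasattr(ET, "indent"):  # pretty-printing is available from Python 3.9
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_configuration(MAPRED_SETTINGS))
```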
Also configure yarn-site.xml so that the mappers and reducers can run in YARN containers.
Example:
yarn.nodemanager.resource.memory-mb 56320
yarn.scheduler.maximum-allocation-mb 56320
yarn.nodemanager.resource.cpu-vcores 15
The values above are the ones we used on a cluster of 7 nodes, each with a 16-core CPU and 64 GB of RAM. The optimal settings may vary between environments, but configure the mappers and reducers so that each can use 15-20 GB of memory and 5 to 8 CPU cores.
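The example numbers can be cross-checked with simple arithmetic: the number of containers a node can host is limited by whichever runs out first, memory or vcores. The sketch below is a simplification under stated assumptions. The function name containers_per_node is hypothetical, and the calculation ignores the container used by the MapReduce ApplicationMaster and any rounding applied by the YARN scheduler.

```python
def containers_per_node(node_memory_mb, node_vcores, container_memory_mb, container_vcores):
    """Return how many equally sized containers fit on one node."""
    by_memory = node_memory_mb // container_memory_mb
    by_vcores = node_vcores // container_vcores
    return min(by_memory, by_vcores)


# Values from the example settings in this manual.
per_node = containers_per_node(
    node_memory_mb=56320,       # yarn.nodemanager.resource.memory-mb
    node_vcores=15,             # yarn.nodemanager.resource.cpu-vcores
    container_memory_mb=18432,  # mapreduce.map.memory.mb
    container_vcores=5,         # mapreduce.map.cpu.vcores
)
cluster_total = 7 * per_node    # 7 nodes in the test cluster

print(per_node, cluster_total)  # prints "3 21"
```

With these values each node fits 3 containers, so 7 nodes give 21, which matches the value 21 used for mapreduce.job.maps and mapreduce.job.reduces in the example.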
Running VoltMR
Create a TSV file as follows and upload it to HDFS.
Example:
1 /hdfs/path/to/fq1 /hdfs/path/to/fq2 true sample01 rg1 lib1 sample1 illumina N
2 /hdfs/path/to/fq1 /hdfs/path/to/fq2 true sample01 rg1 lib1 sample1 illumina N
3 /hdfs/path/to/fq1 /hdfs/path/to/fq2 true sample01 rg1 lib1 sample1 illumina N
4 /hdfs/path/to/fq1 /hdfs/path/to/fq2 true sample01 rg1 lib1 sample1 illumina N
5 /hdfs/path/to/fq1 /hdfs/path/to/fq2 true sample01 rg1 lib1 sample1 illumina N
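For many fastq pairs, the list can be generated rather than typed by hand. The sketch below reproduces the column layout of the example above. It assumes that the first column is a running index starting at 1 and that columns are separated by tabs. The remaining fields are copied verbatim from the example because this manual does not describe their meaning; adjust them for your samples. The function name build_fqlist is hypothetical.

```python
import csv
import io

# Trailing columns copied verbatim from the example in this manual.
# Their meaning is not documented here, so treat them as placeholders.
EXAMPLE_TRAILING_FIELDS = ["true", "sample01", "rg1", "lib1", "sample1", "illumina", "N"]


def build_fqlist(pairs, trailing_fields=EXAMPLE_TRAILING_FIELDS):
    """Return tab-separated fqlist text, one row per (read1, read2) path pair."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    for index, (fq1, fq2) in enumerate(pairs, start=1):
        writer.writerow([index, fq1, fq2, *trailing_fields])
    return buf.getvalue()


if __name__ == "__main__":
    pairs = [
        ("/hdfs/path/to/fq1", "/hdfs/path/to/fq2"),
        ("/hdfs/path/to/fq1", "/hdfs/path/to/fq2"),
    ]
    print(build_fqlist(pairs), end="")
```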
DNA mapping (mapping, realignment, recalibration, sort, duplicate removal)
hadoop jar /localpath/to/VoltMR.xx.jar map -in /hdfs/path/to/input/fqlist -index /hdfs/path/to/ref/directory -out /hdfs/out/dir -t numThread
For ADAM output, add the -format adam option.
DNA mapping (mapping, realignment, recalibration, sort, duplicate removal) followed by pileup
hadoop jar /localpath/to/VoltMR.xx.jar mapToPileup -in /hdfs/path/to/input/fqlist -index /hdfs/path/to/ref/directory -out /hdfs/out/dir -t numThread
RNA mapping (mapping, realignment, sort)
hadoop jar /localpath/to/VoltMR.xx.jar mapRNA -in /hdfs/path/to/input/fqlist -index /hdfs/path/to/ref/directory -out /hdfs/out/dir -t numThread
For a non-Hadoop environment, you can run VoltMR locally. Prepare the fqlist file as illustrated above, but with local paths, and run:
java -jar VoltMR0.9.x.jar map_localmode -in /hdfs/path/to/input/fqlist -index /hdfs/path/to/ref/directory -out /hdfs/out/dir -t numThread
Note that local mode runs only the mapping job and produces an unsorted BAM file.