Vecuum is a Java-based variant caller designed to detect contamination-induced point mutations in hybrid-capture-based genome sequencing data (e.g. WGS, WES, targeted capture). Vecuum is specialized for identifying false variants caused by recombinant vector contamination, but it can also be applied to detect spurious calls caused by other external contaminants such as xenogeneic genomes, cDNA libraries, or even pseudogenes.
The important features of Vecuum are:
Get the most recent version of Vecuum here:
Vecuum was developed with the 64-bit Java JDK 7. Running Vecuum requires Java Runtime Environment (JRE) version 1.7.x or later. Download the most recent version of Vecuum from the code page and extract the gzipped archive.
tar -zxvf Vecuum-1.x.y.tar.gz
It contains the following:
To run Vecuum, simply run the executable JAR file as shown below:
java -jar Vecuum.jar
The -h or -? option prints the following usage message. If you see it, you are ready to run Vecuum.
Vecuum: Contamination-induced false variant caller
Version: 1.0.1
Usage: java -jar Vecuum.jar -r <reference_genome.fa> -b <sample_sorted_indexed.bam> -B <bwa_path> -S <samtools_path> [options]
Input options:
-r PATH path for indexed reference genome FASTA [.fa/.fasta]
-b PATH path for indexed bam file [.bam]
-B PATH path for bwa (e.g. /usr/bin/bwa)
-S PATH path for samtools (e.g. /usr/bin/samtools)
-t PATH path for indexed transcriptome FASTA [.fa/.fasta]
-g PATH path for reference genome GTF file [.gtf]
-u PATH path for unmapped bam file [.bam]
-R PATH path for region of interest list [tab-delimited text (chr start end gene_name)]
-V PATH path for suspicious variant list [tab-delimited text (chr pos)]
-a STR Assembly version of reference genome [hg19]
Analysis parameters for vector contamination assessment:
-n INT Number of threads for bwa-mem [1]
-p INT padding size [5]
-Q INT mapQ threshold [30]
-l INT minMatchLen [20]
-s INT minVectorInsertSupportCnt [3]
-i INT minVectorInsertSize [500]
-I INT maxVectorInsertSize [500000]
-k Skip vector checking step [must be used with -R or -A option]
Analysis parameters for false variant detection:
-A Searching false variants for all exons regardless of vector contamination
-q INT baseQ threshold [20]
-c INT minVectorBalleleCnt [3]
-C INT maxSampleBalleleCnt [3]
-f FLOAT minVectorBAF [0.01]
-F FLOAT maxSampleBAF [0.01]
-d INT minSampleDepth [5]
Output options:
-o PATH path for output directory [bam file directory]
-O Report vector filtered bam file
If you have a list of genes that are plausible regions of contamination, we recommend using the -R and -k options to reduce the computation time.
java -jar Vecuum.jar -r <reference_genome.fa> -b <sample_sorted_indexed.bam> -B <bwa_path> -S <samtools_path> -R <tab-delimited gene list (chr start end gene_name)> -k
If you want to detect all false variants across the entire genome, we highly recommend using the -A and -k options rather than the default run.
java -jar Vecuum.jar -r <reference_genome.fa> -b <sample_sorted_indexed.bam> -B <bwa_path> -S <samtools_path> -A -k
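As a minimal sketch, the two invocations above can be assembled in a small shell script. Every path below is a placeholder chosen for illustration (not a file shipped with Vecuum), and the script only prints the commands so they can be reviewed before being run.

```shell
#!/bin/sh
# Placeholder paths (assumptions): replace with your own files and executables.
# Paths containing spaces are not handled by this simple sketch.
REF=/data/ref/hg19.fa
BAM=/data/sample/sample.sorted.bam
BWA=/usr/bin/bwa
SAMTOOLS=/usr/bin/samtools
ROI=/data/sample/genes_of_interest.txt   # chr start end gene_name, tab-delimited

# Options shared by both modes: the mandatory inputs.
COMMON="-r $REF -b $BAM -B $BWA -S $SAMTOOLS"

# Mode 1: restrict false variant detection to the listed regions, skip vector checking.
ROI_CMD="java -jar Vecuum.jar $COMMON -R $ROI -k"

# Mode 2: search all exons, skip vector checking.
ALL_CMD="java -jar Vecuum.jar $COMMON -A -k"

# Print the commands for review instead of executing them.
echo "$ROI_CMD"
echo "$ALL_CMD"
```

Once the printed line looks right, paste it into the shell from the directory that contains Vecuum.jar.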
There are four mandatory inputs for Vecuum.
Input | Option | Description |
---|---|---|
Reference sequence | -r | FASTA formatted reference sequence file. The reference must be BWA indexed. |
Sample sequence data | -b | BAM formatted alignment file for sample. The BAM file must be (coordinate) sorted and indexed. |
BWA executable file | -B | BWA executable file path (absolute path). |
SAMtools executable file | -S | SAMtools executable file path (absolute path). |
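The indexing requirements in the table can be checked before launching a run. The sketch below is not part of Vecuum; `check_inputs` is a hypothetical helper. It assumes the BWA index was built with the default prefix (so `bwa index` left its `.amb`, `.ann`, `.bwt`, `.pac` and `.sa` files next to the FASTA) and that the BAM index is named either `sample.bam.bai` or `sample.bai`.

```shell
#!/bin/sh
# check_inputs REF BAM
# Reports missing inputs or index files; returns non-zero if anything is missing.
check_inputs() {
    ref=$1
    bam=$2
    missing=0

    # The FASTA and BAM files themselves must exist.
    for f in "$ref" "$bam"; do
        if [ ! -f "$f" ]; then
            echo "missing file: $f"
            missing=1
        fi
    done

    # "bwa index" writes these five files alongside the FASTA.
    for ext in amb ann bwt pac sa; do
        if [ ! -f "$ref.$ext" ]; then
            echo "missing BWA index file: $ref.$ext"
            missing=1
        fi
    done

    # "samtools index" writes sample.bam.bai; some tools write sample.bai instead.
    if [ ! -f "$bam.bai" ] && [ ! -f "${bam%.bam}.bai" ]; then
        echo "missing BAM index for: $bam"
        missing=1
    fi

    return "$missing"
}
```

A typical call would be `check_inputs hg19.fa sample.sorted.bam || echo "fix the inputs before running Vecuum"`. The check does not verify that the BAM is coordinate sorted.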
Several optional inputs can be given to Vecuum to make the analysis more accurate.
Option | Default Value | Description |
---|---|---|
-t | none (provided in the database folder) | FASTA formatted transcriptome file for the detection of ve-reads. The FASTA file must be indexed. |
-g | none (provided in the database folder) | GTF file for exon information |
-u | none | Path for the unmapped BAM file (if unmapped reads are stored separately, e.g. 1KG data) |
-R | none | Tab-delimited text file listing regions of interest (chr start end gene_name). False variant detection is always performed for these regions, regardless of the result of the vector contamination assessment. |
-V | none | Tab-delimited text file listing suspicious variants (chr pos). Statistics are reported for these variants regardless of the result of false variant detection. |
-a | hg19 | Assembly version of used reference genome (hg19/GRCh37, hg38/GRCh38) |
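The -R and -V files are plain tab-delimited text. The following sketch writes one of each and checks the column counts. The coordinates and gene names are made-up placeholders, `check_columns` is a hypothetical helper, and the sketch assumes the files carry no header line and that chromosome names match those used in the BAM file.

```shell
#!/bin/sh
# Region-of-interest list for -R: chr, start, end, gene_name (tab-delimited).
# Coordinates and gene names are placeholders, not real annotations.
printf 'chr1\t1000\t2000\tGENE_A\n'  > roi.txt
printf 'chr2\t5000\t7500\tGENE_B\n' >> roi.txt

# Suspicious variant list for -V: chr, pos (tab-delimited).
printf 'chr1\t1500\n'  > variants.txt
printf 'chr2\t6000\n' >> variants.txt

# check_columns FILE N
# Succeeds only if every line has exactly N tab-separated fields.
check_columns() {
    awk -F '\t' -v n="$2" 'NF != n { bad = 1 } END { exit bad }' "$1"
}

check_columns roi.txt 4 && echo "roi.txt OK"
check_columns variants.txt 2 && echo "variants.txt OK"
```

The resulting files would then be passed as `-R roi.txt` and `-V variants.txt`.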
Analysis parameters for vector contamination assessment:
Option | Default Value | Description |
---|---|---|
-n | 1 | Number of threads for BWA-mem (used for unmapped reads remapping) |
-p | 5 | Padding size from exon junctions for the classification of sample-originated reads. |
-Q | 30 | mapQ threshold |
-l | 20 | Minimum match length for the detection of reads containing vector sequence [minMatchLen] |
-s | 3 | Minimum number of vr-reads for the estimation of vector-contaminated regions [minVectorInsertSupportCnt] |
-i | 500 | Minimum vector insert size [minVectorInsertSize] |
-I | 500000 | Maximum vector insert size [maxVectorInsertSize] |
-k | | Skip the vector contamination assessment step [must be used with the -R or -A option] |
Analysis parameters for false variant calling:
Option | Default Value | Description |
---|---|---|
-A | | Search false variants for all exons (do not limit the search space to the vector-contaminated regions) |
-q | 20 | baseQ threshold |
-c | 3 | Minimum vector B allele count [minVectorBalleleCnt] |
-C | 3 | Maximum sample B allele count [maxSampleBalleleCnt] |
-f | 0.01 | Minimum vector B allele frequency [minVectorBAF] |
-F | 0.01 | Maximum sample B allele frequency [maxSampleBAF] |
-d | 5 | Minimum sample depth [minSampleDepth] |
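To make the thresholds concrete, here is one possible reading of how they could combine, applied to a hypothetical per-site table. This is an illustration only: the column layout is invented for the example (it is not a Vecuum file format), `filter_sites` is a hypothetical helper, and treating the thresholds as inclusive and joined with AND is an assumption, since Vecuum's actual decision logic is not described here.

```shell
#!/bin/sh
# Hypothetical per-site table (NOT a Vecuum output format), tab-delimited:
# chr  pos  vector_b_count  vector_baf  sample_b_count  sample_baf  sample_depth
printf 'chr1\t1500\t12\t0.40\t0\t0.00\t30\n'  > sites.txt   # supported by vector reads only
printf 'chr1\t1800\t10\t0.35\t9\t0.30\t30\n' >> sites.txt   # also seen in sample reads
printf 'chr2\t6000\t2\t0.05\t0\t0.00\t30\n'  >> sites.txt   # too few vector reads
printf 'chr2\t6100\t8\t0.50\t0\t0.00\t3\n'   >> sites.txt   # sample depth too low

# filter_sites FILE
# Prints chr and pos for sites passing every default threshold
# (assumed inclusive and combined with AND).
filter_sites() {
    awk -F '\t' -v c=3 -v C=3 -v f=0.01 -v F=0.01 -v d=5 '
        $3 >= c && $5 <= C && $4 >= f && $6 <= F && $7 >= d { print $1 "\t" $2 }
    ' "$1"
}

filter_sites sites.txt   # prints only the chr1 1500 site
```

Under this reading, a site is flagged as a false variant when the non-reference allele is well supported by vector-originated reads, is essentially absent from sample-originated reads, and the sample depth is high enough to judge that absence.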
Output options:
Option | Default Value | Description |
---|---|---|
-o | Input bam file directory | Path for output directory |
-O | | Report vector filtered bam file |
Vecuum outputs the following files: