Junho Kim

Vecuum

Vecuum is a Java-based variant caller designed for detecting contamination-induced point mutations from hybrid-capture-based genome sequencing data (e.g. WGS, WES, targeted capture). Vecuum is specialized for identifying false variants caused by recombinant vector contamination, but it can also be applied to detect spurious calls caused by other external contaminants such as xenogeneic genomes, cDNA libraries, or even pseudogenes.
The important features of Vecuum are:

  • Estimation of the genomic location of vector-contaminated regions
  • Identification of false variants originating from vector inserts

Get the most recent version of Vecuum here:



Update note (Last updated 02-08-2016)

Vecuum-1.0.1 (02-08-2016) *Minor Update

Improvements:
  1. GRCh38 is now supported with the '-a' option.
  2. The supported versions of the prerequisite software have been updated. Previous versions of SAMtools (0.1.x) can still be used if the path contains version information (e.g. -S /utils/samtools-0.1.19/samtools).
  3. A few minor bugs have been fixed, including output file naming, reporting of transcript IDs rather than gene IDs, and typos in result files.
  4. Added a "Recommendation" section for first-time users. We highly recommend using the '-A -k' options to analyze the entire genome rather than the default settings. A few false variants can be observed regardless of contamination, due to processed pseudogenes.

Prerequisite software

Installation of Vecuum

Vecuum was developed using Java JDK 7 (64-bit). To run Vecuum, Java Runtime Environment (JRE) version 1.7.x or later is required. Download the most recent version of Vecuum from the code page and extract the gzipped archive.

tar -zxvf Vecuum-1.x.y.tar.gz

The archive contains the following:

  1. Vecuum.jar
    : Executable JAR file of Vecuum
  2. README.txt
    : Nothing much in this file currently. Please refer to this page instead.
  3. lib/
    : directory containing other JAR libraries for Vecuum
  4. database/
    : directory containing database files for Vecuum

Running Vecuum

To run Vecuum, simply run the executable JAR file as shown below:

java -jar Vecuum.jar

The -h or -? option prints the following usage message. If you see this, you are ready to run Vecuum.


Vecuum: Contamination-induced false variant caller
Version: 1.0.1
Usage: java -jar Vecuum.jar -r <reference_genome.fa> -b <sample_sorted_indexed.bam> -B <bwa_path> -S <samtools_path> [options]

Input options:

-r PATH     path for indexed reference genome FASTA [.fa/.fasta]
-b PATH     path for indexed bam file [.bam]
-B PATH     path for bwa (e.g. /usr/bin/bwa)
-S PATH     path for samtools (e.g. /usr/bin/samtools)
-t PATH     path for indexed transcriptome FASTA [.fa/.fasta]
-g PATH     path for reference genome GTF file [.gtf]
-u PATH     path for unmapped bam file [.bam]
-R PATH     path for region of interest list [tab-delimited text (chr   start   end gene_name)]
-V PATH     path for suspicious variant list [tab-delimited text (chr   pos)]
-a STR      Assembly version of reference genome [hg19]

Analysis parameters for vector contamination assessment:

-n INT      Number of threads for bwa-mem [1]
-p INT      padding size [5]
-Q INT      mapQ threshold [30]
-l INT      minMatchLen [20]
-s INT      minVectorInsertSupportCnt [3]
-i INT      minVectorInsertSize [500]
-I INT      maxVectorInsertSize [500000]
-k          Skip vector checking step [must be used with -R or -A option]

Analysis parameters for false variant detection:

-A          Searching false variants for all exons regardless of vector contamination
-q INT      baseQ threshold [20]
-c INT      minVectorBalleleCnt [3]
-C INT      maxSampleBalleleCnt [3]
-f FLOAT    minVectorBAF [0.01]
-F FLOAT    maxSampleBAF [0.01]
-d INT      minSampleDepth [5]

Output options:

-o PATH     path for output directory [bam file directory]
-O          Report vector filtered bam file

Recommendation

If you have a list of genes that are plausible regions of contamination, we recommend using the -R and -k options to reduce computation time.

java -jar Vecuum.jar -r <reference_genome.fa> -b <sample_sorted_indexed.bam> -B <bwa_path> -S <samtools_path> -R <tab-delimited gene list (chr start end gene_name)> -k
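The region list passed to -R is a tab-delimited text file with the columns chr, start, end and gene_name. A minimal Python sketch for writing such a file is given below. The function name `write_region_file` is hypothetical, and the sketch assumes that no header line is expected and that chromosome names and coordinates follow the same conventions as the reference genome and BAM file; this page does not specify those details.

```python
import csv


def write_region_file(path, regions):
    """Write (chrom, start, end, gene_name) tuples as a tab-delimited file.

    Assumption: one region per line, no header line.
    """
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for chrom, start, end, gene_name in regions:
            start, end = int(start), int(end)
            if start > end:
                raise ValueError(f"start > end for {gene_name}: {start} > {end}")
            writer.writerow([chrom, start, end, gene_name])
```

For example, `write_region_file("regions.txt", [("chr1", 1000, 2000, "GENE_A")])` writes a single placeholder region; the coordinates and gene name here are illustrative only. The resulting file can then be given to Vecuum with -R.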

If you want to detect all false variants across the entire genome, we highly recommend using the -A and -k options rather than the default settings.

java -jar Vecuum.jar -r <reference_genome.fa> -b <sample_sorted_indexed.bam> -B <bwa_path> -S <samtools_path> -A -k
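When Vecuum is run from a pipeline, the recommended command lines can be assembled programmatically. The following Python sketch builds the argument list from the options documented on this page and enforces the stated rule that -k must be combined with -R or -A. The function `build_vecuum_command` and its parameter names are hypothetical helpers, not part of Vecuum.

```python
import shlex


def build_vecuum_command(reference, bam, bwa, samtools, *, jar="Vecuum.jar",
                         regions=None, all_exons=False,
                         skip_vector_check=False, threads=None,
                         output_dir=None):
    """Return the Vecuum command as a list of arguments."""
    # The usage text states that -k must be used with -R or -A.
    if skip_vector_check and not (regions or all_exons):
        raise ValueError("-k must be used with -R or -A")
    cmd = ["java", "-jar", jar,
           "-r", reference, "-b", bam, "-B", bwa, "-S", samtools]
    if regions:
        cmd += ["-R", regions]
    if all_exons:
        cmd.append("-A")
    if skip_vector_check:
        cmd.append("-k")
    if threads is not None:
        cmd += ["-n", str(threads)]
    if output_dir:
        cmd += ["-o", output_dir]
    return cmd


def format_command(cmd):
    """Render the argument list as a shell-quoted string for logging."""
    return shlex.join(cmd)
```

The returned list can be passed to `subprocess.run(cmd, check=True)`; building a list rather than a single string avoids shell quoting problems with paths that contain spaces.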


Inputs and options

Mandatory inputs:

There are four mandatory inputs for Vecuum.

Input                      Option   Description
Reference sequence         -r       FASTA-formatted reference sequence file. The reference must be BWA indexed.
Sample sequence data       -b       BAM-formatted alignment file for the sample. The BAM file must be (coordinate) sorted and indexed.
BWA executable file        -B       BWA executable file path (absolute path).
SAMtools executable file   -S       SAMtools executable file path (absolute path).
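Before starting a long run, it can help to confirm that the mandatory inputs meet the requirements listed above. The Python sketch below is a hypothetical pre-flight check, not part of Vecuum. It assumes the reference was indexed with `bwa index` using the default prefix (which writes .amb, .ann, .bwt, .pac and .sa files next to the FASTA) and that the BAM index is stored as either sample.bam.bai or sample.bai.

```python
from pathlib import Path

# Files written by `bwa index` next to the reference FASTA.
BWA_INDEX_SUFFIXES = (".amb", ".ann", ".bwt", ".pac", ".sa")


def missing_inputs(reference, bam, bwa, samtools):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    ref = Path(reference)
    if not ref.is_file():
        problems.append(f"reference FASTA not found: {ref}")
    for suffix in BWA_INDEX_SUFFIXES:
        if not Path(f"{ref}{suffix}").is_file():
            problems.append(f"missing BWA index file: {ref}{suffix}")

    bam_path = Path(bam)
    if not bam_path.is_file():
        problems.append(f"BAM file not found: {bam_path}")
    bai_candidates = (Path(f"{bam_path}.bai"), bam_path.with_suffix(".bai"))
    if not any(p.is_file() for p in bai_candidates):
        problems.append(f"missing BAM index (.bai) for: {bam_path}")

    # The table above asks for absolute paths to both executables.
    for name, exe in (("BWA", bwa), ("SAMtools", samtools)):
        exe_path = Path(exe)
        if not exe_path.is_absolute():
            problems.append(f"{name} path is not absolute: {exe_path}")
        elif not exe_path.is_file():
            problems.append(f"{name} executable not found: {exe_path}")
    return problems
```

The check only looks for files on disk; it does not verify that the BAM file is coordinate sorted.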


Options:

There are several options you can give to Vecuum for more accurate running.

Option   Default Value                            Description
-t       none (provided in the database folder)   FASTA-formatted transcriptome file for the detection of ve-reads. The FASTA file must be indexed.
-g       none (provided in the database folder)   GTF file for exon information.
-u                                                Path for the unmapped BAM file (if unmapped reads are stored separately, e.g. 1KG data).
-R                                                Tab-delimited text file listing regions of interest (chr start end gene_name). False variant detection is performed for these regions regardless of the result of the vector contamination assessment.
-V                                                Tab-delimited text file listing suspicious variants (chr pos). Statistics are reported for these variants regardless of the result of false variant detection.
-a       hg19                                     Assembly version of the reference genome used (hg19/GRCh37, hg38/GRCh38).
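The suspicious variant list for -V needs only two tab-delimited columns, chr and pos. If the variants of interest are already in a VCF file, the list can be derived from its first two columns (CHROM and POS). The Python sketch below illustrates this; the function name is hypothetical, and it assumes that Vecuum expects the same 1-based positions and chromosome names that the VCF uses, which this page does not state explicitly.

```python
def vcf_to_variant_list(vcf_lines):
    """Return 'chr<TAB>pos' strings for each distinct site in VCF text lines."""
    seen = set()
    variants = []
    for line in vcf_lines:
        # Skip blank lines and VCF header lines (## meta lines and #CHROM).
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos = fields[0], int(fields[1])
        if (chrom, pos) not in seen:
            seen.add((chrom, pos))
            variants.append(f"{chrom}\t{pos}")
    return variants
```

Writing the returned strings to a file, one per line, gives a list in the (chr pos) layout described above. Multi-allelic or duplicated sites are collapsed to a single entry because the list carries no allele information.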

Analysis parameters for vector contamination assessment:

Option   Default Value   Description
-n       1               Number of threads for BWA-MEM (used for remapping unmapped reads)
-p       5               Padding size from exon junctions for the classification of sample-originated reads
-Q       30              mapQ threshold
-l       20              Minimum match length for the detection of reads containing vector sequence [minMatchLen]
-s       3               Minimum number of vr-reads for the estimation of vector-contaminated regions [minVectorInsertSupportCnt]
-i       500             Minimum vector insert size [minVectorInsertSize]
-I       500000          Maximum vector insert size [maxVectorInsertSize]
-k                       Skip the vector contamination assessment step [must be used with -R or -A option]

Analysis parameters for false variant calling:

Option   Default Value   Description
-A                       Search false variants for all exons (do not limit the search space to the vector-contaminated regions)
-q       20              baseQ threshold
-c       3               Minimum vector B allele count [minVectorBalleleCnt]
-C       3               Maximum sample B allele count [maxSampleBalleleCnt]
-f       0.01            Minimum vector B allele frequency [minVectorBAF]
-F       0.01            Maximum sample B allele frequency [maxSampleBAF]
-d       5               Minimum sample depth [minSampleDepth]
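To make the roles of these thresholds concrete, the Python sketch below shows one way the five count and frequency cut-offs could be combined for a single candidate site. This is an illustration under stated assumptions, not Vecuum's actual implementation: it assumes that all conditions must hold together, that the comparisons are inclusive, and that B allele frequency is the B allele count divided by the read depth of the same read class. The class and function names are hypothetical, and the probability score that Vecuum reports is not modeled here.

```python
from dataclasses import dataclass


@dataclass
class SiteCounts:
    vector_b: int       # B allele count in reads classified as vector-originated
    vector_depth: int   # depth of vector-originated reads at the site
    sample_b: int       # B allele count in reads classified as sample-originated
    sample_depth: int   # depth of sample-originated reads at the site


def passes_thresholds(site, min_vector_b=3, max_sample_b=3,
                      min_vector_baf=0.01, max_sample_baf=0.01,
                      min_sample_depth=5):
    """Assumed combination of -c, -C, -f, -F and -d (defaults from the table)."""
    if site.vector_depth <= 0 or site.sample_depth <= 0:
        return False
    if site.sample_depth < min_sample_depth:
        return False
    vector_baf = site.vector_b / site.vector_depth
    sample_baf = site.sample_b / site.sample_depth
    return (site.vector_b >= min_vector_b
            and site.sample_b <= max_sample_b
            and vector_baf >= min_vector_baf
            and sample_baf <= max_sample_baf)
```

Under these assumptions, a site where the alternative allele is well supported by vector-originated reads but essentially absent from sample-originated reads passes, which matches the intent of flagging variants that come from the contaminant rather than from the sample.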

Output options:

Option   Default Value              Description
-o       Input bam file directory   Path for output directory
-O                                  Report vector filtered bam file


Output

Vecuum outputs the following files:

  1. BamFileName.total.vector.inserted.pos
    : List of the estimated genomic positions of contaminated regions. This file will be empty if a given sample is contamination-free.
  2. BamFileName.false.variants.call
    : List of final false variant candidates with their probability scores. This file is not created if the sample is contamination-free, and is empty if there are no false variants.
  3. BamFileName.vector.inserted.pos
    : Candidate list of vector insert sites.
  4. BamFileName.vector.contaminated.reads
    : List of reads containing vector sequence (vr-reads).
  5. BamFileName.vector.filtered.bam (optional)
    : BAM file with plausible vector reads filtered out.
  6. BamFileName.vector.bam (optional)
    : BAM file containing the plausible vector reads.
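The descriptions of the first two output files imply a simple way to interpret a default run from the files alone. The Python sketch below encodes that reading. It is a hypothetical helper: the `prefix` argument stands for whatever Vecuum uses as BamFileName (the exact form, for example with or without the .bam extension, is not stated here), and "empty" is assumed to mean a zero-byte file. Runs that skip the vector contamination assessment with -k may produce a different set of files.

```python
from pathlib import Path


def summarize_outputs(output_dir, prefix):
    """Classify a default Vecuum run from its two main output files."""
    out = Path(output_dir)
    positions = out / f"{prefix}.total.vector.inserted.pos"
    calls = out / f"{prefix}.false.variants.call"

    if not positions.exists():
        return "no results found"
    # An empty position list means the sample is contamination-free.
    if positions.stat().st_size == 0:
        return "no vector contamination detected"
    # The call file is absent or empty when no false variants are reported.
    if not calls.exists() or calls.stat().st_size == 0:
        return "contaminated regions found, no false variants reported"
    return "contaminated regions and false variant candidates reported"
```

A pipeline could use the returned label to decide whether downstream variant filtering is needed for a given sample.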