Virmid

Virmid (Virtual Microdissection for SNP calling) is a Java-based variant caller designed for disease-control matched samples. It is specialized for cases of within-individual contamination, where the disease sample cannot be sufficiently purified. Although this heterogeneity severely compromises the SNP calling rate, Virmid can uncover SNPs with low allele frequency by taking the level of contamination (alpha) into account. The important features of Virmid are:

  • Accurate estimation of the proportion of control sample in a (mixed) disease sample
  • Improved SNP and somatic mutation calling with regard to the estimated proportion
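
To make the role of alpha concrete, here is a minimal sketch (not Virmid's actual model) of how contamination lowers the expected allele frequency of a somatic variant. It assumes a diploid genome without copy-number changes and takes alpha to be the proportion of control DNA in the disease sample, as in the feature list above. The function name expected_somatic_af is hypothetical.

```python
def expected_somatic_af(alpha, tumor_af=0.5):
    """Expected allele frequency of a somatic variant in a mixed sample.

    alpha    -- assumed proportion of control (normal) DNA in the disease sample
    tumor_af -- allele frequency in the pure disease cells (0.5 = heterozygous)
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    # Control cells carry no somatic allele, so only the (1 - alpha)
    # disease fraction contributes variant reads.
    return (1.0 - alpha) * tumor_af


for alpha in (0.0, 0.3, 0.67):
    print(f"alpha={alpha:.2f}  expected AF={expected_somatic_af(alpha):.3f}")
```

Under these assumptions, with alpha around 0.67 a heterozygous somatic mutation is expected in only about one read in six, which is why a caller that assumes a pure sample tends to miss it.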

Get the most recent version of Virmid from the code page.

Update note (Last updated 05-12-2014)

Virmid-1.1.1 (05-12-2014) *Minor Update

Bug fixes:
1. Better handling of directory paths for Windows users.
2. Changes to the header of the VCF files.

Virmid-1.1.0 (11-7-2013) *Major update.

Bug fixes:
1. A major bug with the '-v' option has been fixed. (Please note that using mean/stdev coverage for sampling is discouraged in exome sequencing due to the large standard deviation.)
2. A major bug related to extremely high coverage regions has been fixed.
3. A few minor bugs, including unnecessary dot printing and typos in result files, have been fixed.

Changes
1. The -c/C options have been replaced by -c1/C1/c2/C2 to give separate coverage limits for each sample. 'c' sets the minimum coverage and 'C' the maximum coverage; '1' refers to the normal sample and '2' to the disease sample. So 'c1' limits the minimum coverage for the normal sample, 'C1' the maximum coverage for the normal sample, 'c2' the minimum coverage for the disease sample and 'C2' the maximum coverage for the disease sample. The -c/C options are deprecated.
2. A new parameter '-M' has been added to control the "maximum considered read depth for each nucleotide in variant calling". It was introduced to prevent unexpected memory errors or time delays caused by abnormal regions; for various reasons, some regions are covered by more than 10000 reads. The default value for 'M' is 500, which means that at most 500 reads are considered at each nucleotide when calling variants (this is usually enough). You can always increase the limit, or make it unlimited (give -1 for M).
3. Output files now use the ".vcf" extension.
4. Results are now separated into two different files, 'all' and 'passed'. The 'all' files are the same as in the previous version. The 'passed' files contain only the entries that passed all the Virmid filters.
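
The effect of the '-M' limit described in item 2 can be sketched as follows. This only illustrates the documented semantics (default 500, -1 for unlimited); which reads Virmid keeps when a position exceeds the limit is not stated here, and the function name considered_depth is hypothetical.

```python
def considered_depth(depth, max_depth=500):
    """Number of reads used at one position under a -M style cap.

    max_depth mirrors the documented default of 500; -1 means unlimited.
    """
    if max_depth == -1:
        return depth
    return min(depth, max_depth)


for depth in (80, 500, 12000):
    print(depth, considered_depth(depth), considered_depth(depth, -1))
```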

Virmid-1.0.2 (5-29-2013)

Bug fixes:
1. There was a bug with the '-w' (setting working directory) option. (We thank Dr. Malachi Griffith at WUSTL!)
2. There was a problem with saving sampling positions to BAM files.
3. There was a minor bug in determining indel proximity of the sampled position.

Other improvements:
1. There was a slight speed improvement.
2. Predicted genotypes for germline mutations are now written in the final output. The ID in the VCF file is 'gt', and its value is 'h' (heterozygous) or 'H' (homozygous) followed by the probability of the genotype.

Installation of Virmid

Virmid was developed using the 64-bit Java JDK 7. To run Virmid, Java Runtime Environment (JRE) version 1.7.x or later is required. Download the most recent version of Virmid from the code page and extract the gzipped archive.

tar -zxvf virmid-1.x.y.tar.gz

This generates a new directory, virmid-1.x, which contains the following:

  1. Virmid.jar
    executable JAR file of Virmid
  2. README.TXT
    Nothing much in this file currently. Please refer to this page instead.
  3. lib/
    directory containing other JAR libraries for Virmid

Running Virmid

To run Virmid, simply run the executable JAR file as follows:

java -jar Virmid.jar

This prints the following usage message. If you see it, you are ready to run Virmid.


Virmid: VIRtual MIcroDissection for sensitive SNP profiling in paired control-disease samples.
Version: 1.00

Usage: java -jar Virmid.jar -R <reference.fa> -D <disease_sample.bam> -N <normal_sample.bam> [options]

Input options:

-w PATH     working directory [directory of disease input]
-a          exit after estimating alpha [false]
-f          do not reuse previous calls/trained models [false]
-r INT      read length [Virmid's guess]
-e INT      edit distance used in the alignment [4]
-t INT      maximum number of threads [1]

Sampling options:

-p  INT      maximum number of sampling points for training [10000000]
-q  INT      minimum mapping quality for sampling points [null]
-c1 INT     minimum depth of coverage for normal sampling points [null]. exclusive use with -v
-C1 INT     maximum depth of coverage for normal sampling points [null]. exclusive use with -v
-c2 INT     minimum depth of coverage for disease sampling points [null]. exclusive use with -v
-C2 INT     maximum depth of coverage for disease sampling points [null]. exclusive use with -v
-v FLOAT    fold of standard deviation for sampling points [0]. exclusive use with -c/-C.
-M  INT     maximum read-depth for consideration in each nucleotide [500] (-1 for unlimited).

Output options:

-o FILE     header of output files [<tumorsample.bam>]

Inputs and options

Mandatory inputs:

There are three mandatory inputs for Virmid.

Input               Option  Description
Reference sequence  -R      FASTA formatted reference sequence file. The reference must be indexed.
Disease data        -D      BAM formatted alignment file for disease sample. The BAM file must be (coordinate) sorted and indexed (e.g. samtools sort)
Normal data         -N      BAM formatted alignment file for normal (control) sample. The BAM file must be (coordinate) sorted and indexed.
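
Because all three inputs must be indexed, a quick pre-flight check can save a failed run. The following is a minimal sketch, assuming the usual samtools naming conventions (<reference>.fai for the FASTA index, and <file>.bam.bai or <file>.bai for a BAM index). Which of the two BAM index names Virmid itself looks for is not stated here, and the function name missing_indexes is hypothetical.

```python
import os


def missing_indexes(reference, bams):
    """Return the expected index files that are not present on disk.

    Assumes the usual samtools naming: <ref>.fai for the FASTA index and
    either <file>.bam.bai or <file>.bai for a BAM index.
    """
    missing = []
    if not os.path.exists(reference + ".fai"):
        missing.append(reference + ".fai")
    for bam in bams:
        candidates = [bam + ".bai", os.path.splitext(bam)[0] + ".bai"]
        if not any(os.path.exists(path) for path in candidates):
            missing.append(candidates[0])
    return missing


if __name__ == "__main__":
    # File names taken from the examples on this page.
    for path in missing_indexes("hg19.fa", ["tumor.bam", "normal.bam"]):
        print("missing index:", path)
```

An empty result means every expected index was found; anything listed can typically be created with samtools faidx (FASTA) or samtools index (BAM).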

Options:

There are several options you can give to Virmid for more accurate and convenient running.

Option  Default value                          Description
w       (directory containing disease sample)  Working directory of Virmid. All outputs are saved in this directory.
a       false                                  Alpha-only mode: Virmid stops after inferring sample impurity.
f       false                                  Forced recalculation of intermediate files. From the second run on, Virmid tries to reuse pre-calculated data (e.g. sampling positions) to run faster; this flag makes Virmid start from scratch.
r       (Virmid's guess)                       Read length.
e       4                                      Maximum edit distance allowed for mapping in the alignment.
t       1                                      Maximum number of threads for multiprocessing. Currently the actual performance gain is not significant (use is discouraged).
p       10000000                               Maximum number of sampling points for training prior probabilities of genotypes.
q       0                                      Minimum mapping quality for sampling point selection. Reads with a lower mapping quality are not used for sample impurity inference.
c1      null                                   Minimum depth of coverage for sampling point selection in the normal sample. Genomic positions with a lower read depth are not used for sample impurity inference. Cannot be combined with -v.
C1      null                                   Maximum depth of coverage for sampling point selection in the normal sample. This option can prevent incorrect sample impurity estimates caused by other events such as copy number variation or ambiguous mapping. Cannot be combined with -v.
c2      null                                   Minimum depth of coverage for sampling point selection in the disease sample. Genomic positions with a lower read depth are not used for sample impurity inference. Cannot be combined with -v.
C2      null                                   Maximum depth of coverage for sampling point selection in the disease sample. This option can prevent incorrect sample impurity estimates caused by other events such as copy number variation or ambiguous mapping. Cannot be combined with -v.
v       null                                   Maximum fold of standard deviation allowed for a sampling point. Genomic positions with read depth outside MeanCoverage ± v × StandardDeviation are ignored for impurity inference. Cannot be combined with -c1/-C1/-c2/-C2.
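
Since the option descriptions treat -v and the -c1/-C1/-c2/-C2 limits as exclusive of each other, a wrapper script can reject the combination before launching Java. The sketch below is one way to do that; the helper build_sampling_args is hypothetical and only assembles arguments, it does not run Virmid.

```python
COVERAGE_OPTIONS = ("c1", "C1", "c2", "C2")


def build_sampling_args(**options):
    """Turn sampling options into command-line arguments for Virmid.jar.

    Raises ValueError if -v is combined with any of -c1/-C1/-c2/-C2,
    which are documented as exclusive of each other.
    """
    given = {name: value for name, value in options.items() if value is not None}
    if "v" in given and any(name in given for name in COVERAGE_OPTIONS):
        raise ValueError("-v cannot be combined with -c1/-C1/-c2/-C2")
    args = []
    for name, value in given.items():
        args += ["-" + name, str(value)]
    return args


print(build_sampling_args(c1=20, C1=100, c2=20, C2=100))
```

The returned list can be appended to the mandatory part of the command (java -jar Virmid.jar -R ... -D ... -N ...) when it is passed to a process launcher.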

Examples:

java -jar Virmid.jar -R hg19.fa -D tumor.bam -N normal.bam -c1 20 -C1 100 -c2 20 -C2 100

will find all somatic mutations between tumor.bam and normal.bam based on the hg19 assembly. Positions with read depth less than 20 or more than 100 will be ignored for sample impurity inference.

java -jar Virmid.jar -R Homo_Sapiens_assembly19.fasta -D brain.bam -N blood.bam -v 3 -af

will try to infer the sample impurity of the brain.bam data. Virmid will only output the sample impurity and stop (-a option). Any previously calculated information will be ignored (-f option). Virmid will calculate the mean (μ) and standard deviation (σ) of read depth, and positions with read depth <μ-3σ or >μ+3σ will be ignored.
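
The μ±vσ window used by the -v option is simple arithmetic and can be sketched as below. Clamping the lower limit at zero is a choice made for this illustration, not documented Virmid behavior, and the function name coverage_window is hypothetical.

```python
def coverage_window(mean, stdev, fold):
    """Return (low, high) read-depth limits for mean ± fold * stdev."""
    low = max(0.0, mean - fold * stdev)  # depth cannot be negative
    high = mean + fold * stdev
    return low, high


# Mean and standard deviation of the disease sample in the
# sample report further down this page.
low, high = coverage_window(72.0158, 90.1898, 3)
print(f"keep positions with depth in [{low:.1f}, {high:.1f}]")
```

With these values the standard deviation exceeds the mean, so the lower limit collapses to zero and only very deep positions are excluded. This illustrates the note in the 1.1.0 update that mean/stdev based sampling is discouraged when the standard deviation is large.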

Output

Virmid writes the following output files:

  1. $DiseaseSampleName.virmid.som.all.vcf, $DiseaseSampleName.virmid.som.passed.vcf
    List of (filtered) somatic mutations found by Virmid. The output format is standard VCF4.1.
  2. $DiseaseSampleName.virmid.loh.all.vcf, $DiseaseSampleName.virmid.loh.passed.vcf
    List of (filtered) loss of heterozygosity events found by Virmid. This is also in VCF4.1 format.
  3. $DiseaseSampleName.virmid.germ.all.vcf, $DiseaseSampleName.virmid.germ.passed.vcf
    List of (filtered) germline variations found by Virmid in VCF4.1 format.
  4. $DiseaseSampleName.virmid.report
    Report of basic information about the Virmid run, including the parameters used, input files, intermediate results, and output destination.
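
For scripting around Virmid, the result file names can be derived from the disease BAM name. The sketch below assumes that $DiseaseSampleName is the disease BAM file name including its .bam extension, as in the sample report on this page, and that the -o option (which changes the output header) is not used. The function name expected_outputs is hypothetical.

```python
import os

CATEGORIES = ("som", "loh", "germ")


def expected_outputs(disease_bam, workdir=None):
    """List the result files Virmid is documented to write for one run.

    Assumes the output prefix is the disease BAM file name (including
    '.bam'), as in the sample report on this page.
    """
    if workdir is None:
        # Documented default: the directory containing the disease input.
        workdir = os.path.dirname(os.path.abspath(disease_bam))
    prefix = os.path.join(workdir, os.path.basename(disease_bam))
    paths = [f"{prefix}.virmid.{cat}.{kind}.vcf"
             for cat in CATEGORIES for kind in ("all", "passed")]
    paths.append(f"{prefix}.virmid.report")
    return paths


for path in expected_outputs("2341T_recal.bam", "/data/test"):
    print(path)
```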

The following files are generated by Virmid for its internal use:

  1. $DiseaseSampleName.virmid.sample.control.bam and $DiseaseSampleName.virmid.sample.disease.bam
    BAM files of selected positions for sample impurity inference. Virmid automatically looks for these files to speed up the run. When these files do not exist (e.g. on the first run) or Virmid is run with the -f option, Virmid generates them again.
  2. $DiseaseSampleName.virmid2.gm
    A saved joint genotype matrix. Virmid also tries to reuse this file for faster running. This file is also ignored when the -f option is given.

The VCF file

A typical output file looks like this:


##fileformat=VCFv4.1
##filedate=2013/3/28
##source=Virmid
##reference=/data/Reference/hg19/hg19.fa
##INFO=<ID:NDP,Number=1,Type=Integer,Description="Read depth in control sample">
##INFO=<ID:NAC,Number=1,Type=Integer,Description="Allele count in control sample">
##INFO=<ID:DDP,Number=1,Type=Integer,Description="Read depth in disease sample">
##INFO=<ID:DAC,Number=1,Type=Integer,Description="Allele count in disease sample">
##INFO=<ID:mq,Number=1,Type=Float,Description="Mean mapping quality of reference reads">
##INFO=<ID:mqmis,Number=1,Type=Float,Description="Mean mapping quality of mismatch containing reads">
##INFO=<ID:mq30,Number=1,Type=Float,Description="Ratio of reads whose mapping quality is less than 30">
##INFO=<ID:off,Number=1,Type=Integer,Description="Expected position of alternative allele in the read (read length/4)">
##INFO=<ID:offmis,Number=1,Type=Float,Description="Mean end-position of alternative allele in the read">
##INFO=<ID:prx,Number=1,Type=Float,Description="Ratio of reads that are closely located nearby indels">
##INFO=<ID:prxmis,Number=1,Type=Float,Description="Ratio of alternative alleles that are closely located nearby indels">
##INFO=<ID:majorAF,Number=1,Type=Float,Description="Major allele frequency">
##INFO=<ID:bqmis,Number=1,Type=Float,Description="Mean base call quality of alternative alleles">
##INFO=<ID:nm,Number=1,Type=Float,Description="Mean number of mismatches per read">
##INFO=<ID:clip,Number=1,Type=Float,Description="Ratio of soft (hard) clipped read">
##FILTER=<ID=MQ,Descrption="Mean mapping quality of altered allele is significantly bad">
##FILTER=<ID=OFF,Descrption="Altered alleles are located near both ends of reads">
##FILTER=<ID=PRX,Descrption="Indels found nearby location">
##FILTER=<ID=TRI,Descrption="Second allels found in too much of the reads">
##FILTER=<ID=BQ,Descrption="Mean base quality of altered allele is too low">
##FILTER=<ID=NM,Descrption="Number of mismaches per read is too high">
##FILTER=<ID=AC,Descrption="Allele count and/or allele frequency is insignificant">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
chr1    762273  .       G       A       100     PASS    NDP=21;NAC=0;DDP=18;DAC=18
chr1    866319  .       G       A       100     PASS    NDP=17;NAC=0;DDP=12;DAC=12
chr1    876499  .       A       G       100     PASS    NDP=36;NAC=0;DDP=31;DAC=31
chr1    877715  .       C       G       100     PASS    NDP=49;NAC=0;DDP=48;DAC=48
chr1    877831  .       T       C       100     PASS    NDP=43;NAC=0;DDP=58;DAC=58
chr1    880238  .       A       G       100     PASS    NDP=28;NAC=0;DDP=24;DAC=24
chr1    880641  .       C       A       41      PRX;AC  NDP=12;NAC=0;DDP=13;DAC=2;prx=0.8;prxmis=0;naf=0;daf=0.15

The general VCF format is well explained on the 1000 Genomes webpage. For more details on the filters and their corresponding values, please refer to our paper.
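
As an illustration of how the INFO values in these files can be used, the following sketch reads data lines and computes the disease allele frequency as DAC/DDP. It relies only on the NDP/NAC/DDP/DAC fields shown in the example above; the function names are hypothetical and the two records are copied from that example.

```python
def parse_info(info):
    """Split a VCF INFO string such as 'NDP=21;NAC=0' into a dict."""
    fields = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        fields[key] = value
    return fields


def disease_allele_frequencies(lines):
    """Yield (chrom, pos, filter, DAC/DDP) for each data line."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, pos, _id, ref, alt, qual, filt, info = line.split()[:8]
        fields = parse_info(info)
        depth = int(fields["DDP"])
        freq = int(fields["DAC"]) / depth if depth else 0.0
        yield chrom, int(pos), filt, freq


example = [
    "#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO",
    "chr1    762273  .       G       A       100     PASS    NDP=21;NAC=0;DDP=18;DAC=18",
    "chr1    880641  .       C       A       41      PRX;AC  NDP=12;NAC=0;DDP=13;DAC=2;prx=0.8;prxmis=0;naf=0;daf=0.15",
]
for chrom, pos, filt, freq in disease_allele_frequencies(example):
    print(chrom, pos, filt, f"{freq:.2f}")
```

For the last record, DAC/DDP is 2/13, which agrees with the daf=0.15 value reported in its INFO column. Keeping only rows whose FILTER is PASS should correspond to the contents of the 'passed' files.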

Report file

Virmid writes a .virmid.report file for each run. It contains basic information about the run, including the estimated impurity (alpha) and the locations of the final outputs.

# Running Virmid ver 1.1.0
# Parameters in this run.

Input Files:
Disease sample: 2341T_recal.bam
Mean Coverage: 72.01583333333348
Coverage Standard Deviation: 90.1898035603241
Normal sample: 2341G_recal.bam
Mean Coverage: 36.41749999999998
Coverage Standard Deviation: 45.896614636571385
Reference genome: hg19.fa
Input parameters:
Read Length: 100
Maximum coverage for disease sampling: 120
Minimum coverage for disease sampling: 20
Maximum coverage for normal sampling: 100
Minimum coverage for normal sampling: 10
Minimum mapping quality for sampling: 17
Output parameters:
Working directory: /data/test/
Report file: /data/test/2341T_recal.bam.virmid.report

Estimated alpha: 0.6687711902178711

Output files:
All somatic mutations: /data/test/2341T_recal.bam.virmid.som.all.vcf
Filtered somatic mutations: /data/test/2341T_recal.bam.virmid.som.passed.vcf
All germline mutations: /data/test/2341T_recal.bam.virmid.germ.all.vcf
Filtered germline mutations: /data/test/2341T_recal.bam.virmid.germ.passed.vcf
All loss of heterozygosity: /data/test/2341T_recal.bam.virmid.loh.all.vcf
Filtered loss of heterozygosity: /data/test/2341T_recal.bam.virmid.loh.passed.vcf
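
When Virmid is run in a pipeline (for example with the -a option), the estimated alpha can be pulled out of the report text. This is a minimal sketch that assumes the "Estimated alpha:" line appears exactly as in the sample report above; the function name read_alpha is hypothetical.

```python
import re


def read_alpha(report_text):
    """Return the estimated alpha from report text, or None if absent."""
    match = re.search(r"^Estimated alpha:\s*([0-9.eE+-]+)\s*$",
                      report_text, re.MULTILINE)
    return float(match.group(1)) if match else None


sample = """Report file: /data/test/2341T_recal.bam.virmid.report

Estimated alpha: 0.6687711902178711

Output files:
"""
print(read_alpha(sample))
```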