COVA: Comparison of Variants and functional annotation for next-generation sequencing

What is COVA?

COVA is a variant annotation and comparison tool for next-generation sequencing. It annotates the effects of variants on genes and compares them among multiple samples, which helps to pinpoint the causal variation(s) underlying a phenotype.

Features

  • Comparable: You can compare variants among multiple samples.
  • Multiple species: Supports multiple codon tables.
  • Structural variants: Annotates structural variants.

Typical usage

Input: The inputs are predicted variants (SNPs, insertions, deletions and structural variants), usually obtained as the result of a sequencing experiment. COVA can annotate the following variant file formats: SAMtools pileup, VCF, MAQ, BreakDancer, and GFF3-formatted gene coverage generated by coverageBed.

Output: COVA analyzes the input variants. It annotates the effects of variants on genes and compares them among multiple samples. Output files are comma-delimited (CSV), so you can analyze the results in Excel.

Getting Started

Availability and requirements

  • Operating system: Platform Independent. Tested on Mac OS X and Red Hat Linux.
  • Programming language: Ruby 1.8.7 or 1.9.x
  • Other requirements: RubyGems package management software and the following libraries: BioRuby 1.4.x.
  • License: MIT
  • Any restrictions to use by non-academics: None

Installation

Before you use COVA, install BioRuby with the following command:

gem install bio

You can download the program from the “Files” page. Then uncompress the ZIP file and copy its contents to wherever you want the program installed. On a Unix or Mac system, the commands would be:

unzip COVA_version.zip
mv COVA_version /path/to/install

The installation can be tested by running the following command, which should print the list of available options.

ruby /path_to_COVA/cova.rb -h

Preparation of reference files

COVA can use annotation data sets in GenBank format, which are easily downloadable from the NCBI website. Once you have downloaded the GenBank file(s), you have to tell the program which reference files to use. Create a tab-delimited text file 'reflist.txt' with content like the following.

#Chr    Codon    Path
chr     11       /path_to_genbank/NC_000964.gbk
p1      11       /path_to_genbank/NC_0009xx.gbk
p2      11       /path_to_genbank/NC_0009xx.gbk

Chromosome names must match those used in the variant files.
You have to specify the genetic code (codon table) number for each chromosome.
Table 11 is used for Bacteria, Archaea, prokaryotic viruses and chloroplast proteins. Details of each table are available at http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi
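As a minimal sketch of the layout described above, the following Ruby snippet parses reflist-style text and checks that each row has three tab-separated fields and a numeric codon table. It is not part of COVA: the helper name `parse_reflist` and the sample text are hypothetical, and COVA's own parsing may differ.

```ruby
# Hypothetical helper, not part of COVA: parse reflist-style text into hashes.
def parse_reflist(text)
  lines = text.each_line.map(&:chomp)
  lines = lines.reject { |l| l.strip.empty? || l.start_with?("#") }
  lines.map do |line|
    chr, codon, path = line.split("\t", 3)
    # A row aligned with spaces instead of tabs ends up here.
    raise ArgumentError, "expected 3 tab-separated fields: #{line.inspect}" if path.nil?
    raise ArgumentError, "codon table must be a number: #{codon.inspect}" unless codon =~ /\A\d+\z/
    { chr: chr, codon_table: codon.to_i, path: path }
  end
end

# Sample text mirroring the first row of the example above.
sample = "#Chr\tCodon\tPath\nchr\t11\t/path_to_genbank/NC_000964.gbk\n"
refs = parse_reflist(sample)
refs.each { |r| puts "#{r[:chr]} uses codon table #{r[:codon_table]}: #{r[:path]}" }
```

A check like this can catch rows that were aligned with spaces rather than tabs before the file is handed to COVA.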

Preparation of variant files

COVA supports the following variant file types:

Format                                                   File type*
VCF genotype calling format (.vcf file)                  vcf
SAMtools genotype-calling pileup format (.pileup file)   pileup
MAQ genotype calling format (cns.snp)                    maq
BreakDancer structural variations format                 breakd
GFF3 format coverage of gene generated by coverageBed    gffcov

*File type is specified in ‘varlist.txt’ described below.

Once you have prepared the variant file(s), you have to tell the program which variant files correspond to each sample. You can specify several types of variant file for each sample, and you can specify several samples. Create a tab-delimited text file 'varlist.txt' with content like the following.

#Name   Filetype Path
wt      pileup  /path_to_file/wt.pileup
wt      breakd  /path_to_file/wt.breakdancer
wt      gffcov  /path_to_file/wt.coverage.gff
mut1    pileup  /path_to_file/mut1.pileup
mut1    breakd  /path_to_file/mut1.breakdancer
mut1    gffcov  /path_to_file/mut1.coverage.gff
mut2    pileup  /path_to_file/mut2.pileup
mut2    breakd  /path_to_file/mut2.breakdancer
mut2    gffcov  /path_to_file/mut2.coverage.gff

The first sample (‘wt’ in this example) is treated as the parental sample. When COVA compares variants among all samples, variants shared with the parental sample are flagged.
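The varlist layout can be sketched in Ruby as follows: rows are grouped by sample name, file types are checked against the five types listed earlier, and the first sample encountered is taken as the parental one. The helper `parse_varlist` is hypothetical and only illustrates the file structure; it is not COVA's own code.

```ruby
# Hypothetical helper, not part of COVA: group varlist rows by sample name.
KNOWN_TYPES = %w[vcf pileup maq breakd gffcov].freeze

def parse_varlist(text)
  samples = {} # hashes keep insertion order (Ruby 1.9+), so the first key is the first sample
  text.each_line do |raw|
    line = raw.chomp
    next if line.strip.empty? || line.start_with?("#")
    name, type, path = line.split("\t", 3)
    raise ArgumentError, "expected 3 tab-separated fields: #{line.inspect}" if path.nil?
    raise ArgumentError, "unknown file type: #{type.inspect}" unless KNOWN_TYPES.include?(type)
    (samples[name] ||= {})[type] = path
  end
  samples
end

# A shortened version of the example above, joined with real tab characters.
rows = [
  %w[wt pileup /path_to_file/wt.pileup],
  %w[wt breakd /path_to_file/wt.breakdancer],
  %w[mut1 pileup /path_to_file/mut1.pileup],
  %w[mut1 breakd /path_to_file/mut1.breakdancer]
]
text = "#Name\tFiletype\tPath\n" + rows.map { |r| r.join("\t") }.join("\n")

samples = parse_varlist(text)
parental = samples.keys.first
puts "parental sample: #{parental}"
samples.each { |name, files| puts "#{name}: #{files.keys.join(', ')}" }
```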

Get it started

Once the two tab-delimited text files 'reflist.txt' and 'varlist.txt' are available, you can annotate and compare the variant files:

ruby /path_to_COVA/cova.rb -o outdir -r reflist.txt -v varlist.txt

Output files will be generated in the ‘outdir’ directory.

Output files

Output file 1 (annotated variant files of each sample)

The first file contains annotation for all variants (SNP/InDel), such as the type and probability of each variant and its effect on genes. This file is generated from the vcf/pileup variant files of all samples. It is a comma-delimited file, so you can open it in Excel.

The format is comma-separated columns:

Column Notes
Source Sample name
Chr Chromosome name
Pos One based position
VarType Type of variant {snp, insertion, deletion}
Refbase Reference base. If AltBase is indel, RefBase is shown as *
Altbase Alternate non-reference alleles. It only takes the strongest non-reference allele.
Qual Quality score. Pileup: consensus quality; VCF: QUAL score; MAQ: consensus quality
NumRead The number of reads covering the site
IsHetero TRUE if the variant is heterozygous; FALSE if the variant is homozygous
RatioAllele The fraction of reads supporting variant allele
VarInfo Filter information especially for VCF
VarSpectrum6 Variant spectrum shown in six types.
VarSpectrum12 Variant spectrum shown in 12 types.
VarSpectType Transition, transversion
Subtract The same variant of the parental sample is flagged as TRUE. The parental strain is specified as the first sample of ‘varlist.txt’
PosType Type of position defined in Genbank feature {intergenic, upstream, downstream, CDS, rRNA, tRNA, ncRNA, etc.}
FeatureCoord Relative position within a feature. For upstream/downstream positions, the distance from the neighboring feature.
FeatureCoordProp Proportion of FeatureCoord.
CodonCoord Position of codon in CDS
CodonLetter {1,2,3}
ReferenceCodon Codon of reference
ReferenceAA Amino acid of reference
VariantCodon Codon of variant allele
VariantAA Amino acid of variant allele
ChangesAA TRUE: non-synonymous mutation (including InDel); FALSE: synonymous mutation
FunctionalClass Effect of variant in CDS. {silent, missense, out/in-frame insertion/deletion}
LocusTag Locus tag from Genbank file
GeneName Gene name from Genbank file
Product Product from Genbank file
Note Note qualifier from Genbank file
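Because the output is plain CSV, it can also be filtered with a short script instead of Excel. The sketch below keeps variants that are absent from the parental sample (Subtract is FALSE) and that change an amino acid (ChangesAA is TRUE). It assumes the header row uses the column names listed above; the rows, positions and gene names are made up for illustration, and only a subset of columns is shown.

```ruby
require "csv"

# Illustrative rows only: values and gene names are invented, and the header
# row is assumed to match the column names in the table above.
csv_text = <<~CSV
  Source,Chr,Pos,VarType,Subtract,ChangesAA,GeneName
  wt,chr,1200,snp,TRUE,TRUE,geneA
  mut1,chr,1200,snp,TRUE,TRUE,geneA
  mut1,chr,5300,snp,FALSE,TRUE,geneB
  mut1,chr,8100,deletion,FALSE,FALSE,
CSV

rows = CSV.parse(csv_text, headers: true)

# Keep variants not shared with the parental sample that change an amino acid.
candidates = rows.select { |r| r["Subtract"] == "FALSE" && r["ChangesAA"] == "TRUE" }

candidates.each do |r|
  puts "#{r['Source']} #{r['Chr']}:#{r['Pos']} #{r['VarType']} #{r['GeneName']}"
end
```

For a real output file, `CSV.read(path, headers: true)` would replace the inline text; the actual file name depends on the COVA run and is not assumed here.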

Output file 2 (comparison of variants among samples)

The second file is a comparison table of variants (SNP/InDel) among multiple samples, which helps to pinpoint the causal variation(s) underlying a phenotype. This file is generated from the vcf/pileup variant files of all samples. It contains the annotated variants from all samples, such as the type and probability of each variant and its effect on genes. Variants observed in the parental strain are flagged, so you can subtract parental variations. It is a comma-delimited file, so you can open it in Excel.

The format is comma-separated columns:

Column Notes
Chr Chromosome name
Pos One based position
PosType Type of position defined in Genbank feature {intergenic, upstream, downstream, CDS, rRNA, tRNA, ncRNA, etc.}
VarType Type of variant {snp, insertion, deletion}
Subtract The same variant of the parental sample is flagged as TRUE. The parental strain is specified as the first sample of ‘varlist.txt’
IncludeHetero TRUE if at least one sample includes a heterozygous variant
AllSameVar TRUE if all samples include the same variant
NumSample Number of samples having the given variant
ChangesAA Is amino acid changed? {TRUE, FALSE}
LocusTag Locus tag from Genbank file
GeneName Gene name from Genbank file
Product Product from Genbank file
Note Note qualifier from Genbank file
SampleName_V Annotated variant reported in each sample.
    SNP in CDS:
        T62N (185C>A): OldAA_AApos_NewAA (NApos_OldNA>NewNA)
        * means a stop codon, # means a non-synonymous change, % means a heterozygous variant.
    Insertion/deletion in CDS (frame-shift variant):
        #509FS (1525>-A): AApos (NApos>Ins/Del)
    Variant in non-coding region:
        SNP: 5A>C: NApos_OldNA>NewNA
        Insertion/deletion: 5>-C: NApos>Ins/Del
SampleName_R Read coverage and variant fraction for each variant.
    70 (0.92): NumReads (Fraction of variant)
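The cell notations above can be unpacked with regular expressions. The following Ruby sketch handles only the example shapes shown in this section: a CDS SNP, a CDS frame-shift indel, and a read-coverage cell. The helper names are hypothetical. Two details are assumptions rather than documented behavior: that the flag characters # and % appear as a prefix (as in the #509FS example), and that insertions are written with a + sign where the deletion example uses a - sign.

```ruby
# Hypothetical parsers, not part of COVA; only the documented example shapes are handled.

# "T62N (185C>A)": OldAA, AApos, NewAA (NApos, OldNA > NewNA)
CDS_SNP = /\A([%#]*)([A-Z*])(\d+)([A-Z*]) \((\d+)([ACGT])>([ACGT])\)\z/
# "#509FS (1525>-A)": AApos (NApos > Ins/Del); the "+" for insertions is an assumption
CDS_FS = /\A([%#]*)(\d+)FS \((\d+)>([+-])([ACGT]+)\)\z/
# "70 (0.92)": NumReads (fraction of reads supporting the variant)
READ_CELL = /\A(\d+) \(([\d.]+)\)\z/

def parse_variant_cell(cell)
  if (m = CDS_SNP.match(cell))
    { kind: :cds_snp, flags: m[1], old_aa: m[2], aa_pos: m[3].to_i, new_aa: m[4],
      na_pos: m[5].to_i, old_na: m[6], new_na: m[7] }
  elsif (m = CDS_FS.match(cell))
    { kind: :cds_frameshift, flags: m[1], aa_pos: m[2].to_i, na_pos: m[3].to_i,
      sign: m[4], bases: m[5] }
  end
  # Falls through to nil for shapes this sketch does not cover.
end

def parse_read_cell(cell)
  m = READ_CELL.match(cell)
  return nil unless m
  { reads: m[1].to_i, fraction: m[2].to_f }
end

snp = parse_variant_cell("T62N (185C>A)")
fs = parse_variant_cell("#509FS (1525>-A)")
cov = parse_read_cell("70 (0.92)")
puts snp.inspect, fs.inspect, cov.inspect
```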

Output file 3 (comparison of structural variants among samples)

The third file is a comparison table of structural variants among multiple samples, which helps to pinpoint the causal variation(s) underlying a phenotype. This file is generated from the BreakDancer output files of all samples. It consists of one line per region (gene and upstream/downstream region) that includes a structural variant, and contains annotations together with the type and probability of each structural variant. Regions observed in the parental strain are flagged, so you can subtract parental variations. It is a comma-delimited file, so you can open it in Excel.

The format is comma-separated columns:

Column Notes
Chr Chromosome name
Start Start position of feature
End End position of feature
Strand Strand of feature
PositionType Type of position defined in Genbank feature {intergenic, upstream, downstream, CDS, rRNA, tRNA, ncRNA, etc.}
LocusTag Locus tag from Genbank file
GeneName Gene name from Genbank file
Product Product from Genbank file
Note Note qualifier from Genbank file
Parent TRUE: if the parental sample has a variant in the given feature
NumSample Number of samples having the given SV in the feature
SampleName Structural variant information from BreakDancer
    2653331-2701351:DEL(48081:99); Pos1-Pos2:Type(Size:Score)
    Same value appears twice in first and second position.
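The structural variant notation Pos1-Pos2:Type(Size:Score) can be split into its fields with a single regular expression. This is a minimal sketch based only on the example above; the helper `parse_sv` is hypothetical, and tolerating a trailing semicolon is an assumption about how entries are delimited.

```ruby
# Hypothetical parser, not part of COVA, for "Pos1-Pos2:Type(Size:Score)".
SV_PATTERN = /\A(\d+)-(\d+):([A-Z]+)\((-?\d+):(\d+)\)\z/

def parse_sv(entry)
  # Assumption: an entry may carry a trailing ";" as in the example above.
  m = SV_PATTERN.match(entry.strip.chomp(";"))
  return nil unless m
  { pos1: m[1].to_i, pos2: m[2].to_i, type: m[3], size: m[4].to_i, score: m[5].to_i }
end

sv = parse_sv("2653331-2701351:DEL(48081:99);")
puts "#{sv[:type]} from #{sv[:pos1]} to #{sv[:pos2]}, size #{sv[:size]}, score #{sv[:score]}"
```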

Output file 4 (comparison of gene coverage among samples)

The fourth file is a comparison table of gene coverage among multiple samples, which helps to find deleted genes. This file is generated from the coverageBed output files of all samples. It consists of one line per region (CDS, rRNA, tRNA, etc.) and contains annotations and gene coverage. Gene coverage is the fraction of the gene that is covered by reads, and ranges from 0.0 to 1.0. It is a comma-delimited file, so you can open it in Excel.

The format is comma-separated columns:

Column Notes
Chr Chromosome name
Start Start position of feature
End End position of feature
Strand Strand of feature
PositionType Type of position defined in Genbank feature {intergenic, upstream, downstream, CDS, rRNA, tRNA, ncRNA, etc.}
LocusTag Locus tag from Genbank file
GeneName Gene name from Genbank file
Product Product from Genbank file
Note Note qualifier from Genbank file
NumZeroCoverage Number of samples whose coverage of the gene is zero (deleted gene)
SampleName The fraction of bases covered by reads in each gene. 1 means the gene is fully covered by reads. 0 means the gene is not covered by any reads, so the gene appears to be deleted.
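As an illustration of how this table can be used, the sketch below lists genes that have zero coverage in one sample but are covered in the parental sample, which is the pattern expected for a gene deletion. The header is assumed to follow the column names above, with one column per sample; the locus tags, gene names and coverage values are invented for the example.

```ruby
require "csv"

# Illustrative rows only: locus tags, gene names and coverage values are made up.
csv_text = <<~CSV
  LocusTag,GeneName,NumZeroCoverage,wt,mut1
  TAG_0001,geneA,0,1.0,1.0
  TAG_0002,geneB,1,1.0,0.0
  TAG_0003,geneC,0,1.0,0.45
CSV

rows = CSV.parse(csv_text, headers: true)
parent = "wt"
sample = "mut1"

# Genes with no read coverage in the sample but some coverage in the parent.
deleted = rows.select { |r| r[sample].to_f == 0.0 && r[parent].to_f > 0.0 }
              .map { |r| r["LocusTag"] }

puts "candidate deletions in #{sample}: #{deleted.join(', ')}"
```

Partially covered genes (geneC here) are left out on purpose; whether they deserve a closer look depends on the experiment.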

Output file 5 (summary of number of variants in each sample)

The fifth file is a summary table of the number of variants in each sample. It summarizes how many homozygous or heterozygous variants (snp/insertion/deletion) affect the various genome features.

The term "upstream" is defined as the region within 100 bp of a translation start codon, and "downstream" as the region within 50 bp of a stop codon. The --upstream_len and --downstream_len options can be used to adjust these lengths. For each sample, the first column, named “-all”, gives the number of all variants, and the second column, named “-sub”, gives the number of variants remaining after subtraction of the parental sample, which is specified as the first sample in ‘varlist.txt’.
Whether a given variant is classified as homozygous or heterozygous depends on the fraction of reads containing the variant call. The --het_min_ratioAllele and --hom_min_ratioAllele options can be used to adjust these thresholds.
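The threshold-based classification can be pictured with a small Ruby sketch. This is an assumption about how two such thresholds would typically be applied, not COVA's actual code: the default values below are placeholders chosen for illustration, and the handling of variants below the heterozygous threshold is a guess.

```ruby
# Placeholder thresholds for illustration only; COVA's real defaults are not stated here.
HET_MIN_RATIO = 0.2
HOM_MIN_RATIO = 0.8

# Classify a variant from the fraction of reads supporting the variant allele
# (the RatioAllele column). Hypothetical helper, not part of COVA.
def zygosity(ratio_allele, het_min: HET_MIN_RATIO, hom_min: HOM_MIN_RATIO)
  return :homozygous if ratio_allele >= hom_min
  return :heterozygous if ratio_allele >= het_min
  :below_threshold
end

[0.92, 0.5, 0.1].each do |ratio|
  puts "#{ratio} -> #{zygosity(ratio)}"
end
```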

The format is comma-separated columns:

Column Notes
hom-snp The number of homozygous snps.
het-snp The number of heterozygous snps.
hom-ins The number of homozygous insertions.
het-ins The number of heterozygous insertions.
hom-del The number of homozygous deletions.
het-del The number of heterozygous deletions.
hom-intergenic The number of homozygous variations in intergenic.
het-intergenic The number of heterozygous variations in intergenic.
hom-upstream The number of homozygous variations in upstream.
het-upstream The number of heterozygous variations in upstream.
hom-downstream The number of homozygous variations in downstream.
het-downstream The number of heterozygous variations in downstream.
hom-synonymous The number of homozygous synonymous variations in CDS.
het-synonymous The number of heterozygous synonymous variations in CDS.
hom-nonsynonymous The number of homozygous nonsynonymous variations (snp or frame-shift insertion or deletion) in CDS.
het-nonsynonymous The number of heterozygous nonsynonymous variations (snp or frame-shift insertion or deletion) in CDS.