Bamformatics Wiki

Toolkit and GUI for sequencing data analysis

Brought to you by: tkonopka

Variants

The Bamformatics toolkit contains several programs related to the identification and characterization of genetic variants in biological samples.

Calling

Calling of variants starting from an alignment (bam) file is performed using the command

java -jar Bamformatics.jar callvariants
--bam alignment.bam --output mycalls.vcf.gz options

The last field, options, refers to details/thresholds used during variant calling. These options can be skipped as long as a [default] reference genome has been set. To see a complete list of the available options, run the callvariants tool without any arguments.

By default, the caller reports variants in moderately-to-well mappable regions and requires, among other criteria, several high-quality reads supporting the variant and low strand bias. The caller handles several idiosyncrasies such as overlapping reads. It does not assume a ploidy model, so it can be used to process data from haploid, diploid, or polyploid samples, including mixtures. During indel calling, the caller performs some local realignment. The output is compatible with the vcf format.

Comment: Several other variant calling programs are available (see [Resources]). However, the Bamformatics caller includes some interesting and distinguishing features (see [Features]).

Comment: The variant call quality scores output by this program are positive real numbers, with larger values indicating higher call confidence. However, the values are not phred representations of p-values and should not be interpreted as such.

Comment: The GT field in the last column of the vcf output is indicative of whether a variant is hetero- or homo-zygous. However, no attempt is made to estimate the actual variant haplotype, i.e. parental origin of heterozygous variants.
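As an illustration of reading the GT field, the following Python sketch classifies a genotype string. It assumes the standard vcf encoding, in which allele indices are separated by "/" or "|" and 0 denotes the reference allele. The function name zygosity and its labels are hypothetical and not part of Bamformatics, and the exact GT values that the caller writes are not shown on this page.

```python
import re


def zygosity(gt: str) -> str:
    """Classify a vcf GT value such as '0/1' or '1|1' (standard vcf encoding assumed)."""
    alleles = re.split(r"[/|]", gt)
    if "." in alleles:
        return "missing"          # at least one allele was not called
    called = set(alleles)
    if called == {"0"}:
        return "hom-ref"          # only the reference allele is present
    if len(called) == 1:
        return "hom-alt"          # a single non-reference allele
    return "het"                  # two or more distinct alleles


if __name__ == "__main__":
    for gt in ("0/1", "1/1", "./."):
        print(gt, zygosity(gt))
```

Note that the classification says nothing about phase or parental origin, consistent with the comment above.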

Comment: The output is automatically compressed if the output file name ends in .gz or .bz2.
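For follow-up scripts that read such output, a minimal Python sketch of the same extension-based convention might look as follows. The helper name open_text is hypothetical. It uses only the Python standard library and does not reflect how Bamformatics itself is implemented.

```python
import bz2
import gzip


def open_text(path: str, mode: str = "rt"):
    """Open a plain, .gz, or .bz2 file in text mode, chosen by file extension."""
    if path.endswith(".gz"):
        return gzip.open(path, mode)
    if path.endswith(".bz2"):
        return bz2.open(path, mode)
    return open(path, mode)
```

A call such as open_text("mycalls.vcf.gz") then yields lines of text regardless of which compression was used.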

Annotation

Output obtained from the calling procedure is independent of any database, e.g. dbSNP. To incorporate such annotations into a table of called variants, use the command

java -jar Bamformatics.jar annotatevariants
--vcf mycalls.vcf.gz --output annotatedcalls.vcf.gz --database /path/to/dbSNP.vcf.gz

Comment: This command replaces existing ID and INFO columns with values from the annotation database.

Comment: In cases where a position in the variant table matches a position in the database, but the variant information does not (e.g. replacement A to G in vcf file vs. A to C in database), the ID code of the variant is put in square brackets (e.g. [rs00000]).
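The bracket convention described above can be sketched in Python as follows. This is an illustration of the rule only, not the program's actual code. The function annotate_id, the dictionary layout of the database, and the use of "." for variants with no database entry are all assumptions made for the example.

```python
def annotate_id(variant, database):
    """Return an ID for a called variant.

    variant:  (chrom, pos, ref, alt)
    database: dict mapping (chrom, pos) to a list of (ref, alt, rsid) entries
    """
    chrom, pos, ref, alt = variant
    entries = database.get((chrom, pos), [])
    for db_ref, db_alt, rsid in entries:
        if (db_ref, db_alt) == (ref, alt):
            return rsid                     # position and alleles both match
    if entries:
        return "[" + entries[0][2] + "]"    # position matches, alleles differ
    return "."                              # assumed placeholder: no entry at this position


if __name__ == "__main__":
    db = {("chr1", 100): [("A", "C", "rs00000")]}
    print(annotate_id(("chr1", 100, "A", "C"), db))
    print(annotate_id(("chr1", 100, "A", "G"), db))
```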

Filtering

Output obtained from the calling procedure should be considered in raw form and contains “.” in the FILTER column. The raw calls can be filtered in two distinct ways, by key/threshold pair or by genomic region.

To filter by a key/threshold pair, use a command such as

java -jar Bamformatics.jar filtervariants
--vcf calls.vcf.gz --output calls.filtered.vcf.gz 
--filter strand --key "SF>12"

Here the string strand is the label written to the FILTER column to identify the relevant variants. The string passed to the key option instructs the program to detect variants for which the value of SF is greater than 12. In this particular example, the program flags variants for which the strand Fisher test gives a p-value with phred score above 12.
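To relate the threshold to a p-value, the Python sketch below converts between the two scales. It assumes the conventional phred definition Q = -10 log10(p), under which a score above 12 corresponds to a p-value below roughly 0.063. The function names are hypothetical.

```python
import math


def phred_to_pvalue(q: float) -> float:
    """Convert a phred-scaled score to a p-value, assuming Q = -10 * log10(p)."""
    return 10 ** (-q / 10)


def pvalue_to_phred(p: float) -> float:
    """Convert a p-value to its phred-scaled score."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    print(round(phred_to_pvalue(12), 4))  # 0.0631
```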

To filter by genomic region, use a command such as

java -jar Bamformatics.jar filtervariants
--vcf calls.vcf.gz --output calls.filtered.vcf.gz
--filter repeats --bed /path/to/bed/repeats.bed.gz

Here, the last argument should be a definition of a genomic region in bed format. In this particular example, the aim is to flag variants in repetitive regions of the genome using the string “repeats”.
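When checking region filters by hand, it helps to remember that the two formats count positions differently: bed intervals are 0-based and half-open, whereas vcf positions are 1-based. The Python sketch below shows the resulting overlap test under those format conventions. The function names are hypothetical, and this page does not state how Bamformatics itself handles interval boundaries.

```python
def parse_bed_line(line: str):
    """Return (chrom, start, end) from the first three tab-separated bed columns."""
    chrom, start, end = line.rstrip("\n").split("\t")[:3]
    return chrom, int(start), int(end)


def in_bed_interval(vcf_pos: int, bed_start: int, bed_end: int) -> bool:
    """True if a 1-based vcf position lies inside a 0-based, half-open bed interval."""
    return bed_start <= vcf_pos - 1 < bed_end


if __name__ == "__main__":
    chrom, start, end = parse_bed_line("chr1\t100\t200\trepeat")
    # The interval covers 1-based positions 101 through 200.
    print(in_bed_interval(100, start, end), in_bed_interval(101, start, end))
```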

Comment: Multiple filters can be applied to a vcf file simultaneously – just specify several filter/key or filter/bed arguments in the same command. Make sure, however, to specify names of key-based filters before region-based filters.

Variant details

For custom post-processing of variant calls, a separate program can extract more information about variants than is encoded in the vcf file. In particular, this program can make explicit the read counts used in determining variant calls/qualities.

To obtain tables with variant details, use

java -jar Bamformatics.jar variantdetails
--output /path/to/details
--label mysample --vcf mycalls.vcf.gz --bam myalignment.bam
options

Here, the string /path/to/details is interpreted as a prefix for output files, which will include a log file, a summary file, and two tables, one for single-base substitution variants and one for indels. The vcf and bam arguments should refer to matching call and alignment files, and the label field should be a short string describing the sample – it will be used to identify columns in the output tables.

The options field refers to the same options that are used during variant calling (see the Calling section). For convenience and to promote consistency, [default] values for these options can be set separately.

Comment: The program can merge information about variants from several samples – just specify several label/vcf/bam sets.

The variant details can be used to find differences between related samples, for example somatic mutations. Because the data are presented in a plain tabular format, all control over mutation calling and scoring is left to the follow-up analysis. For example, this R function scores changes between samples.

Wiki: Features
Wiki: Home
Wiki: Resources
Wiki: default
