The toolkit contains a number of miscellaneous programs. This page describes two of them: one for estimating error rates and one for simplifying base quality values.
The errors tool estimates base-calling error rates. The program assumes that variant candidates supported by only a very small proportion of reads in highly covered regions are errors. It can be run using
java -jar Bamformatics.jar errors --bam myalignment.bam [options]
The output is a table printed on-screen with a rate estimate for every base substitution (A->C, A->T, etc.).
Importantly, the program automatically ignores certain reads or portions thereof as specified by the common options. This ensures that the estimated error rates reflect data actually used during variant calling.
Apart from options from the common set, the program can also be tuned via
--maxallelic - only consider sites where the proportion of reads with an alternative base is smaller than a threshold (default 0.03).
--maxerrordepth - only consider sites where the number of reads with an alternative base is smaller than a threshold (default 3).
--notstranded - toggles how errors are computed from reads aligned to the negative strand. By default, errors are reported for substitutions in the read sequence. With this option, errors are reported for substitutions as they would appear in a VCF file.
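The effect of the two thresholds and of the strand option can be illustrated with a minimal Python sketch. This is not the Bamformatics implementation: the function names, the input layout (per-site base counts) and the choice of denominator are assumptions made for illustration. Only the default thresholds (0.03 and 3) and the "smaller than" comparisons come from the option descriptions above.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def estimate_error_rates(sites, maxallelic=0.03, maxerrordepth=3):
    """Estimate substitution rates from per-site base counts.

    `sites` is an iterable of (ref_base, counts) pairs, where `counts`
    maps each observed base to the number of reads showing it.
    (Hypothetical input layout, chosen for this sketch.)
    """
    errors = Counter()  # (ref, alt) -> number of mismatching bases
    totals = Counter()  # ref -> number of bases observed at kept sites
    for ref, counts in sites:
        depth = sum(counts.values())
        if depth == 0:
            continue
        alt_reads = depth - counts.get(ref, 0)
        # Keep only sites where mismatches are rare enough to be treated
        # as errors rather than as real variants.
        if alt_reads / depth >= maxallelic or alt_reads >= maxerrordepth:
            continue
        totals[ref] += depth
        for base, n in counts.items():
            if base != ref:
                errors[(ref, base)] += n
    # Assumed denominator: all bases seen at kept sites with this reference.
    return {sub: n / totals[sub[0]] for sub, n in errors.items()}


def as_read_substitution(ref, alt, negative_strand):
    """Express a forward-strand (VCF-style) substitution in read terms.

    For a read on the negative strand, both bases are complemented.
    This is one reading of the --notstranded description, not a
    confirmed detail of the tool.
    """
    if negative_strand:
        return COMPLEMENT[ref], COMPLEMENT[alt]
    return ref, alt


sites = [
    ("A", {"A": 99, "C": 1}),   # kept: 1% of reads, 1 mismatching read
    ("A", {"A": 50, "C": 50}),  # skipped: looks like a real variant
    ("G", {"G": 200, "T": 2}),  # kept: about 1% of reads, 2 mismatching reads
]
rates = estimate_error_rates(sites)
```

With the defaults, a site can only pass both filters if it is covered by more than 66 reads (at most 2 mismatching reads, making up less than 3% of the total), which is how the thresholds restrict the estimate to highly covered regions.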
Furthermore, the program can be directed to compute error rates on specific genomic regions or sites.
Note: This program is meant to estimate the error rate due to random processes. However, high-throughput data can also contain errors of systematic nature, for example due to sequence patterns around the error locus.
It has been argued that base qualities carry much less information than the range of their values implies (see e.g. Kozanitis et al. (2011) J. Comp. Biol. 18, pp. 401-413). Indeed, binning base qualities into two buckets has been argued to reduce the storage needed for alignment files (lossy compression) while affecting downstream analysis only minimally.
Partly because of these observations, the Bamformatics variant caller does not use base quality values beyond the low/high distinction. Thus, it is possible to simplify the representation of base qualities in alignment files without affecting the variant calls produced by the Bamformatics caller. The adjustment can be run using
java -jar Bamformatics.jar noquals --bam myalignment.bam --output myalignment.noquals.bam [options]
This step can reduce the size of an alignment file by as much as two-fifths (40%).
Note: The removal of quality values from an alignment is irreversible.
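To illustrate why a two-bucket representation compresses better, the following Python sketch bins a quality string and compares compressed sizes. It is only an illustration of the idea, not the noquals implementation: the threshold (20), the two representative values (2 and 40), the Phred+33 text encoding, and the function name are all assumptions, and the input is synthetic.

```python
import random
import zlib


def bin_qualities(qual, threshold=20, low=2, high=40, offset=33):
    """Collapse a Phred+33 quality string into two values (low/high).

    The threshold and the two representative values are assumptions
    made for this sketch.
    """
    low_char, high_char = chr(low + offset), chr(high + offset)
    return "".join(
        high_char if ord(c) - offset >= threshold else low_char for c in qual
    )


# Synthetic qualities, for illustration only (not real sequencing data).
rng = random.Random(0)
original = "".join(chr(rng.randint(2, 40) + 33) for _ in range(10_000))
binned = bin_qualities(original)

# Compare compressed sizes using zlib (the DEFLATE algorithm).
size_before = len(zlib.compress(original.encode("ascii")))
size_after = len(zlib.compress(binned.encode("ascii")))
print(size_before, size_after)
```

Uniformly random qualities compress much worse than real ones, so the ratio printed here should not be compared with the saving quoted above. The sketch also makes the irreversibility concrete: once every value has been mapped to one of two characters, the original values cannot be recovered.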