LoFreq Wiki

Fast and sensitive variant-calling from sequencing data

Brought to you by: nnnagara, onde

Home

Authors:

There is a newer version of this page. You can find it here.

Welcome to the LoFreq wiki!

Introduction

LoFreq is a fast and sensitive variant-caller for inferring single-nucleotide variants (SNVs) from high-throughput sequencing data. It is designed to robustly call low-frequency variants by exploiting base-call quality values. LoFreq has been used to call rare variants in viral and bacterial sequencing datasets and can be used to study mitochondrial heteroplasmy and rare somatic mutations in heterogeneous tumors.

LoFreq makes full use of base-call qualities, which are usually ignored by other methods or only used for filtering. It is very sensitive; most notably, it is able to predict variants below the average base-call quality (i.e. sequencing error rate). Each SNV call is assigned a p-value which allows for rigorous false positive control. Even though it uses no approximations or heuristics, it is very efficient due to several runtime optimizations. LoFreq is generic and fast enough to be applied to high-coverage data and large genomes. It takes a minute to analyze Dengue genome sequencing data with nearly 4000X coverage, roughly one hour to call SNVs on a 600X coverage E.coli genome and 1.5 hours to run on a 100X coverage human exome dataset.

For more details see: Andreas Wilm, Pauline Poh Kim Aw, Denis Bertrand, Grace Hui Ting Yeo,
Swee Hoe Ong, Chang Hua Wong, Chiea Chuen Khor, Rosemary Petric, Martin Lloyd Hibberd and Niranjan Nagarajan (2012). LoFreq: A sequence-quality aware, ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput sequencing datasets. Nucleic Acids Res. (Advance access).

Installation

You will need a C compiler and Python 2.6 or 2.7 (including the developer files, i.e. headers etc). Download the source, unpack it and change the working directory to the newly created directory. Assuming you have admin rights, use the following to install LoFreq:

python setup.py install

If you dont't have admin rights or want to install to a non-standard directory use the --prefix flag, e.g.

python setup.py install --prefix $HOME/local/

and make sure the corresponding installation sub-directory is in your PYTHONPATH.

The will list all available options:

python setup.py --help

Note, that you will also need to install samtools to be able to actually use LoFreq.

Usage

The following describes the usage of the most recent version of LoFreq, which is 0.3.2. Older versions of this document are available:

[usage-version-0.3.1]

SNV calling with LoFreq

LoFreq takes a read mapping as input (see the end of this document for notes on short-read mapping). It's a good idea to be as stringent as possible with your mapping and to recalibrate base-call qualities as well. A simple LoFreq call would look like this:

lofreq_snpcaller.py -f ref.fa -b mapping.bam -o raw-snv-output-file

If you want to limit the analysis to certain regions and have those described in a bed-file, then add -l bed-file. In almost all cases you will want to post-process the predicted SNV calls by applying some filtering criteria, for which you should use lofreq_filter.py (see below).

Please note:

The default output format ('snp') is a simple csv-file: the 1st column gives the chromosome, 2nd column: SNV position, 3rd column: SNV-type, 4th column: frequency and the 5th column contains additional information about the SNV call. Alternatively, you produce output in vcf-format (--format vcf), however, some LoFreq scripts do not work with this format at the moment. We are currently migrating to vcf as default.
LoFreq will distinguish between consensus-variants (type:consensus-var) and low-frequency variants (type:low-freq-var). Consensus variants are majority/consensus changes with respect to the reference and do not have quality values assigned (LoFreq is not meant to be a genotyping program). Low-frequency variants arise from subpopulations or non-dominant variants and by definition have an abundance/frequency of <50%.

Options for SNV Calling

lofreq_snpcaller.py -h prints the full help.
Some of the more important options are described in the following:

Regions: If you want to limit the analysis to certain regions then you can pass a bed-file describing those regions to LoFreq with the option -l.
Multiple testing correction: LoFreq calculates a p-value for each SNV call. Multiple testing correction (Bonferroni) is performed automatically by default (using genome-size by 3; see --bonf option). If you expect lots of zero-coverage columns and don't have a bed-file to filter those, then use --bonf auto-ign-zero-cov (available from version 0.4.0 onwards)
Base-call qualities:
- LoFreq uses samtools to read your data. Samtools can correct base-call qualities if indel-errors are likely. This makes sense if you allowed for indels during mapping (default in e.g. BWA). LoFreq uses sensitive BAQ (-E) by default. You can influence this with the --baq option.
- To ignore any base below a certain quality use the option -Q. The default is 3, which is in accordance with Illumina guidelines.
- If your BAM file does not contain base-call quality values or if they are meaningless, you can try LoFreq's quality agnostic SNV calling module (option: --lofreq-nq-on --lofreq-q-off).

SNV filtering

Use lofreq_filter.py to filter SNV predictions produced by LoFreq. The three most common filter options would be:

minimum coverage (--min-cov) and
strand-bias (either --strandbias-bonf or --strandbias-holmbonf; the latter is recommended),
SNV quality (--snp-phred)

An example call would look like this:

lofreq_filter.py --strandbias-holmbonf --min-cov 10 \
    -i raw-snv-file -o filtered-snv-file

Note that SNV quality filtering is largely unnecessary if you used the automatic Bonferroni correction during SNV calling (which is the default).

Somatic SNV Calls / Unique SNV Calls in Sample Pairs

SNVs only called in one sample (e.g. cancer) but not in another paired sample (e.g. blood), can either be biologically interesting or simply due to low coverage in one sample. You can use lofreq_uniq.py to find out whether a call made only in one sample cannot be simply explained by the low coverage in the other (e.g. blood). lofreq_uniq.py takes as minimal input a file listing SNVs predicted in only one sample (see also lofreq_diff.py) and the other sample's BAM file. LoFreq comes with a script that automatically calls SNVs, filters them and finally derives unique SNVs (lofreq_uniq_pipeline.py). An example call looks like this:

lofreq_uniq_pipeline.py --bam1 first.bam --bam2 second.bam \
    --ref ref.fa --bed regions.bed -o output-dir

This pipeline requires a bed-file describing the regions of interest to calculate a Bonferroni factor automatically. You can derive a template for such a file using lofreq_regionbed.py. Output files can be found in output-dir.

Notes on Short Read Mapping

Especially for low frequency SNV calling its best to use very stringent mapping criteria. Only keep properly aligned reads and only allow uniquely mapped reads. To remove non-uniquely mapped reads from BWA output, you can use grep '\(^@\|XT:A:U\)'. To keep only properly aligned reads you should pre-process your BAM file and remove dodgy reads for example with samtools view -F 0x700 -b -o out.bam in.bam.

It's also a good idea to perform a base-call quality calibration on your input, e.g. using GATK. GATK requires a list of known SNVs as input which is not available for most organisms. As an alternative you can simply use a list of all positions above a certain variation threshold (try lofreq_varpos_to_vcf.py). It is also advisable to use GATK for indel realignment if indels are to be expected.

A generic recipe (from reads to SNV calls) would look like this:

Map your reads stringently to the reference genome, e.g. by means of BWA and keep only properly and uniquely mapped reads (anything else will result in mapping artefacts and false positive SNV calls)
Recalibrate base-call error rates with GATK and realign indels.
Predict SNVs with LoFreq (lofreq_snpcaller.py; see above)
Filter your predictions, e.g. require a minimum coverage and remove positions with strand-bias (lofreq_filter.py; see above).

Breseq: full-blown pipeline of which polymorphism-prediction is just a part
SNVer: low frequency SNV calling based on a frequentist approach
V-Phaser: based on similar ideas. Uses phasing for enhanced sensitivity. Focus on viral 454 data.

About

Publication: Please cite Andreas Wilm, Pauline Poh Kim Aw, Denis Bertrand, Grace Hui Ting Yeo,
Swee Hoe Ong, Chang Hua Wong, Chiea Chuen Khor, Rosemary Petric,
Martin Lloyd Hibberd and Niranjan Nagarajan (2012). LoFreq: A sequence-quality aware, ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput sequencing datasets. Nucleic Acids Res. (Advance access).
LoFreq was developed in the Genome Institute of Singapore
Sourceforge Admins:

Project Admins:
- Andreas Wilm
- Niranjan Nagarajan

Please feel free to contact us if you find bugs, have suggestions, need help etc. Use the discussion forum, the mailing-list or simply mail us directly.