LoFreq Wiki

Fast and sensitive variant-calling from sequencing data

Brought to you by: nnnagara, onde

There is a newer version of this page. You can find it here.

Welcome to the LoFreq wiki!

Introduction

LoFreq is a fast and sensitive variant-caller for inferring single-nucleotide variants (SNVs) from high-throughput sequencing data. It is designed to robustly call low-frequency variants by exploiting base-call quality values. LoFreq has been used to call rare variants in viral and bacterial sequencing datasets and can be used to study mitochondrial heteroplasmy and rare somatic mutations in heterogeneous tumors.

LoFreq makes full use of base-call qualities (and, in versions >0.5, read mapping qualities as well), which are usually ignored by other methods or only used for filtering. It is very sensitive; most notably, it is able to predict variants at frequencies below the average sequencing error rate implied by the base-call qualities. Each SNV call is assigned a p-value, which allows for rigorous false positive control. Even though it uses no approximations or heuristics, it is very efficient due to several runtime optimizations. LoFreq is generic and fast enough to be applied to high-coverage data and large genomes. It takes a minute to analyze Dengue genome sequencing data with nearly 4000X coverage, roughly one hour to call SNVs on a 600X coverage E. coli genome and 1.5 hours to run on a 100X coverage human exome dataset.
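To make the link between base-call qualities and sequencing error rates concrete, here is a minimal sketch of the standard Phred conversion. It is not taken from the LoFreq source; the function names are made up for illustration.

```python
import math

def phred_to_error_prob(q):
    """Convert a Phred-scaled base-call quality Q into an error probability.

    Standard definition: P(error) = 10 ** (-Q / 10).
    """
    return 10.0 ** (-q / 10.0)

def error_prob_to_phred(p):
    """Inverse conversion: Q = -10 * log10(P(error))."""
    return -10.0 * math.log10(p)

def mean_error_rate(quals):
    """Average error probability over a list of Phred qualities.

    Note that this averages probabilities, not Phred scores.
    """
    return sum(phred_to_error_prob(q) for q in quals) / len(quals)
```

For example, a base with quality 20 has an expected error probability of 0.01 and a base with quality 30 one of 0.001. A variant present at 0.5% in a column of Q20 bases therefore lies below the average error rate, which is the regime LoFreq is designed for.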

For more details see: Andreas Wilm, Pauline Poh Kim Aw, Denis Bertrand, Grace Hui Ting Yeo, Swee Hoe Ong, Chang Hua Wong, Chiea Chuen Khor, Rosemary Petric, Martin Lloyd Hibberd and Niranjan Nagarajan. LoFreq: A sequence-quality aware, ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput sequencing datasets. Nucleic Acids Res. 2012; 40(22):11189-201 (www.ncbi.nlm.nih.gov).

Installation

You will need a C compiler, Python 2.7 (including the developer files, i.e. headers etc.) and the zlib developer files (all of which are probably already installed on your system). Download the source for the latest LoFreq distribution (using the development version from Git is not recommended), unpack it and change the working directory to the newly created directory. Assuming you have admin rights, use the following to compile and install LoFreq:

./configure
make install

If you don't have admin rights or want to install LoFreq to a non-standard directory, use the --prefix flag to configure, e.g.

./configure --prefix $HOME/local/
make install

and make sure the corresponding installation sub-directory is in your PYTHONPATH.

Usage

The following describes the usage of the most recent version of LoFreq, which is 0.5.0.

SNV calling with LoFreq

LoFreq takes a read mapping in BAM format as input. It's a good idea to be as stringent as possible with your mapping and to recalibrate base-call qualities as well. See the end of this document for recipes and scripts to achieve this (section 'best practices...'). A simple LoFreq call would look like this:

lofreq_snpcaller.py -f ref.fa -b mapping.bam -o raw-snv-output-file

If you want to limit the analysis to certain regions and have those described in a bed-file, then add -l bed-file. In almost all cases you will want to post-process the predicted SNV calls by applying some filtering criteria, for which you should use lofreq_filter.py (see section on filtering below).

Please note:

- The default output format ('snp') is a simple csv-file: the 1st column gives the chromosome, the 2nd column the SNV position, the 3rd column the SNV-type, the 4th column the frequency, and the 5th column contains additional information about the SNV call. Alternatively, you can produce output in vcf-format (--format vcf). However, some LoFreq scripts - most importantly lofreq_filter.py - do not work with this format at the moment. We are currently migrating to vcf as default.
- LoFreq will distinguish between consensus-variants (type:consensus-var) and low-frequency variants (type:low-freq-var). Consensus variants are majority/consensus changes with respect to the reference and do not have quality values assigned (LoFreq is not meant to be a genotyping program). Low-frequency variants arise from subpopulations or non-dominant variants and by definition have an abundance/frequency of <50%. These are the ones LoFreq was designed to predict.
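The five-column 'snp' layout described above can be read with a few lines of Python. This is only a sketch under assumptions: the field separator (whitespace by default here, configurable via sep), the frequency being stored as a fraction rather than a percentage, and the toy lines themselves are all made up for illustration and should be checked against your own output files.

```python
from collections import namedtuple

# One record per SNV call, mirroring the five columns described above.
SnvCall = namedtuple("SnvCall", "chrom pos snv_type freq info")

def parse_snp_line(line, sep=None):
    """Split one line of 'snp' output into its five columns.

    sep=None splits on runs of whitespace; pass e.g. "," if your files
    are comma-separated (the separator is an assumption, not verified).
    The 5th column is kept as one raw string.
    """
    chrom, pos, snv_type, freq, info = line.rstrip("\n").split(sep, 4)
    return SnvCall(chrom, int(pos), snv_type, float(freq), info)

def is_low_frequency(call):
    """Low-frequency variants have a frequency below 50% by definition.

    Assumes the frequency column is a fraction between 0 and 1.
    """
    return call.freq < 0.5

# Made-up toy lines, not real LoFreq output.
toy_lines = [
    "chr1 1042 A>G 0.013 toy-info",
    "chr1 2210 C>T 0.970 toy-info",
]
calls = [parse_snp_line(line) for line in toy_lines]
low_freq_calls = [c for c in calls if is_low_frequency(c)]
```

In practice the variant type reported by LoFreq itself (type:consensus-var or type:low-freq-var) is the authoritative label; the frequency check above only illustrates the <50% definition.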

Options for SNV Calling

lofreq_snpcaller.py -h prints the full help if you need more advanced control.
Some of the more important options are described in the following:

Regions: If you want to limit the analysis to certain regions, you can pass a bed-file describing those regions to LoFreq with the option -l. This is useful, for example, for exome sequencing.
Multiple testing correction: LoFreq calculates a p-value for each SNV call. Multiple testing correction (Bonferroni) is performed automatically by default (see --bonf option), so there should be no need for phred-value/SNP-quality based filtering afterwards.
Base-call qualities:
- To ignore any base below a certain quality use the option -Q. The default is 3, which is in accordance with Illumina guidelines.
- If your BAM file does not contain base-call quality values or if they are meaningless, you can try LoFreq's quality-agnostic SNV calling module (options: --lofreq-nq-on --lofreq-q-off).
- LoFreq makes use of samtools' base-call quality correction (BAQ), which is useful if indel-errors are likely. LoFreq uses sensitive BAQ (-E) by default. You can influence this with the --baq option (not recommended, unless you know exactly what you are doing).
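The idea behind the multiple testing correction mentioned above can be sketched as follows. This is an illustration of plain Bonferroni correction, not LoFreq's actual code. It assumes that the number of tests is the number of analysed positions times three possible alternative bases per position, and it uses a hypothetical significance level of 0.05; both are assumptions.

```python
def bonferroni_factor(bed_intervals, alt_bases_per_site=3):
    """Number of tests for a set of BED-style regions.

    bed_intervals: iterable of (chrom, start, end) with 0-based,
    half-open coordinates, so each region covers end - start positions.
    alt_bases_per_site=3 is an assumption about how tests are counted.
    """
    total_sites = sum(end - start for _chrom, start, end in bed_intervals)
    return total_sites * alt_bases_per_site

def passes_bonferroni(pvalue, num_tests, alpha=0.05):
    """Keep a call only if its Bonferroni-corrected p-value is below alpha."""
    return pvalue * num_tests < alpha

# Made-up toy regions: 1000 + 500 positions.
toy_regions = [("chr1", 0, 1000), ("chr1", 5000, 5500)]
factor = bonferroni_factor(toy_regions)
```

With these toy regions the factor is 4500, so a raw p-value of 1e-6 survives the correction while a raw p-value of 1e-4 does not. Restricting the analysis to fewer positions lowers the factor and thereby the burden of the correction.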

SNV filtering

Use lofreq_filter.py to filter SNV predictions produced by LoFreq. The two most highly recommended filter options are:

- minimum coverage (--min-cov) and
- strand-bias (either --strandbias-bonf or --strandbias-holmbonf; the latter is recommended).

An example call with recommended settings would look like this:

lofreq_filter.py --strandbias-holmbonf --min-cov 10 \
    -i raw-snv-file -o filtered-snv-file

Note that SNV quality filtering (--snp-phred) is largely unnecessary if the default and automatic Bonferroni correction was used during SNV calling.
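For readers unfamiliar with the difference between Bonferroni and Holm-Bonferroni correction, the following is a minimal sketch of the generic Holm-Bonferroni step-down procedure. It is not the code used by lofreq_filter.py, and the significance level of 0.05 is a hypothetical choice.

```python
def holm_bonferroni(pvalues, alpha=0.05):
    """Holm-Bonferroni step-down procedure.

    Returns a list of booleans in the input order; True means the
    null hypothesis for that p-value is rejected.
    """
    m = len(pvalues)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (rank = k - 1) is compared
        # against alpha / (m - k + 1).
        if pvalues[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # stop at the first non-rejection
    return reject

def bonferroni(pvalues, alpha=0.05):
    """Plain Bonferroni: every p-value is compared against alpha / m."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]
```

For the p-values [0.001, 0.02, 0.02], plain Bonferroni rejects only the first hypothesis, whereas Holm-Bonferroni rejects all three. Holm-Bonferroni controls the same family-wise error rate but is never less powerful. In a strand-bias filter, a rejected hypothesis would presumably correspond to an SNV flagged as strand-biased.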

Somatic SNV Calls / Unique SNV Calls in Sample Pairs

SNVs called in only one sample (e.g. cancer) but not in a paired sample (e.g. blood) can either be biologically interesting or simply be due to low coverage in the other sample. You can use lofreq_uniq.py to find out whether a call made in only one sample can be explained by low coverage in the other (e.g. blood). As minimal input, lofreq_uniq.py takes a file listing SNVs predicted in only one sample (see also lofreq_diff.py) and the other sample's BAM file. LoFreq comes with a script that automatically calls SNVs, filters them and finally derives unique SNVs (lofreq_uniq_pipeline.py). An example call looks like this:

lofreq_uniq_pipeline.py --bam1 first.bam --bam2 second.bam \
    --ref ref.fa --bed regions.bed -o output-dir

This pipeline requires a bed-file describing the regions of interest to calculate a Bonferroni factor automatically. You can derive a template for such a file using lofreq_regionbed.py. Output files can be found in output-dir.
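The question of whether a missing call can be explained by low coverage can be illustrated with a simple binomial model. This is only a sketch of the reasoning and is not claimed to be what lofreq_uniq.py does internally: it assumes that the variant is present in the second sample at the frequency observed in the first, and that reads are sampled independently.

```python
from math import comb

def prob_at_most_k_variant_reads(k, coverage, freq):
    """P(X <= k) for X ~ Binomial(coverage, freq).

    k:        variant-supporting reads seen in the second sample
    coverage: read depth at that position in the second sample
    freq:     SNV frequency observed in the first sample (0..1)
    """
    return sum(
        comb(coverage, i) * freq ** i * (1.0 - freq) ** (coverage - i)
        for i in range(k + 1)
    )

# A 5% variant seen in sample one, and no variant reads in sample two:
p_low_cov = prob_at_most_k_variant_reads(0, 10, 0.05)    # 10X coverage
p_high_cov = prob_at_most_k_variant_reads(0, 200, 0.05)  # 200X coverage
```

At 10X coverage, seeing no variant reads is quite likely even if the variant is present, so the call cannot be considered unique. At 200X coverage the same observation is very unlikely under this model, which supports the SNV being truly absent from the second sample.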

Best practices for creating a BAM file for LoFreq

Especially for low-frequency SNV calling, it's best to use very stringent mapping criteria. Only keep properly aligned reads and only allow uniquely mapped reads. LoFreq comes with a script called bwa_unique.sh, which will help you to create a unique mapping of your reads to a genome with BWA (Li & Durbin, 2009).

We also highly recommend performing a base-call quality recalibration on your input. Spurious SNVs will otherwise be likely. For recalibration you can use GATK (McKenna et al., 2010) (www.broadinstitute.org). Since GATK's usage is rather cumbersome, especially for non-human data, LoFreq provides you with a wrapper script called base_qual_calib_wrapper.sh (based on GATK version 2). This will create a fake vcf file of 'known' SNVs (needed for GATK) and execute the necessary GATK2 commands for recalibration. For human data you are better off providing the script with a 'real' vcf file from e.g. dbSNP.

Caveats

If your data was heavily PCR-amplified, LoFreq will likely call SNVs in the primer regions as well, due to ambiguous primer positions, primer impurities etc. You should ignore primer positions after SNV calling.

Another problem with heavily PCRed input is that PCR artifacts might show up as low-frequency SNVs, especially if a mis-amplification happened in early cycles. No tool can or will ever be able to distinguish these from true low-frequency variants, since the mis-incorporated bases look real to the sequencing machine.

We've also observed an unexpectedly high number of low-frequency SNVs in single cell MDA amplified data.

Related tools

- Breseq: full-blown pipeline of which polymorphism-prediction is just a part
- SNVer: low-frequency SNV calling based on a frequentist approach
- V-Phaser: based on similar ideas. Uses phasing for enhanced sensitivity. Focus on viral 454 data.

About

Publication: Please cite Wilm et al. LoFreq: A sequence-quality aware, ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput sequencing datasets. Nucleic Acids Res. 2012; 40(22):11189-201 (www.ncbi.nlm.nih.gov).
LoFreq was developed at the Genome Institute of Singapore.

Project Admins:
- Niranjan Nagarajan
- Andreas Wilm

Please feel free to contact us if you find bugs, have suggestions, need help etc. Use the discussion forum, the mailing-list or simply mail us directly.