Welcome to the LoFreq wiki!
LoFreq is a fast and sensitive variant-caller for inferring single-nucleotide variants (SNVs) from high-throughput sequencing data. It is designed to robustly call low-frequency variants by exploiting base-call quality values. LoFreq has been used to call rare variants in viral and bacterial sequencing datasets and can be used to study mitochondrial heteroplasmy and rare somatic mutations in heterogeneous tumors.
LoFreq makes full use of base-call qualities (and, in versions >0.5, read mapping qualities), which other methods usually ignore or use only for filtering. It is very sensitive; most notably, it can predict variants at frequencies below the average sequencing error rate implied by the base-call qualities. Each SNV call is assigned a p-value, which allows for rigorous false-positive control. Even though it uses no approximations or heuristics, it is very efficient due to several runtime optimizations. LoFreq is generic and fast enough to be applied to high-coverage data and large genomes: it takes a minute to analyze Dengue genome sequencing data with nearly 4000X coverage, roughly one hour to call SNVs on a 600X coverage E. coli genome and 1.5 hours to run on a 100X coverage human exome dataset.
For more details see: Andreas Wilm, Pauline Poh Kim Aw, Denis Bertrand, Grace Hui Ting Yeo, Swee Hoe Ong, Chang Hua Wong, Chiea Chuen Khor, Rosemary Petric, Martin Lloyd Hibberd and Niranjan Nagarajan. LoFreq: A sequence-quality aware, ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput sequencing datasets. Nucleic Acids Res. 2012; 40(22):11189-201 (www.ncbi.nlm.nih.gov).
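To make the idea of a quality-aware p-value more concrete, here is a minimal sketch. It assumes a Poisson-binomial model in which every base in an alignment column is an independent trial whose error probability comes from its Phred quality, and it conservatively counts every error towards the same alternative base. The function names are made up for this illustration and this is not LoFreq's actual code.

```python
def phred_to_error_prob(q):
    """Convert a Phred base-call quality into an error probability."""
    return 10.0 ** (-q / 10.0)


def snv_pvalue(base_quals, num_variant_bases):
    """Probability of seeing at least num_variant_bases sequencing errors.

    Assumption (illustration only): errors are independent and each base
    errs with the probability implied by its Phred quality.
    """
    k_max = num_variant_bases
    if k_max <= 0:
        return 1.0
    # dist[k] = probability of exactly k errors so far (k < k_max);
    # dist[k_max] accumulates the probability of k_max or more errors.
    dist = [1.0] + [0.0] * k_max
    for q in base_quals:
        p = phred_to_error_prob(q)
        # Update from high to low so dist[k - 1] is still the old value.
        dist[k_max] += dist[k_max - 1] * p
        for k in range(k_max - 1, 0, -1):
            dist[k] = dist[k] * (1.0 - p) + dist[k - 1] * p
        dist[0] *= (1.0 - p)
    return dist[k_max]


# Example: 4000 bases of quality Q30, 15 of which differ from the reference.
print(snv_pvalue([30] * 4000, 15))
```

A small p-value means the observed number of variant bases is unlikely to be explained by sequencing errors alone.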
You will need a C compiler, Python 2.7 (including the developer files, i.e. headers etc.) and the zlib developer files (all of which are probably already installed on your system). Download the source for the latest LoFreq distribution (using the development version in Git is not recommended), unpack it and change the working directory to the newly created directory. Assuming you have admin rights, use the following to compile and install LoFreq:
./configure
make install
If you don't have admin rights or want to install LoFreq to a non-standard directory, use the --prefix flag to configure, e.g.
./configure --prefix $HOME/local/
make install
and make sure the corresponding installation sub-directory is in your PYTHONPATH.
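If you are unsure whether the installation directory is visible to Python, a small check like the following can help. It is only a sketch: the sub-directory layout `<prefix>/lib/python2.7/site-packages` is an assumption about a typical prefix install and may differ on your system, and both helper names are made up for this example.

```python
import os
import sys


def expected_site_packages(prefix, version="2.7"):
    """Assumed layout of a prefix install: <prefix>/lib/python<version>/site-packages."""
    return os.path.join(prefix, "lib", "python" + version, "site-packages")


def is_on_path(directory, search_path=None):
    """Return True if directory is part of search_path (default: sys.path)."""
    if search_path is None:
        search_path = sys.path
    target = os.path.normpath(os.path.expanduser(directory))
    return any(os.path.normpath(os.path.expanduser(p)) == target
               for p in search_path if p)


# Example: check the directory that --prefix $HOME/local/ would (presumably) use.
candidate = expected_site_packages(os.path.expanduser("~/local"))
print(candidate, is_on_path(candidate))
```

If the check prints False, add the directory to PYTHONPATH in your shell startup file and start a new shell.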
The following describes the usage of the most recent version of LoFreq, which is 0.5.0. The following older versions of this document are available:
LoFreq takes a read mapping in BAM format as input. It's a good idea to be as stringent as possible with your mapping and to recalibrate base-call qualities as well. See the end of this document for recipes and scripts to achieve this (section 'best practices...'). A simple LoFreq call would look like this:
lofreq_snpcaller.py -f ref.fa -b mapping.bam -o raw-snv-output-file
If you want to limit the analysis to certain regions and have those described in a bed-file, then add -l bed-file. In almost all cases you will want to post-process the predicted SNV calls by applying some filtering criteria, for which you should use lofreq_filter.py (see section on filtering below).
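If you drive LoFreq from a script, the call above can be assembled programmatically. The sketch below only uses the options mentioned in this document (-f, -b, -o and -l); the helper name and the file names are placeholders.

```python
def build_snpcaller_command(ref_fasta, bam, output, bed=None):
    """Build the argument list for a basic lofreq_snpcaller.py call."""
    cmd = ["lofreq_snpcaller.py", "-f", ref_fasta, "-b", bam, "-o", output]
    if bed is not None:
        # Limit the analysis to the regions described in a bed-file.
        cmd.extend(["-l", bed])
    return cmd


cmd = build_snpcaller_command("ref.fa", "mapping.bam", "raw-snv-output-file",
                              bed="regions.bed")
print(" ".join(cmd))
# To actually run it: import subprocess; subprocess.check_call(cmd)
```

Passing the arguments as a list (rather than one shell string) avoids quoting problems with unusual file names.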
Please note:
LoFreq can also write its output in vcf format (--format vcf). However, some LoFreq scripts, most importantly lofreq_filter.py, do not work with this format at the moment. We are currently migrating to vcf as default.
LoFreq distinguishes consensus variants (type:consensus-var) and low-frequency variants (type:low-freq-var). Consensus variants are majority/consensus changes with respect to the reference and do not have quality values assigned (LoFreq is not meant to be a genotyping program). Low-frequency variants arise from subpopulations or non-dominant variants and by definition have an abundance/frequency of <50%. These are the ones LoFreq was designed to predict.
lofreq_snpcaller.py -h prints the full help if you need more advanced control.
Some of the more important options are described in the following:
The analysis can be limited to the regions listed in a bed-file with -l. This is useful, for example, for exome sequencing.
A Bonferroni correction is applied automatically by default (see the --bonf option), so there should be no need for phred-value/SNP-quality based filtering afterwards.
[...] -Q. The default is 3, which is in accordance with Illumina guidelines.
[...] (--lofreq-nq-on --lofreq-q-off).
[...] (-E) by default. You can influence this with the --baq option (not recommended, unless you know exactly what you are doing).
Use lofreq_filter.py to filter SNV predictions produced by LoFreq. The two most highly recommended filter options are:
a minimum coverage filter (--min-cov) and
a strand-bias filter (--strandbias-bonf or --strandbias-holmbonf; the latter is recommended).
An example call with recommended settings would look like this:
lofreq_filter.py --strandbias-holmbonf --min-cov 10 \
-i raw-snv-file -o filtered-snv-file
Note that SNV quality filtering (--snp-phred) is largely unnecessary if the default and automatic Bonferroni correction was used during SNV calling.
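The difference between a plain Bonferroni and a Holm-Bonferroni correction, as named in the strand-bias options above, can be sketched as follows. This is a textbook illustration of the two procedures and not necessarily how lofreq_filter.py implements them; the function names and p-values are made up. In a strand-bias filter, a rejected null hypothesis would mean significant strand bias, i.e. a SNV to be filtered out.

```python
def bonferroni_reject(pvalues, alpha=0.05):
    """Reject each hypothesis whose p-value is at most alpha / number of tests."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]


def holm_bonferroni_reject(pvalues, alpha=0.05):
    """Holm's step-down procedure: compare sorted p-values to growing thresholds."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The smallest p-value is tested against alpha/m, the next against
        # alpha/(m-1), and so on; stop at the first one that fails.
        if pvalues[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break
    return reject


strand_bias_pvalues = [0.01, 0.04, 0.02, 0.005]
print(bonferroni_reject(strand_bias_pvalues))
print(holm_bonferroni_reject(strand_bias_pvalues))
```

Holm's procedure controls the same family-wise error rate as plain Bonferroni but never rejects fewer hypotheses.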
SNVs called in only one sample (e.g. cancer) but not in a paired sample (e.g. blood) can either be biologically interesting or simply due to low coverage in the other sample. You can use lofreq_uniq.py to check whether a call made in only one sample can be explained simply by low coverage in the other (e.g. blood). lofreq_uniq.py takes as minimal input a file listing SNVs predicted in only one sample (see also lofreq_diff.py) and the other sample's BAM file. LoFreq comes with a script that automatically calls SNVs, filters them and finally derives unique SNVs (lofreq_uniq_pipeline.py). An example call looks like this:
lofreq_uniq_pipeline.py --bam1 first.bam --bam2 second.bam \
--ref ref.fa --bed regions.bed -o output-dir
This pipeline requires a bed-file describing the regions of interest to calculate a Bonferroni factor automatically. You can derive a template for such a file using lofreq_regionbed.py. Output files can be found in output-dir.
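The reasoning behind "unique" SNVs can be illustrated with a simple binomial test. This is a sketch under our own assumptions and not necessarily what lofreq_uniq.py computes: if the variant were present in the second sample at the frequency seen in the first, how likely is it to observe so few variant bases at the second sample's coverage? The function names are made up.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), with terms computed in log space."""
    if p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 1.0 if k >= n else 0.0
    total = 0.0
    for i in range(0, min(k, n) + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1)
                    - math.lgamma(n - i + 1)
                    + i * math.log(p) + (n - i) * math.log(1.0 - p))
        total += math.exp(log_term)
    return min(total, 1.0)


def absence_pvalue(freq_in_first, cov_in_second, var_count_in_second=0):
    """Chance of seeing at most var_count_in_second variant bases in the
    second sample if the variant were present there at freq_in_first."""
    return binom_cdf(var_count_in_second, cov_in_second, freq_in_first)


# A 10% variant is easily missed at 10X coverage, but hardly at 100X.
print(absence_pvalue(0.10, 10))
print(absence_pvalue(0.10, 100))
```

A large value means the absence in the second sample is compatible with low coverage alone, while a small value suggests the SNV is genuinely specific to the first sample.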
Especially for low-frequency SNV calling, it's best to use very stringent mapping criteria. Only keep properly aligned reads and only allow uniquely mapped reads. LoFreq comes with a script called bwa_unique.sh which will help you to create a unique mapping of your reads to a genome with BWA (Li & Durbin, 2009).
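As an illustration of what such stringent criteria can mean at the level of individual alignments, the sketch below decides from the SAM FLAG and MAPQ fields whether a read would be kept. The flag bits are those defined in the SAM specification; treating MAPQ 0 as "not uniquely mapped" is an assumption based on common BWA behaviour, and this is not a description of what bwa_unique.sh does.

```python
# FLAG bits as defined in the SAM specification.
FLAG_PAIRED = 0x1
FLAG_PROPER_PAIR = 0x2
FLAG_UNMAPPED = 0x4
FLAG_SECONDARY = 0x100
FLAG_SUPPLEMENTARY = 0x800


def keep_read(flag, mapq, min_mapq=1):
    """Return True for mapped, primary, properly paired, non-ambiguous reads."""
    if flag & (FLAG_UNMAPPED | FLAG_SECONDARY | FLAG_SUPPLEMENTARY):
        return False
    # Paired-end reads must be flagged as a proper pair.
    if (flag & FLAG_PAIRED) and not (flag & FLAG_PROPER_PAIR):
        return False
    # Assumption: MAPQ 0 marks reads that map equally well to several places.
    return mapq >= min_mapq


print(keep_read(99, 60))   # paired, proper pair, first in pair
print(keep_read(65, 60))   # paired, but not a proper pair
print(keep_read(99, 0))    # proper pair, but ambiguous placement
```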
We also highly recommend performing a base-call quality recalibration on your input. Spurious SNVs will otherwise be likely. For recalibration you can use GATK (McKenna et al., 2010) (www.broadinstitute.org). Since GATK's usage is rather cumbersome, especially for non-human data, LoFreq provides you with a wrapper script called base_qual_calib_wrapper.sh (based on GATK version 2). This will create a fake-vcf file of 'known' SNVs (needed for GATK) and execute the necessary GATK2 commands for recalibration. For human data you are better off providing the script with a 'real' vcf file from e.g. dbSNP.
If your data was heavily PCR-amplified, LoFreq will likely call SNVs in the primer regions as well, due to ambiguous primer positions, primer impurities etc. You should ignore primer positions after SNV calling.
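Ignoring primer positions can be as simple as dropping every call that falls into a known primer interval. The sketch below assumes you already have the SNV positions and the primer regions as 1-based, inclusive coordinates on the same reference; the function name and the data are made up for this example.

```python
def drop_primer_snvs(snv_positions, primer_regions):
    """Return the SNV positions that do not fall into any primer region.

    primer_regions is a list of (start, end) tuples, 1-based and inclusive.
    """
    def in_primer(pos):
        return any(start <= pos <= end for start, end in primer_regions)

    return [pos for pos in snv_positions if not in_primer(pos)]


primers = [(1, 20), (990, 1010)]
print(drop_primer_snvs([5, 25, 120, 1000], primers))
```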
Another problem with heavily PCRed input is that PCR artifacts might show up as low-frequency SNVs, especially if a mis-amplification happened in early cycles. No tool can or will ever be able to distinguish these from true low-frequency variants, since the mis-incorporated bases look real to the sequencing machine.
We've also observed an unexpectedly high number of low-frequency SNVs in single-cell, MDA-amplified data.
Publication: Please cite Wilm et al. LoFreq: A sequence-quality aware, ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput sequencing datasets. Nucleic Acids Res. 2012; 40(22):11189-201 (www.ncbi.nlm.nih.gov).
LoFreq was developed at the Genome Institute of Singapore.
Please feel free to contact us if you find bugs, have suggestions, need help etc. Use the discussion forum, the mailing-list or simply mail us directly.