LMAT Wiki

Efficient taxonomic labeling of very large metagenomic datasets.

Brought to you by: bticktock, dhysom42, lloyd23, shea0, ska777

Example LMAT Run

Overview

This page gives a brief overview of how LMAT is typically run, with example output to illustrate the output formats.

Running LMAT requires as input:

a query file in FASTA format
a reference search database (KmerDB)

The KmerDB is downloadable in various forms via anonymous ftp at:
ftp://gdo-bioinformatics.ucllnl.org/pub/lmat/. Documentation included with the distribution shows how to create a custom database as an alternative. LMAT also uses additional files, which are included in the distribution; these contain data such as the taxonomy tree and the gene label database taken from NCBI.

Users should perform quality control filtering of query files before running LMAT. Since LMAT analyzes the k-mers in each read, N-masking can be used to direct analysis to the portions of each read with higher sequencer quality values.
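To see why N-masking steers a k-mer based analysis, the sketch below enumerates the k-mers of a read and skips every window that contains an N. This is an illustration only: the helper name kmers is hypothetical, the value of k is a free parameter here, and the assumption that windows containing N are ignored is inferred from the paragraph above rather than taken from the LMAT source code.

```shell
# Hypothetical helper (not part of LMAT): list the k-mers of a read,
# skipping any window that contains a masked base (N).
# Usage: kmers SEQUENCE K
kmers() {
  awk -v seq="$1" -v k="$2" 'BEGIN {
    # Slide a window of length k across the read.
    for (i = 1; i + k - 1 <= length(seq); i++) {
      kmer = substr(seq, i, k)
      # Assumption: a window touching an N contributes nothing.
      if (kmer !~ /N/) print kmer
    }
  }'
}

kmers ACGNTTA 3
```

With a single masked base in the middle of the read, only the windows on either side of it survive, so low-quality positions simply drop out of the analysis.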

A typical usage would be:

./run_rl.sh --marker_library=LMAT_MARKER_LIBRARY --db_file=LMAT_DATABASE \
  --query_file=METAGENOMIC_QUERY_FILE

Here the query file (METAGENOMIC_QUERY_FILE) is first searched against a reduced-size database (LMAT_MARKER_LIBRARY) to get a quick answer on taxonomic contents. The query set is then searched against the full database (LMAT_DATABASE) to get a more complete accounting of each read and an improved assessment of sample contents.

The following example, taken from the Human Microbiome Project (HMP), shows the steps needed to process human stool sample SRS049959 (ftp://public-ftp.hmpdacc.org/Illumina/stool/SRS049959.tar.bz2).

First, merge read pairs into a single read (a simple Perl utility script is included in the distribution). LMAT tracks non-redundant k-mers on both strands, which alleviates the need for more involved merging and strand orientation of the read pairs.

merge_fastq_reads_with_N_separator.pl \
  SRS049959.denovo_duplicates_marked.trimmed.1.fastq \
  SRS049959.denovo_duplicates_marked.trimmed.2.fastq \
  SRS049959.denovo_duplicates_marked.trimmed.fastq

Add singleton reads:

cat SRS049959.denovo_duplicates_marked.trimmed.singleton.fastq \
  SRS049959.denovo_duplicates_marked.trimmed.fastq \
  > SRS049959.denovo_duplicates_marked.all.fastq

N-mask bases in the read with quality score lower than 10:

seqtk seq -A -q 10 -n N SRS049959.denovo_duplicates_marked.all.fastq \
  > SRS049959.denovo_duplicates_marked.q10mask.all.fasta

(seqtk is available here: https://github.com/lh3/seqtk )
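For readers who want to see what the masking step does to a single read, here is a minimal sketch in shell and awk. It is not a replacement for seqtk: the helper name mask_read is hypothetical, and Phred+33 quality encoding is assumed.

```shell
# Hypothetical helper (not part of LMAT or seqtk): N-mask one read.
# Assumes Phred+33 quality encoding.
# Usage: mask_read SEQUENCE QUALITY_STRING MIN_QUALITY
mask_read() {
  # Pass the read on stdin so quality characters such as "\" are not
  # interpreted as escape sequences.
  printf '%s\n%s\n' "$1" "$2" | awk -v minq="$3" '
    BEGIN {
      # Map each printable ASCII character to its code point.
      for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i
    }
    NR == 1 { seq = $0 }
    NR == 2 {
      out = ""
      for (i = 1; i <= length(seq); i++) {
        q = ord[substr($0, i, 1)] - 33   # Phred score of base i
        # Bases scoring below the threshold become N; others are kept.
        out = out ((q < minq) ? "N" : substr(seq, i, 1))
      }
      print out
    }'
}

mask_read ACGT 'I#I+' 10
```

In this call the second base has quality character "#" (Phred 2) and is replaced by N, while "+" (Phred 10) is not lower than the threshold and is kept.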

Run the LMAT pipeline:

run_rl.sh --db_file=/local/ramfs/m9.db.16bit.compr500 \
  --query_file=SRS049959.denovo_duplicates_marked.q10mask.all.fasta

(Here, only the full database was run, not the marker library)

Rank flexible search results

The table below shows partial output for the most abundant read assignments, where reads can be associated with any taxonomic rank. Each read is assigned a score, which is a log-ratio: the numerator is the percentage of k-mers in the read found in genomes associated with the assigned taxonomic value, and the denominator is the percentage of k-mers found in randomly generated reads with similar GC content. The "Weighted Read Score" is the sum of scores for all reads assigned to each taxonomic ID. The "Read Count" is the number of reads assigned to the "Taxonomy Label". Finally, the "Average Score" is the "Weighted Read Score" divided by the "Read Count". An Average Score greater than 0 indicates that the similarity between the reads and the genomes associated with the taxonomic identifier is significant. The table data was generated with LMAT-1.2; output from other versions may show slight variations.
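The scoring arithmetic described above can be sketched as two small shell helpers. Both names (read_score, average_score) are hypothetical and are not LMAT commands. The base of the logarithm is not stated on this page, so the natural log used here is an assumption.

```shell
# Hypothetical helpers illustrating the score arithmetic; not LMAT tools.

# read_score OBSERVED_PCT RANDOM_PCT
# Log-ratio of the k-mer percentage found in the assigned genomes to the
# percentage found in random reads. Natural log is an assumption.
# Both inputs must be positive.
read_score() {
  awk -v obs="$1" -v rnd="$2" 'BEGIN { printf "%.3f\n", log(obs / rnd) }'
}

# average_score WEIGHTED_READ_SCORE READ_COUNT
# Sum of read scores divided by the number of reads.
average_score() {
  awk -v w="$1" -v n="$2" 'BEGIN { printf "%.1f\n", w / n }'
}

average_score 3.81975e+06 1657412
```

The call shown uses the weighted score and read count from the Alistipes shahii WAL 8301 row and prints 2.3, the Average Score reported for that row. A read whose observed and random percentages are equal gets a score of 0 from read_score, which is why values above 0 are the interesting ones.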

In our example run, LMAT processed 59,853,377 reads and 10.87 Gbases. Read search took 2 hours 11 minutes (or 1.38 Mbases/sec). All timings reported here are from a 40-core, 1 TB DRAM Westmere-EX machine.
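The throughput figure follows directly from the two numbers quoted above, as this short calculation shows. The variable names are illustrative only.

```shell
# Reproduce the quoted throughput from the figures on this page.
bases=10870000000                   # 10.87 Gbases
seconds=$(( 2 * 3600 + 11 * 60 ))   # 2 hours 11 minutes
rate=$(awk -v b="$bases" -v s="$seconds" 'BEGIN { printf "%.2f", b / s / 1e6 }')
echo "$rate Mbases/sec"
```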

Average Score	Weighted Read Score	Read Count	NCBI TaxID	Rank, Taxonomy Label
2.3	7.76343e+06	2920336	816	genus, Bacteroides
2.3	3.81975e+06	1657412	717959	no rank, Alistipes shahii WAL 8301
2.5	3.31537e+06	1348390	171549	order, Bacteroidales
1.8	3.30562e+06	1816644	537011	no rank, Prevotella copri DSM 18205

Gene Content Identification

Here are the 3 genes with the most read assignments. The output shows the number of reads assigned to each gene, the NCBI taxonomy ID assigned by LMAT, and the NCBI taxonomy ID associated with the gene in GenBank. Gene identification currently uses a top-hit approach: the GenBank taxonomy ID is taken from the gene, and no additional effort is made to reconcile it with LMAT's taxonomy ID. Gene assignment took 10 minutes 49 seconds (40-core 1TB DRAM machine).

READ COUNT	LMAT TAXID	NCBI TAXID	GENE ID	LOCUS TAG	DESCRIPTION	TYPE	PROTEIN ACCESSION
38075	657319	645463	8468138	CDR20291_1774	hypothetical protein	protein-coding	YP_003218265.1
21861	251695	398580	5712755	Dshi_1611	tyrosine recombinase	protein-coding	YP_003218265.1
13419	763034	709991	10255362	Odosp_3585	transposase IS4 family protein	protein-coding	YP_004254716.1

Organism Summary

Below is partial output for the taxonomy content and abundance summary output:

The relative abundance is the proportion of estimated genome copies associated with each organism. The genome copy estimate is used when k-mer coverage (or breadth of genome coverage) is greater than 1; otherwise the copy number is set to 1 for purposes of abundance estimation. The genome copy estimate is currently computed simply as the median k-mer count for each genome. Column three shows the number of distinct k-mers counted in reads assigned to each organism divided by the number of distinct k-mers for that organism in the reference database. Ratios that approach a value of 2 are attributed to novel k-mers from sequencing error (when the genome has high coverage).

For the content summary table below, reads originally assigned to a higher rank are re-assigned to organisms where possible. The table shows the total number of reads assigned (including those re-assigned), their weighted score, and the number of reads originally assigned specifically to the organism (and their score). The organism summary took 4 hours and 36 minutes to run (40-core 1TB DRAM machine). Runtime can be shorter on less complex samples.

RELATIVE ABUNDANCE (total sums to 1)	GENOME copy estimate	OBSERVED DISTINCT k-mers /expected distinct k-mers	ASSIGNED reads	ASSIGNED reads (weighted score)	SPECIES/STRAIN specific assigned reads (weighted score)	SPECIES/STRAIN specific assigned reads	NCBI TAXID	NCBI RANK, ORGANISM NAME
0.108044	20	2.26963	2656606	4.40747e+06	3.30562e+06	1816644	537011	no rank, Prevotella copri DSM 18205
0.037815	47	1.3835	2154905	4.92342e+06	3.81975e+06	1657412	717959	no rank, Alistipes shahii WAL 8301
0.0648264	12	1.0396	2081701	4.38109e+06	1.66866e+06	781598	411483	no rank, Faecalibacterium prausnitzii A2-165
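The abundance arithmetic described in this section can be sketched as follows. The helper names median and relative_abundance are hypothetical, the input values in the usage line are made up for illustration, and raising copy estimates below 1 up to 1 is a simplification of the coverage rule. The rows in the table above are partial output from the full sample, so this sketch is not expected to reproduce them.

```shell
# Hypothetical helpers (not LMAT tools) for the abundance arithmetic.

# median: reads one k-mer count per line on stdin and prints the median,
# which this page describes as the genome copy estimate.
median() {
  sort -n | awk '{ v[NR] = $1 } END {
    if (NR == 0) exit 1
    if (NR % 2) print v[(NR + 1) / 2]
    else print (v[NR / 2] + v[NR / 2 + 1]) / 2
  }'
}

# relative_abundance: reads "NAME<TAB>COPY_ESTIMATE" lines on stdin and
# prints "NAME<TAB>FRACTION", where the fractions sum to 1.
relative_abundance() {
  awk -F '\t' '{
      name[NR] = $1
      c[NR] = ($2 > 1 ? $2 : 1)   # simplification: floor the copy number at 1
      total += c[NR]
    }
    END { for (i = 1; i <= NR; i++) printf "%s\t%.4f\n", name[i], c[i] / total }'
}

printf 'orgA\t3\norgB\t1\n' | relative_abundance
```

With the made-up input shown, orgA receives three quarters of the total and orgB one quarter.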
