This page gives a brief overview of how LMAT is typically run and shows example output to illustrate the output formats.
Running LMAT requires two inputs: a k-mer database (KmerDB) and a query file of reads.
The KmerDB is downloadable in various forms via anonymous ftp at ftp://gdo-bioinformatics.ucllnl.org/pub/lmat/. Documentation included with the distribution shows how to create a custom database as an option. LMAT also uses additional files, included in the distribution, that hold data such as the taxonomy tree and the gene label database taken from NCBI.
Users should perform quality control filtering of query files before running LMAT. Because LMAT analyzes the k-mers in each read, N-masking can be used to direct analysis to the portions of each read with higher sequencer quality values.
A typical usage would be:
./run_rl.sh --marker_library=LMAT_MARKER_LIBRARY --db_file=LMAT_DATABASE --query_file=METAGENOMIC_QUERY_FILE
Here, the query file (METAGENOMIC_QUERY_FILE) is first searched against a reduced-size database (LMAT_MARKER_LIBRARY) to get a quick answer on taxonomic content. The query set is then searched against the full database (LMAT_DATABASE) to get a more complete accounting of each read and an improved assessment of sample contents.
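When the pipeline is driven from a script, the command line above can be assembled programmatically. The following is a minimal sketch that uses only the three flags shown above; the helper name `build_lmat_command` is hypothetical and not part of LMAT, and the marker library is treated as optional because the worked example later on this page runs the full database only.

```python
import shlex


def build_lmat_command(db_file, query_file, marker_library=None,
                       script="./run_rl.sh"):
    """Assemble the run_rl.sh argument list (hypothetical helper).

    Only the flags shown in the usage line above are used.
    """
    cmd = [script]
    if marker_library is not None:
        # Optional quick pass against the reduced-size marker library.
        cmd.append(f"--marker_library={marker_library}")
    cmd.append(f"--db_file={db_file}")
    cmd.append(f"--query_file={query_file}")
    return cmd


cmd = build_lmat_command(
    "LMAT_DATABASE",
    "METAGENOMIC_QUERY_FILE",
    marker_library="LMAT_MARKER_LIBRARY",
)
# Print the command as it would be typed in a shell.
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where LMAT is installed.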
The following example, taken from the Human Microbiome Project (HMP), shows the steps needed to process human stool sample SRS049959 (ftp://public-ftp.hmpdacc.org/Illumina/stool/SRS049959.tar.bz2).
First, merge read pairs into a single read (a simple Perl utility script is included in the distribution). LMAT tracks non-redundant k-mers on both strands, which removes the need for more involved merging and strand orientation of the read pairs:
merge_fastq_reads_with_N_separator.pl SRS049959.denovo_duplicates_marked.trimmed.1.fastq SRS049959.denovo_duplicates_marked.trimmed.2.fastq > SRS049959.denovo_duplicates_marked.trimmed.fastq
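To illustrate the idea behind merging with an N separator, here is a small sketch. It is not the Perl script from the distribution: the separator length, the placeholder quality character, and the choice not to reverse-complement the second mate are all assumptions made for illustration. It also assumes that k-mers containing N are skipped, so no k-mer spans the junction between the two mates.

```python
def merge_pair(seq1, qual1, seq2, qual2, n_sep=1):
    """Join two mates into one record separated by N (illustrative only).

    Assumptions: n_sep separator bases, '!' (Phred 0 in Phred+33) as the
    separator quality, and no reverse-complementing of the second mate.
    """
    seq = seq1 + "N" * n_sep + seq2
    qual = qual1 + "!" * n_sep + qual2
    return seq, qual


def kmers_without_n(seq, k):
    """List the k-mers of seq, skipping any k-mer that contains N."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)
            if "N" not in seq[i:i + k]]


merged_seq, merged_qual = merge_pair("ACGT", "IIII", "TTGA", "IIII")
print(merged_seq)                      # ACGTNTTGA
print(kmers_without_n(merged_seq, 3))  # ['ACG', 'CGT', 'TTG', 'TGA']
```

None of the surviving k-mers mixes bases from both mates, which is why a simple concatenation is sufficient here.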
Add singleton reads:
cat SRS049959.denovo_duplicates_marked.trimmed.singleton.fastq SRS049959.denovo_duplicates_marked.trimmed.fastq > SRS049959.denovo_duplicates_marked.all.fastq
N-mask bases in each read whose quality score is lower than 10:
seqtk seq -A -q 10 -n N SRS049959.denovo_duplicates_marked.all.fastq > SRS049959.denovo_duplicates_marked.q10mask.all.fasta
(seqtk is available here: https://github.com/lh3/seqtk)
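The effect of the masking step can be sketched in a few lines. This is an illustration of the operation, not seqtk's implementation, and it assumes Phred+33 quality encoding; the function name `mask_low_quality` is hypothetical.

```python
def mask_low_quality(seq, qual, min_q=10, offset=33):
    """Replace bases whose Phred score is below min_q with N.

    Assumes Phred+33 encoded quality strings (an assumption; check the
    encoding of your own data).
    """
    return "".join(
        base if ord(q) - offset >= min_q else "N"
        for base, q in zip(seq, qual)
    )


# Qualities: I=40, I=40, #=2, I=40, +=10, 5=20
masked = mask_low_quality("ACGTAC", "II#I+5")
print(masked)  # ACNTAC
```

Only the base with quality 2 is masked; a base with quality exactly 10 is kept, because the threshold masks scores strictly lower than 10.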
Run the LMAT pipeline:
run_rl.sh --db_file=/local/ramfs/m9.db.16bit.compr500 --query_file=SRS049959.denovo_duplicates_marked.q10mask.all.fasta
(Here, only the full database was run, not the marker library.)
The table below shows partial output for the most abundant read assignments, where reads can be associated with any taxonomic rank. Each read is assigned a score, which is a log-ratio: the numerator is the percentage of k-mers in the read found in genomes associated with the assigned taxonomic value, and the denominator is the percentage of k-mers found in randomly generated reads with similar GC content. The "Weighted Read Score" column shows the sum of scores for all reads assigned to each taxonomic ID. The "Read Count" is the number of reads assigned to the "Taxonomy Label". Finally, the "Average Score" is the "Weighted Read Score" divided by the "Read Count". An Average Score greater than 0 indicates that the similarity between the reads and the genomes of the taxonomic identifier is significant. The table data was generated with LMAT-1.2; output from other versions may vary slightly.
In our example run, LMAT processed 59,853,377 reads and 10.87 Gbases. Read search took 2 hours 11 minutes (about 1.38 Mbases/sec). All timings reported here are on a 40-core, 1 TB DRAM Westmere-EX machine.
| Average Score | Weighted Read Score | Read Count | NCBI TaxID | Rank, Taxonomy Label |
|---|---|---|---|---|
| 2.7 | 7.76343e+06 | 2920336 | 816 | genus, Bacteroides |
| 2.3 | 3.81975e+06 | 1657412 | 717959 | no rank, Alistipes shahii WAL 8301 |
| 2.5 | 3.31537e+06 | 1348390 | 171549 | order, Bacteroidales |
| 1.8 | 3.30562e+06 | 1816644 | 537011 | no rank, Prevotella copri DSM 18205 |
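The score arithmetic described above can be sketched as follows. The base of the logarithm is not stated on this page, so the natural log used here is an assumption (the sign of the score does not depend on the base). The function names are hypothetical; the weighted scores and read counts are taken from the table.

```python
import math


def read_score(pct_kmers_matched, pct_kmers_random):
    """Log-ratio score for one read (natural log is an assumption)."""
    return math.log(pct_kmers_matched / pct_kmers_random)


def average_score(weighted_read_score, read_count):
    """Average Score = Weighted Read Score / Read Count."""
    return weighted_read_score / read_count


# A read matching better than random background scores above 0.
print(read_score(80.0, 20.0) > 0)   # True
print(read_score(10.0, 20.0) > 0)   # False

# Rows from the table: Alistipes shahii and Prevotella copri.
alistipes_avg = average_score(3.81975e+06, 1657412)
prevotella_avg = average_score(3.30562e+06, 1816644)
print(round(alistipes_avg, 1))   # 2.3
print(round(prevotella_avg, 1))  # 1.8
```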
Here are the 3 genes with the most read assignments. The output shows the number of reads assigned to each gene, the NCBI taxonomy ID assigned by LMAT, and the NCBI taxonomy ID associated with the gene in GenBank. Gene identification is currently a top-hit approach: no additional effort is made to match LMAT's taxonomy ID with the GenBank taxonomy ID associated with the gene. Gene assignment took 10 minutes 49 seconds (40-core, 1 TB DRAM machine).
| READ COUNT | LMAT TAXID | NCBI TAXID | GENE ID | LOCUS TAG | DESCRIPTION | TYPE | PROTEIN ACCESSION |
|---|---|---|---|---|---|---|---|
| 38075 | 657319 | 645463 | 8468138 | CDR20291_1774 | hypothetical protein | protein-coding | YP_003218265.1 |
| 21861 | 251695 | 398580 | 5712755 | Dshi_1611 | tyrosine recombinase | protein-coding | YP_003218265.1 |
| 13419 | 763034 | 709991 | 10255362 | Odosp_3585 | transposase IS4 family protein | protein-coding | YP_004254716.1 |
Below is partial output from the taxonomy content and abundance summary:
The relative abundance is the fraction of estimated genome copies associated with each organism. The genome copy estimate is used when k-mer coverage (or breadth of genome coverage) is greater than 1; otherwise the copy number is set to 1 for purposes of abundance estimation. The genome copy estimate is currently computed simply as the median k-mer count for each genome. Column three shows the number of distinct k-mers counted in reads assigned to each organism divided by the number of distinct k-mers for that organism in the reference database. Ratios that approach a value of 2 are attributed to novel k-mers arising from sequencing error (when the genome has high coverage). For the content summary table below, reads originally assigned to a higher rank are re-assigned to organisms where possible. The table shows the total number of reads assigned (including those re-assigned), their weighted score, and the number of reads originally assigned specifically to the organism (and their score). The organism summary took 4 hours and 36 minutes (40-core, 1 TB DRAM machine). Runtime can be shorter for less complex samples.
| RELATIVE ABUNDANCE (total sums to 1) | GENOME copy estimate | OBSERVED DISTINCT k-mers /expected distinct k-mers | ASSIGNED reads | ASSIGNED reads (weighted score) | SPECIES/STRAIN specific assigned reads (weighted score) | SPECIES/STRAIN specific assigned reads | NCBI TAXID | NCBI RANK, ORGANISM NAME |
|---|---|---|---|---|---|---|---|---|
| 0.108044 | 20 | 2.26963 | 2656606 | 4.40747e+06 | 3.30562e+06 | 1816644 | 537011 | no rank, Prevotella copri DSM 18205 |
| 0.037815 | 47 | 1.3835 | 2154905 | 4.92342e+06 | 3.81975e+06 | 1657412 | 717959 | no rank, Alistipes shahii WAL 8301 |
| 0.0648264 | 12 | 1.0396 | 2081701 | 4.38109e+06 | 1.66866e+06 | 781598 | 411483 | no rank, Faecalibacterium prausnitzii A2-165 |
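The quantities described above can be sketched as follows. This is a literal reading of the prose, not LMAT's implementation, and it is not intended to reproduce the values in the table; all function names and the toy inputs are hypothetical.

```python
from statistics import median


def genome_copy_estimate(kmer_counts):
    """Median k-mer count for one genome, as described in the text."""
    return median(kmer_counts)


def distinct_kmer_ratio(observed_distinct, expected_distinct):
    """Observed distinct k-mers divided by distinct k-mers in the reference."""
    return observed_distinct / expected_distinct


def relative_abundance(copy_estimates, coverages):
    """Normalize genome copy estimates so that they sum to 1.

    Where coverage is not greater than 1, the copy number is set to 1.
    """
    copies = {
        org: (est if coverages[org] > 1 else 1)
        for org, est in copy_estimates.items()
    }
    total = sum(copies.values())
    return {org: c / total for org, c in copies.items()}


# Toy data (made up for illustration).
print(genome_copy_estimate([1, 2, 2, 3, 10]))   # 2
abund = relative_abundance(
    {"org_a": 3, "org_b": 5, "org_c": 4},
    {"org_a": 2.0, "org_b": 1.5, "org_c": 0.4},
)
print(abund)  # org_c falls back to a copy number of 1
```

Using the median rather than the mean keeps a few highly repeated k-mers from inflating the copy estimate.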