Recent changes to Example LMAT Run

Example LMAT Run modified by Jonathan

Jonathan — Fri, 16 Oct 2015 00:55:23 -0000

--- v7
+++ v8
@@ -15,7 +15,7 @@

 A typical usage would be:

-    ./run_lmat.sh --marker_library=LMAT_MARKER_LIBRARY --db_file=LMAT_DATABASE 
+    ./run_rl.sh --marker_library=LMAT_MARKER_LIBRARY --db_file=LMAT_DATABASE 
       --query_file=METAGENOMIC_QUERY_FILE

 Where the query file (METAGENOMIC_QUERY_FILE) is searched against a reduced size database (LMAT_MARKER_LIBRARY) to get a quick answer on taxonomic contents, then the query set is searched against the full database (LMAT_DATABASE) to get a more complete accounting of each read and an improved assessment of sample contents.
@@ -48,7 +48,7 @@
 Run the LMAT pipeline:

-    run_lmat.sh --db_file=/local/ramfs/m9.db.16bit.compr500
+    run_rl.sh --db_file=/local/ramfs/m9.db.16bit.compr500
       --query_file=SRS049959.denovo_duplicates_marked.q10mask.all.fasta

 (Here, only the full database was run, not the marker library)

Example LMAT Run modified by Sasha

Sasha — Thu, 15 May 2014 22:31:14 -0000

--- v6
+++ v7
@@ -91,6 +91,5 @@
 RELATIVE ABUNDANCE (total sums to 1) | GENOME copy estimate |OBSERVED DISTINCT k-mers /expected distinct k-mers | ASSIGNED reads |  ASSIGNED reads (weighted score) | SPECIES/STRAIN specific assigned reads (weighted score) | SPECIES/STRAIN specific assigned reads | NCBI TAXID | NCBI RANK, ORGANISM NAME
 -------- | -- | ------- | ------- | ----------- | ----------- | ------- | ------ | ------------------
 0.108044 | 20 | 2.26963 | 2656606 | 4.40747e+06 | 3.30562e+06 | 1816644 | 537011 | no rank, Prevotella copri DSM 18205
-0.03781547 | 1.3835 | 2154905 | 4.92342e+06 | 3.81975e+06 | 1657412 | 717959 | no rank, Alistipes shahii 
-WAL 8301
+0.0378154 | 7 | 1.3835 | 2154905 | 4.92342e+06 | 3.81975e+06 | 1657412 | 717959 | no rank, Alistipes shahii WAL 8301
 0.0648264 | 12 | 1.0396 | 2081701 | 4.38109e+06 | 1.66866e+06 | 781598 | 411483 | no rank, Faecalibacterium prausnitzii A2-165

Example LMAT Run modified by Sasha

Sasha — Thu, 15 May 2014 22:29:56 -0000

--- v5
+++ v6
@@ -91,8 +91,6 @@
 RELATIVE ABUNDANCE (total sums to 1) | GENOME copy estimate |OBSERVED DISTINCT k-mers /expected distinct k-mers | ASSIGNED reads |  ASSIGNED reads (weighted score) | SPECIES/STRAIN specific assigned reads (weighted score) | SPECIES/STRAIN specific assigned reads | NCBI TAXID | NCBI RANK, ORGANISM NAME
 -------- | -- | ------- | ------- | ----------- | ----------- | ------- | ------ | ------------------
 0.108044 | 20 | 2.26963 | 2656606 | 4.40747e+06 | 3.30562e+06 | 1816644 | 537011 | no rank, Prevotella copri DSM 18205
-0.03781547 1.3835  2154905 4.92342e+06 3.81975e+06 1657412 717959  no rank,
-Alistipes shahii 
+0.03781547 | 1.3835 | 2154905 | 4.92342e+06 | 3.81975e+06 | 1657412 | 717959 | no rank, Alistipes shahii 
 WAL 8301
-0.0648264  12  1.0396  2081701 4.38109e+06 1.66866e+06 781598  411483  no rank, Faecalibacterium prausnitzii
-A2-165
+0.0648264 | 12 | 1.0396 | 2081701 | 4.38109e+06 | 1.66866e+06 | 781598 | 411483 | no rank, Faecalibacterium prausnitzii A2-165

Example LMAT Run modified by Sasha

Sasha — Thu, 15 May 2014 22:27:32 -0000

--- v4
+++ v5
@@ -1,7 +1,7 @@
 Overview
 ========

-This page gives a brief overview of how LMAT is typically run, and example output to give an idea of output formats.
+This page gives a brief overview of how LMAT is typically run, and example output to give an idea of output formats. 

 Running LMAT requires as input:

@@ -56,7 +56,7 @@
 Rank flexible search results
 ============================

-The table below shows partial output for the most abundant read assignments, where reads can be associated with any taxonomic rank.  Each read is assigned a score, which is a log-ratio.  The numerator is the percentage of k-mers in the read found in genomes associated with the assigned taxonomic value, and the denominator is the percentage of k-mers found in randomly generated reads with similar GC content.  The “Weighted Read Score” in the table   shows the sum of scores for all reads assigned to each taxonomic ID.  The “Read Count” value is the number of reads assigned to the "Taxonomy Label." Finally, the "Average Score" is the "Weighted Read Score" divided by the "Read Count." Average Read Scores greater than 0 indicates that the level of similarity between the reads and genomes from the taxonomic identifier are  significant. 
+The table below shows partial output for the most abundant read assignments, where reads can be associated with any taxonomic rank.  Each read is assigned a score, which is a log-ratio.  The numerator is the percentage of k-mers in the read found in genomes associated with the assigned taxonomic value, and the denominator is the percentage of k-mers found in randomly generated reads with similar GC content.  The “Weighted Read Score” in the table   shows the sum of scores for all reads assigned to each taxonomic ID.  The “Read Count” value is the number of reads assigned to the "Taxonomy Label." Finally, the "Average Score" is the "Weighted Read Score" divided by the "Read Count." Average Read Scores greater than 0 indicates that the level of similarity between the reads and genomes from the taxonomic identifier are  significant.  The table data was generated with LMAT-1.2.  Output from other versions may show some slight variations.

 In our example run, LMAT processed 59,853,377 reads and 10.87 Gbases. Read search took 2 hours 11 minutes (or 1.38 Mbases/sec).  (All timings reported here are on a 40-core 1TB DRAM Westmere-EX machine).

Example LMAT Run modified by Sasha

Sasha — Thu, 08 May 2014 17:19:00 -0000

--- v3
+++ v4
@@ -20,7 +20,7 @@

 Where the query file (METAGENOMIC_QUERY_FILE) is searched against a reduced size database (LMAT_MARKER_LIBRARY) to get a quick answer on taxonomic contents, then the query set is searched against the full database (LMAT_DATABASE) to get a more complete accounting of each read and an improved assessment of sample contents.

-The following example shows the steps needed to process human stool sample SRS049959 (ftp://public-ftp.hmpdacc.org/Illumina/stool/SRS049959.tar.bz2). An example is shown from HMP.
+The following example shows the steps needed to process human stool sample SRS049959 (<ftp://public-ftp.hmpdacc.org/Illumina/stool/SRS049959.tar.bz2>). An example is shown from HMP.

 First merge read pairs into a single read (simple perl utility script included in distribution). LMAT tracks non-redundant k-mers on both strands to alleviate the need for more involved merging and strand orientation of the read pairs

@@ -43,7 +43,7 @@
     seqtk seq -A -q 10 -n N SRS049959.denovo_duplicates_marked.all.fastq
       SRS049959.denovo_duplicates_marked.q10mask.all.fasta

-(seqtk is available here: https://github.com/lh3/seqtk )
+(seqtk is available here: <https://github.com/lh3/seqtk> )

 Run the LMAT pipeline:

@@ -66,3 +66,33 @@
 2.3           | 3.81975e+06         | 1657412    | 717959     |    no rank,Alistipes shahii WAL 8301
 2.5           | 3.31537e+06         | 1348390    | 171549     | order,Bacteroidales
 1.8           | 3.30562e+06         | 1816644    | 537011     | no rank,Prevotella copri DSM 18205
+
+
+
+
+Gene Content Identification
+---------------------------
+
+Here are the 3 genes with the most read assignments.  The output shows the number of reads assigned to each gene, the NCBI taxonomy ID assigned by LMAT, and the NCBI taxonomy ID associated with the gene in GenBank.  Currently, gene identification is a top-hit approach: no additional effort is made to match LMAT's taxonomy ID with the GenBank taxonomy ID associated with the gene.  Gene assignment took 10 minutes 49 seconds (40-core 1TB DRAM machine).
+
+READ COUNT | LMAT TAXID | NCBI TAXID | GENE ID    | LOCUS TAG | DESCRIPTION | TYPE | PROTEIN ACCESSION
+---------- | ---------- | ---------- | -------    | ---------     | ----------- | ---- | -----------------
+38075      | 657319     | 645463     | 8468138    | CDR20291_1774 | hypothetical protein | protein-coding | YP_003218265.1
+21861 | 251695 | 398580 | 5712755 | Dshi_1611 | tyrosine recombinase | protein-coding | YP_003218265.1
+13419 | 763034 | 709991 | 10255362 | Odosp_3585 | transposase IS4 family protein | protein-coding | YP_004254716.1
+
+Organism Summary
+----------------
+
+Below is partial output from the taxonomy content and abundance summary:
+
+The relative abundance is the fraction of estimated genome copies associated with each organism.  The estimated copy number is used when k-mer coverage (or breadth of genome coverage) is greater than 1; otherwise the copy number is set to 1 for purposes of abundance estimation.  The genome copy estimate is currently computed simply as the median k-mer count for each genome.  Column three shows the ratio of the number of distinct k-mers counted in reads assigned to each organism to the number of distinct k-mers for that organism in the reference database.  Ratios that approach a value of 2 are attributed to novel k-mers from sequencing error (when the genome has high coverage).  For the content summary table below, reads originally assigned to a higher rank are re-assigned to organisms, where possible.  The table shows the total number of reads assigned (including those re-assigned), their weighted score, and the number of reads originally assigned specifically to the organism (and their score).  The organism summary took 4 hours and 36 minutes (40-core 1TB DRAM machine).  Runtime can be shorter on less complex samples.
+
+RELATIVE ABUNDANCE (total sums to 1) | GENOME copy estimate |OBSERVED DISTINCT k-mers /expected distinct k-mers | ASSIGNED reads |  ASSIGNED reads (weighted score) | SPECIES/STRAIN specific assigned reads (weighted score) | SPECIES/STRAIN specific assigned reads | NCBI TAXID | NCBI RANK, ORGANISM NAME
+-------- | -- | ------- | ------- | ----------- | ----------- | ------- | ------ | ------------------
+0.108044 | 20 | 2.26963 | 2656606 | 4.40747e+06 | 3.30562e+06 | 1816644 | 537011 | no rank, Prevotella copri DSM 18205
+0.03781547 1.3835  2154905 4.92342e+06 3.81975e+06 1657412 717959  no rank,
+Alistipes shahii 
+WAL 8301
+0.0648264  12  1.0396  2081701 4.38109e+06 1.66866e+06 781598  411483  no rank, Faecalibacterium prausnitzii
+A2-165

Example LMAT Run modified by Sasha

Sasha — Thu, 08 May 2014 15:40:58 -0000

--- v2
+++ v3
@@ -60,4 +60,9 @@

 In our example run, LMAT processed 59,853,377 reads and 10.87 Gbases. Read search took 2 hours 11 minutes (or 1.38 Mbases/sec).  (All timings reported here are on a 40-core 1TB DRAM Westmere-EX machine).

-
+Average Score | Weighted Read Score | Read Count | NCBI TaxID | Rank, Taxonomy Label
+------------- | ------------------- | ---------- | ---------- | --------------------
+2.3           | 7.76343e+06         | 2920336    | 816        |    genus, Bacteroides
+2.3           | 3.81975e+06         | 1657412    | 717959     |    no rank,Alistipes shahii WAL 8301
+2.5           | 3.31537e+06         | 1348390    | 171549     | order,Bacteroidales
+1.8           | 3.30562e+06         | 1816644    | 537011     | no rank,Prevotella copri DSM 18205

Example LMAT Run modified by Sasha

Sasha — Thu, 08 May 2014 15:32:12 -0000

--- v1
+++ v2
@@ -8,8 +8,10 @@
 * query file given in fasta format
 * reference search database (KmerDB)

+
+
 The KmerDB is downloadable in various forms via anonymous ftp at: 
-ftp://gdo-bioinformatics.ucllnl.org/pub/lmat/.  Documentation included with the distribution shows users how to create a custom database as an option.  LMAT uses additional files; these are included in the distribution and include data such as the taxonomy tree and gene label database taken from NCBI.  Users should perform quality control filtering of query files prior to running LMAT.  Since LMAT analyzes the k-mers in each read, N-masking can be used to direct analysis to the portions of each read with the higher quality sequencer values.
+<ftp://gdo-bioinformatics.ucllnl.org/pub/lmat/>.  Documentation included with the distribution shows users how to create a custom database as an option.  LMAT uses additional files; these are included in the distribution and include data such as the taxonomy tree and gene label database taken from NCBI.  Users should perform quality control filtering of query files prior to running LMAT.  Since LMAT analyzes the k-mers in each read, N-masking can be used to direct analysis to the portions of each read with the higher quality sequencer values.

 A typical usage would be:

@@ -20,27 +22,34 @@

 The following example shows the steps needed to process human stool sample SRS049959 (ftp://public-ftp.hmpdacc.org/Illumina/stool/SRS049959.tar.bz2). An example is shown from HMP.

-* First merge read pairs into a single read (simple perl utility script included in distribution). LMAT tracks non-redundant k-mers on both strands to alleviate the need for more involved merging and strand orientation of the read pairs
+First merge read pairs into a single read (simple perl utility script included in distribution). LMAT tracks non-redundant k-mers on both strands to alleviate the need for more involved merging and strand orientation of the read pairs

     merge_fastq_reads_with_N_separator.pl 
-    SRS049959.denovo_duplicates_marked.trimmed.1.fastq
-    SRS049959.denovo_duplicates_marked.trimmed.2.fastq
-    SRS049959.denovo_duplicates_marked.trimmed.fastq
+      SRS049959.denovo_duplicates_marked.trimmed.1.fastq
+      SRS049959.denovo_duplicates_marked.trimmed.2.fastq
+      SRS049959.denovo_duplicates_marked.trimmed.fastq

-* Add singleton reads:
+
+Add singleton reads:
+
+
     cat SRS049959.denovo_duplicates_marked.trimmed.singleton.fastq
-    SRS049959.denovo_duplicates_marked.trimmed.fastq
-    SRS049959.denovo_duplicates_marked.all.fastq
+      SRS049959.denovo_duplicates_marked.trimmed.fastq
+      SRS049959.denovo_duplicates_marked.all.fastq

-* N-mask bases in the read with quality score lower than 10.
+N-mask bases in the read with quality score lower than 10:
+
+
     seqtk seq -A -q 10 -n N SRS049959.denovo_duplicates_marked.all.fastq
-    SRS049959.denovo_duplicates_marked.q10mask.all.fasta
+      SRS049959.denovo_duplicates_marked.q10mask.all.fasta

 (seqtk is available here: https://github.com/lh3/seqtk )

-* Run the LMAT pipeline:
+Run the LMAT pipeline:
+
+
     run_lmat.sh --db_file=/local/ramfs/m9.db.16bit.compr500
-    --query_file=SRS049959.denovo_duplicates_marked.q10mask.all.fasta
+      --query_file=SRS049959.denovo_duplicates_marked.q10mask.all.fasta

 (Here, only the full database was run, not the marker library)

Example LMAT Run modified by Sasha

Sasha — Thu, 08 May 2014 15:21:27 -0000

Overview
========

This page gives a brief overview of how LMAT is typically run, and example output to give an idea of output formats.

Running LMAT requires as input:

* query file given in FASTA format
* reference search database (KmerDB)

The KmerDB is downloadable in various forms via anonymous ftp at:
ftp://gdo-bioinformatics.ucllnl.org/pub/lmat/. Documentation included with the distribution shows users how to create a custom database as an option. LMAT uses additional files; these are included in the distribution and include data such as the taxonomy tree and gene label database taken from NCBI. Users should perform quality control filtering of query files prior to running LMAT. Since LMAT analyzes the k-mers in each read, N-masking can be used to direct analysis to the portions of each read with higher base quality values.

A typical usage would be:

    ./run_lmat.sh --marker_library=LMAT_MARKER_LIBRARY --db_file=LMAT_DATABASE \
      --query_file=METAGENOMIC_QUERY_FILE

Here the query file (METAGENOMIC_QUERY_FILE) is first searched against a reduced-size database (LMAT_MARKER_LIBRARY) to get a quick answer on taxonomic content. The query set is then searched against the full database (LMAT_DATABASE) to get a more complete accounting of each read and an improved assessment of sample contents.

The following example shows the steps needed to process human stool sample SRS049959 (ftp://public-ftp.hmpdacc.org/Illumina/stool/SRS049959.tar.bz2), taken from the Human Microbiome Project (HMP).

First, merge read pairs into a single read (a simple Perl utility script is included in the distribution). LMAT tracks non-redundant k-mers on both strands, which alleviates the need for more involved merging and strand orientation of the read pairs:

    merge_fastq_reads_with_N_separator.pl \
      SRS049959.denovo_duplicates_marked.trimmed.1.fastq \
      SRS049959.denovo_duplicates_marked.trimmed.2.fastq \
      SRS049959.denovo_duplicates_marked.trimmed.fastq

Add singleton reads:

    cat SRS049959.denovo_duplicates_marked.trimmed.singleton.fastq \
      SRS049959.denovo_duplicates_marked.trimmed.fastq \
      > SRS049959.denovo_duplicates_marked.all.fastq

N-mask bases in the read with quality score lower than 10:

    seqtk seq -A -q 10 -n N SRS049959.denovo_duplicates_marked.all.fastq \
      > SRS049959.denovo_duplicates_marked.q10mask.all.fasta

(seqtk is available here: https://github.com/lh3/seqtk )
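
To make the N-masking step concrete, here is a minimal Python sketch of what masking at a quality threshold of 10 does to a single FASTQ record. It is an illustration only, not seqtk's implementation: the function name `n_mask` is hypothetical, and Phred+33 quality encoding is assumed.

```python
def n_mask(seq, qual, min_quality=10, offset=33, mask_char="N"):
    """Return seq with every base whose Phred quality is below min_quality masked.

    Assumes Phred+33 encoded quality strings (an assumption, not stated in the text).
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings must have equal length")
    return "".join(
        base if ord(q) - offset >= min_quality else mask_char
        for base, q in zip(seq, qual)
    )


# Under Phred+33: 'I' is Q40, '#' is Q2, '5' is Q20, '+' is Q10.
masked = n_mask("ACGT", "I#5+")
print(masked)  # ANGT
```

Only the Q2 base is replaced; a base at exactly Q10 is kept, matching the "lower than 10" wording of the step.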

Run the LMAT pipeline:

    run_lmat.sh --db_file=/local/ramfs/m9.db.16bit.compr500 \
      --query_file=SRS049959.denovo_duplicates_marked.q10mask.all.fasta

(Here, only the full database was run, not the marker library)

Rank flexible search results
============================

The table below shows partial output for the most abundant read assignments, where reads can be associated with any taxonomic rank. Each read is assigned a score, which is a log-ratio: the numerator is the percentage of k-mers in the read found in genomes associated with the assigned taxonomic value, and the denominator is the percentage of k-mers found in randomly generated reads with similar GC content. The "Weighted Read Score" in the table is the sum of scores for all reads assigned to each taxonomic ID. The "Read Count" is the number of reads assigned to the "Taxonomy Label". Finally, the "Average Score" is the "Weighted Read Score" divided by the "Read Count". An Average Score greater than 0 indicates that the level of similarity between the reads and the genomes for that taxonomic identifier is significant.
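
The relationship between the score columns can be checked with a few lines of Python. This is a sketch under stated assumptions: the function names are hypothetical, the natural logarithm is assumed because the text does not give the base of the log-ratio, and the (Weighted Read Score, Read Count) pairs are three rows copied from the example output table for this run.

```python
import math


def read_score(matched_fraction, random_fraction):
    """Log-ratio score for one read (natural log is an assumption)."""
    return math.log(matched_fraction / random_fraction)


def average_score(weighted_read_score, read_count):
    """Average Score = Weighted Read Score / Read Count."""
    return weighted_read_score / read_count


# (label, Weighted Read Score, Read Count) from the example output table.
rows = [
    ("Alistipes shahii WAL 8301", 3.81975e+06, 1657412),
    ("Bacteroidales", 3.31537e+06, 1348390),
    ("Prevotella copri DSM 18205", 3.30562e+06, 1816644),
]
averages = {label: round(average_score(w, n), 1) for label, w, n in rows}
for label, avg in averages.items():
    print(f"{label}: {avg}")

# A read matching no better than random reads scores 0;
# a read matching better than random scores above 0.
assert read_score(0.2, 0.2) == 0.0
assert read_score(0.8, 0.2) > 0.0
```

The zero point of the log-ratio is why an Average Score above 0 is read as a better-than-random match.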

In our example run, LMAT processed 59,853,377 reads and 10.87 Gbases. Read search took 2 hours 11 minutes (or 1.38 Mbases/sec). (All timings reported here are on a 40-core 1TB DRAM Westmere-EX machine).
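
The quoted throughput follows directly from the run totals; the short Python check below reproduces it. Variable names are illustrative, and the mean read length is a derived figure rather than one reported in the text.

```python
reads = 59_853_377
bases = 10.87e9                   # 10.87 Gbases
seconds = 2 * 3600 + 11 * 60      # 2 hours 11 minutes

mbases_per_sec = bases / seconds / 1e6
mean_read_length = bases / reads

print(f"{mbases_per_sec:.2f} Mbases/sec")
print(f"{mean_read_length:.0f} bases per read on average")
```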