Read Me
Update 2015.03:
* Added a parameter to indicate supplementary protein sequences which will be
added to the target database (in specific mode only). For instance, add
translated and annotated genomic data (NCBI GIs required!) for a
proteogenomic simulation.
* Added a command-line script for the execution of LiDSiM on Windows.
Known Bug:
* Indicating extraction taxids of organisms which are not in the database
results in a null-pointer exception.
Update 2014.12:
* The last zip removed the executability flags of "lidsim" and "gifilter".
The new zip should contain executable files. If they are still not executable
on your machine, navigate to the lidsim files and run "chmod +x lidsim gifilter".
* Reduced default memory usage to 4 GB. If you still get errors like
"cannot allocate memory", use the --memory parameter to further reduce it.
LiDSiM: LImits of Detection SImulation for Microbes
Author: Mathias Kuhring
Contact: KuhringM@rki.de
https://sourceforge.net/projects/lidsim/
INTRODUCTION
------------
LiDSiM is a tool to estimate the possible influence of error-tolerant database
searches and proteogenomic approaches on the amount of unidentified spectra and
on the ratios of taxonomic relationships of identified spectra in MS/MS studies
of microbial proteomes.
For more details about LiDSiM and its functioning, please see
"Estimating the Computational Limits of Detection of Microbial Non-Model Organisms"
Mathias Kuhring and Bernhard Y. Renard
(Submitted manuscript)
PLEASE NOTE, it is recommended to read the paper and this readme.txt file before
using LiDSiM.
SYSTEM REQUIREMENTS
-------------------
The following software and libraries are required to run LiDSiM:
- Java 7
(Source: http://www.java.com)
- GNU R 3 including Rscript
(Source: http://www.r-project.org/)
- additional GNU R Packages: optparse
(Please refer to the R manuals on how to install packages)
Necessary to run simulations including contigs:
- transeq from the EMBOSS suite
(Source: http://emboss.sourceforge.net/)
- Blast+ including blastp and makeblastdb
(Source: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/)
All required executables (java, Rscript, transeq, blastp and makeblastdb) must
be available via the PATH environment variable. If manual setup is necessary,
please refer to your operating system's manual on how to set the PATH variable.
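As a quick check before the first run, the following sketch reports which of
the executables listed above cannot be found via PATH. The function name
missing_tools is made up for this illustration and is not part of LiDSiM; it
relies only on the POSIX shell built-in "command -v".

```shell
# missing_tools is a hypothetical helper, not shipped with LiDSiM.
# It prints every given executable that cannot be found via PATH
# and returns 1 if at least one is missing, 0 otherwise.
missing_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
            rc=1
        fi
    done
    return "$rc"
}

# Example call with the executables named in this readme:
#   missing_tools java Rscript transeq blastp makeblastdb
```

An empty output means that all listed executables were found.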
DATA REQUIREMENTS
-----------------
Besides the database you want to examine, you'll need the NCBI gi-taxid mapping
file for proteins and the nodes of the NCBI taxonomy.
You can find the latest versions here: ftp://ftp.ncbi.nih.gov/pub/taxonomy
They are probably named gi_taxid_prot.zip and nodes.dmp (in taxdmp.zip).
The nodes file is also included in the example data; however, I recommend
downloading the latest version from NCBI.
PREPROCESSING
-------------
Because of the large size of the mapping file, I recommend pre-filtering it with
the GIs in your database. A simple filter tool is already included with LiDSiM.
Just run "gifilter YOUR.DATABASE ID.MAPPING.FILE NEW.MAPPING.FILE"
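To illustrate what such a pre-filter does, here is a minimal sketch of the idea
in shell. It is not the implementation of gifilter. It assumes that every FASTA
header of the database starts with ">gi|NUMBER|" and that the mapping file holds
one GI and one taxid per line, separated by whitespace (as in
gi_taxid_prot.dmp). The function name filter_gi_mapping is hypothetical.

```shell
# filter_gi_mapping DATABASE MAPPING
# Hypothetical illustration, not the gifilter tool itself.
# Prints only those mapping rows whose GI occurs in a database header.
filter_gi_mapping() {
    awk '
        # Database (first file): remember the GI of every header line.
        # Assumption: headers look like ">gi|12345|ref|...".
        FILENAME == ARGV[1] {
            if (split($0, part, "|") >= 2 && part[1] == ">gi")
                wanted[part[2]] = 1
            next
        }
        # Mapping (second file): print rows whose first column is a known GI.
        ($1 in wanted)
    ' "$1" "$2"
}

# Example call, writing the reduced mapping to a new file:
#   filter_gi_mapping YOUR.DATABASE ID.MAPPING.FILE > NEW.MAPPING.FILE
```

The sketch keeps all GIs of the database in memory, which should be fine for
typical databases but has not been tested on large ones.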
RUNNING LiDSiM
--------------
LiDSiM is executed by running the "lidsim" script.
For more details about the parameters, run "lidsim -h".
It requires at least four parameters:
1. The database to be evaluated (-d/--database)
2. The NCBI gi-taxid mapping file (-i/--idmapping)
(or preferably a pre-filtered file)
3. The NCBI taxonomy nodes (-n/--nodes)
4. A name for the result files (-o/--output)
Additionally, you can select different simulation settings, including the
error tolerance (-t/--tolerance) and the mode: a complete iterative evaluation
of the database (default) or a specific evaluation for particular organisms.
For instance, use
"--mode=complete --extraction=genus --sampling=species --amount=1"
to extract each genus in your database once, sample one representative of every
species in this genus and search their peptides against the remaining database.
Or use "--mode=specific --extraction=12345 --sampling=54321"
to search the peptides of the organism with taxid 54321 in the database
excluding the taxonomic subtree with taxid 12345.
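If several organisms should be searched against the same reduced database, one
straightforward way is to start the specific mode once per sampling taxid. The
sketch below only prints the corresponding lidsim calls, so they can be reviewed
before running them. It uses the parameters described above; the function name
specific_runs and the output naming scheme "result_TAXID" are assumptions made
for this example, and file names must not contain spaces.

```shell
# specific_runs DATABASE MAPPING NODES EXTRACTION_TAXID SAMPLING_TAXID...
# Hypothetical helper: prints (does not execute) one lidsim call in
# specific mode per sampling taxid.
specific_runs() {
    db=$1
    mapping=$2
    nodes=$3
    extraction=$4
    shift 4
    for taxid in "$@"; do
        # The output name result_TAXID is an assumption for this example.
        printf 'lidsim -d %s -i %s -n %s -o result_%s --mode=specific --extraction=%s --sampling=%s\n' \
            "$db" "$mapping" "$nodes" "$taxid" "$extraction" "$taxid"
    done
}

# Example with the placeholder taxids from the text above:
#   specific_runs my.fasta my_mapping.dmp nodes.dmp 12345 54321
```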
To extend the simulation with genomic data, you need to indicate a file with
de novo assembled contigs (-c/--contigs) and indicate the corresponding organism
with the specific mode. The contigs will be translated and annotated using Blast.
You can also add your own annotated data (-e/--extension). However, remember to
exclude the corresponding organism from any database you use in your annotation
workflow. This keeps the data consistent with LiDSiM's idea of simulating the
absence of this organism in the target database.
A seed can be set to reproduce internal data selection/sampling. This is useful,
for instance, to compare different conditions (exact, error-tolerant, +genome, ...).
The performance of LiDSiM can be adapted with parameters like
--threads, --memory, --diskmode and --peptides.
EXAMPLE DATA & RUN
------------------
You can download a set of example data from the project webpage:
https://sourceforge.net/projects/lidsim/
The file example-data.zip includes a database with the bacterial phylum
"Deinococcus-Thermus" (taxid 1297), a corresponding gi-taxid mapping file, an
NCBI taxonomy nodes file and simulated contigs from Deinococcus deserti VCD115
(taxid 546414).
Example run with an iterative evaluation, extracting each species once:
lidsim -d example-data/bacteria_w1297.fasta -i example-data/phy1297_gi_taxid_prot.dmp -n example-data/nodes.dmp -o example-data/result_test
Example run with a specific evaluation and additional contigs, extracting the
Deinococcus genus and searching with Deinococcus deserti VCD115 peptides:
lidsim -d example-data/bacteria_w1297.fasta -i example-data/phy1297_gi_taxid_prot.dmp -n example-data/nodes.dmp -o example-data/result_test -c example-data/deserti_contigs.fasta --mode=specific --extraction=1298 --sampling=546414
OUTPUT
------
The simulation exports two main result files:
(* denotes the output name indicated with the parameter -o/--output)
*.ratios.txt - the taxonomic rank ratios in a tab-delimited table
*.plot.pdf - a plot of the ratios
The plot format depends on the number of extractions in the simulation:
1 - barplot with one bar per rank
2-10 - barplot with ranks stacked in one bar per extraction
>10 - heatmap with one column per extraction and rank ratios indicated by color
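The rule above can be written down as a small function, for example to predict
the plot type in a wrapper script. This is only a restatement of the list in
this readme, reading the middle range as 2 to 10 extractions; it is not taken
from the LiDSiM source, and the function name and the returned labels are made
up.

```shell
# plot_format N prints the plot type expected for N extractions.
# Hypothetical helper; the labels are chosen for this example only.
plot_format() {
    if [ "$1" -le 0 ]; then
        echo "invalid number of extractions: $1" >&2
        return 1
    elif [ "$1" -eq 1 ]; then
        echo "barplot"
    elif [ "$1" -le 10 ]; then
        echo "stacked barplot"
    else
        echo "heatmap"
    fi
}

# Example call:
#   plot_format 25
```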
--------------------------------------------------------------------------------
Copyright (c) 2014,
Mathias Kuhring, KuhringM@rki.de, Robert Koch Institute, Germany,
All rights reserved.
Redistribution and use in source and binary forms, with or without modification,
are permitted provided that the following conditions are met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* The name of the author may not be used to endorse or promote products
derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL Mathias Kuhring BE LIABLE FOR ANY DIRECT,
INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE
OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.