
Update 2015.03:
* Added a parameter to indicate supplementary protein sequences which will be
  added to the target database (in specific mode only). For instance, add
  translated and annotated (needs NCBI GIs!) genomic data for
  proteogenomic simulation.
* Added a command-line script for the execution of LiDSiM on Windows.

Known Bug:
* Indicating extraction taxids of organisms which are not in the database
  results in a null-pointer exception.


Update 2014.12: 
* The last zip removed the executability flags of "lidsim" and "gifilter".
  The new zip should contain executable files. If they are still not executable
  on your machine, navigate to the lidsim files and run
  "chmod +x lidsim gifilter".
* Reduced default memory usage to 4 GB. If you still get errors like
  "cannot allocate memory", use the --memory parameter to further reduce it.


LiDSiM: LImits of Detection SImulation for Microbes 

Author:  Mathias Kuhring
Contact: KuhringM@rki.de

https://sourceforge.net/projects/lidsim/


INTRODUCTION
------------

LiDSiM is a tool to estimate the possible influence of error-tolerant database
searches and proteogenomic approaches on the amount of unidentified spectra and
on the ratios of taxonomic relationships of identified spectra in MS/MS studies
of microbial proteomes.

For more details about LiDSiM and its functioning, please see

"Estimating the Computational Limits of Detection of Microbial Non-Model Organisms"
Mathias Kuhring and Bernhard Y. Renard
(Submitted manuscript)

PLEASE NOTE, it is recommended to read the paper and this readme.txt file before 
using LiDSiM.


SYSTEM REQUIREMENTS
-------------------
The following software and libraries are required to run LiDSiM:
- Java 7 
  (Source: http://www.java.com)
- GNU R 3 including Rscript 
  (Source: http://www.r-project.org/)
- additional GNU R Packages: optparse
  (Please refer to the R manuals on how to install packages)

Necessary to run simulations including contigs:  
- transeq from the EMBOSS suite
  (Source: http://emboss.sourceforge.net/)
- Blast+ including blastp and makeblastdb
  (Source: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/)

All required executables (java, Rscript, transeq, blastp and makeblastdb) have
to be available via the PATH environment variable. If manual setup is
necessary, please refer to your operating system's manual on how to set up the
PATH variable.
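
A quick way to check the PATH setup before a run is sketched below. It is not
part of LiDSiM; the function name find_missing is made up for this example,
and the check only tests whether the executables can be found, not whether
their versions are suitable.

```python
import shutil

# Executables named in this readme; the last three are only needed
# for simulations including contigs.
CORE_TOOLS = ["java", "Rscript"]
CONTIG_TOOLS = ["transeq", "blastp", "makeblastdb"]


def find_missing(tools):
    """Return the tools from `tools` that cannot be found via PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    for label, tools in (("core", CORE_TOOLS), ("contig", CONTIG_TOOLS)):
        missing = find_missing(tools)
        print(label, "tools missing:", ", ".join(missing) if missing else "none")
```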


DATA REQUIREMENTS
-----------------
Besides the database you want to examine, you'll need the NCBI gi-taxid mapping
file for proteins and the nodes file of the NCBI taxonomy.

You can find the latest versions here: ftp://ftp.ncbi.nih.gov/pub/taxonomy
They are probably named gi_taxid_prot.zip and nodes.dmp (in taxdmp.zip).

You can also find the nodes file within the example data. However, I recommend
downloading the latest version from NCBI.


PREPROCESSING
-------------
Because of the large size of the mapping file, I recommend pre-filtering it
with the GIs in your database. A simple filter tool is already included with
LiDSiM.

Just run "gifilter YOUR.DATABASE ID.MAPPING.FILE NEW.MAPPING.FILE"
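
To illustrate the idea behind the pre-filtering, here is a minimal sketch. It
is an assumption about what such a filter does, not the actual gifilter code.
It assumes NCBI-style FASTA headers containing "gi|NUMBER|" and a mapping file
with one "gi<TAB>taxid" pair per line; the function names and the demo values
are made up.

```python
import re

# Matches GI numbers in NCBI-style headers such as ">gi|111|ref|XP_1| ..."
GI_PATTERN = re.compile(r"gi\|(\d+)")


def collect_gis(fasta_lines):
    """Collect all GI numbers found in FASTA header lines."""
    gis = set()
    for line in fasta_lines:
        if line.startswith(">"):
            gis.update(GI_PATTERN.findall(line))
    return gis


def filter_mapping(mapping_lines, gis):
    """Yield only those 'gi<TAB>taxid' lines whose GI is in `gis`."""
    for line in mapping_lines:
        fields = line.split()
        if fields and fields[0] in gis:
            yield line


# Tiny in-memory demo with made-up GIs and sequences.
fasta = [">gi|111|ref|XP_1| protein A", "MKV",
         ">gi|333|ref|XP_3| protein C", "MLL"]
mapping = ["111\t1297\n", "222\t9606\n", "333\t1298\n"]
kept = list(filter_mapping(mapping, collect_gis(fasta)))
print("".join(kept), end="")
```

For real files, pass open file handles instead of the lists. Since the
mapping lines are read one at a time, the large mapping file does not need to
fit into memory.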


RUNNING LiDSiM
--------------
LiDSiM is executed by running the "lidsim" script.
For more details about the parameters run "lidsim -h"

It requires at least 4 parameters:
1. The database to be evaluated   (-d/--database)
2. The NCBI gi-taxid mapping file (-i/--idmapping)
   (or preferably a pre-filtered file) 
3. The NCBI taxonomy nodes        (-n/--nodes)
4. A name for the result files    (-o/--output)
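
If you start LiDSiM from a script, the four required parameters can be
assembled as an argument list. The following is only a sketch: the helper name
build_lidsim_command is made up, and the file names are the ones from the
example data described further below.

```python
def build_lidsim_command(database, idmapping, nodes, output, extra=()):
    """Return the lidsim call as an argument list (nothing is executed)."""
    return ["lidsim",
            "-d", database,
            "-i", idmapping,
            "-n", nodes,
            "-o", output,
            *extra]


cmd = build_lidsim_command(
    "example-data/bacteria_w1297.fasta",
    "example-data/phy1297_gi_taxid_prot.dmp",
    "example-data/nodes.dmp",
    "example-data/result_test",
)
print(" ".join(cmd))
# To actually run it (lidsim has to be on the PATH):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Further options described in this readme, e.g. "--mode=specific", can be
passed through the extra argument.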

Additionally, you can indicate different modes of simulation, including
different error-tolerance (-t/--tolerance), a complete iterative evaluation of 
the database (default) or specific evaluations for particular organisms.

For instance, use 
"--mode=complete --extraction=genus --sampling=species --amount=1" 
to extract each genus in your database once, sample one representative of every
species in this genus and search their peptides against the remaining database.

Or use "--mode=specific --extraction=12345 --sampling=54321" 
to search the peptides of the organism with taxid 54321 in the database
excluding the taxonomic subtree with taxid 12345.
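
What "excluding the taxonomic subtree" means can be sketched with the NCBI
nodes file, in which each line starts with a taxid and its parent taxid,
separated by "|". This is an illustration only and not LiDSiM's
implementation; the function names and the toy taxids are made up.

```python
from collections import defaultdict


def read_parents(nodes_lines):
    """Map each taxid to its parent taxid from nodes.dmp-style lines."""
    parents = {}
    for line in nodes_lines:
        fields = [field.strip() for field in line.split("|")]
        if len(fields) >= 2 and fields[0]:
            parents[fields[0]] = fields[1]
    return parents


def subtree(parents, root):
    """Return `root` plus all taxids below it."""
    children = defaultdict(list)
    for taxid, parent in parents.items():
        if taxid != parent:  # the taxonomy root is its own parent
            children[parent].append(taxid)
    found, stack = set(), [root]
    while stack:
        taxid = stack.pop()
        if taxid not in found:
            found.add(taxid)
            stack.extend(children[taxid])
    return found


# Toy taxonomy: root 1, genus 10 with species 11 and 12, genus 20.
toy_nodes = [
    "1\t|\t1\t|\tno rank\t|\n",
    "10\t|\t1\t|\tgenus\t|\n",
    "11\t|\t10\t|\tspecies\t|\n",
    "12\t|\t10\t|\tspecies\t|\n",
    "20\t|\t1\t|\tgenus\t|\n",
]
excluded = subtree(read_parents(toy_nodes), "10")
print(sorted(excluded))
```

In this toy case, extracting taxid 10 removes the genus and both of its
species from the target database, while genus 20 remains.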

To extend the simulation with genomic data, you need to indicate a file with
de novo assembled contigs (-c/--contigs) and indicate the corresponding organism
with the specific mode. The contigs will be translated and annotated using Blast.
You can also add your own annotated data (-e/--extension). However, keep in mind
to exclude the corresponding organism from any database you use in your
annotation workflow, to stay consistent with LiDSiM's idea of simulating the
absence of this organism in the target database.

A seed can be set to reproduce the internal data selection and sampling. This is
useful, for example, to compare different conditions (exact, error-tolerant,
+genome, ...).

The performance of LiDSiM can be adapted with parameters like
--threads, --memory, --diskmode and --peptides.


EXAMPLE DATA & RUN
------------------
You can download a set of example data from the project webpage:
https://sourceforge.net/projects/lidsim/

The file example-data.zip includes a database with the bacterial phylum
"Deinococcus-Thermus" (taxid 1297), a corresponding gi-taxid mapping file, an
NCBI taxonomy nodes file and simulated contigs from Deinococcus deserti VCD115
(taxid 546414).

Example run with an iterative evaluation, extracting each species once:
lidsim -d example-data/bacteria_w1297.fasta -i example-data/phy1297_gi_taxid_prot.dmp -n example-data/nodes.dmp -o example-data/result_test

Example run with a specific evaluation and additional contigs, extracting the
Deinococcus genus and searching with Deinococcus deserti VCD115 peptides:
lidsim -d example-data/bacteria_w1297.fasta -i example-data/phy1297_gi_taxid_prot.dmp -n example-data/nodes.dmp -o example-data/result_test -c example-data/deserti_contigs.fasta --mode=specific --extraction=1298 --sampling=546414


OUTPUT
------
The simulation exports two main result files:
(* denotes the output name indicated with the parameter -o/--output)

*.ratios.txt - the ratios per taxonomic level/rank in a tab-delimited table
*.plot.pdf   - a plot of the ratios
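
Since the ratios file is tab-delimited, it can be read with standard tools for
further processing. A minimal sketch follows; the demo content is made up and
the real column layout may differ, so please check the header of your own
result file first.

```python
import csv
import io


def read_table(handle):
    """Read a tab-delimited table into a list of rows (lists of strings)."""
    return list(csv.reader(handle, delimiter="\t"))


# Made-up content, only to show the call.
demo = io.StringIO("rank\tratio\nspecies\t0.5\ngenus\t0.3\n")
rows = read_table(demo)
for row in rows:
    print(row)

# For a real result file, open it and pass the handle, e.g.:
#   with open("example-data/result_test.ratios.txt", newline="") as handle:
#       rows = read_table(handle)
```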

The plot format depends on the number of extractions in the simulation:
1    - barplot with one bar per rank
2-10 - barplot with ranks stacked in one bar per extraction
>10  - heatmap with extractions per columns and rank ratios indicated by color
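
The rule above, written out as a small function. This only restates the rule
for illustration and is not taken from the LiDSiM plotting script; the
function name and the returned labels are made up.

```python
def plot_kind(n_extractions):
    """Return the plot format expected for a given number of extractions."""
    if n_extractions < 1:
        raise ValueError("at least one extraction is expected")
    if n_extractions == 1:
        return "barplot, one bar per rank"
    if n_extractions <= 10:
        return "stacked barplot, one bar per extraction"
    return "heatmap"


for n in (1, 5, 25):
    print(n, "->", plot_kind(n))
```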


--------------------------------------------------------------------------------
Copyright (c) 2014, 
Mathias Kuhring, KuhringM@rki.de, Robert Koch Institute, Germany, 
All rights reserved.

Redistribution and use in source and binary forms, with or without modification, 
are permitted provided that the following conditions are met:
    * Redistributions of source code must retain the above copyright
      notice, this list of conditions and the following disclaimer.
    * Redistributions in binary form must reproduce the above copyright
      notice, this list of conditions and the following disclaimer in the
      documentation and/or other materials provided with the distribution.
    * The name of the author may not be used to endorse or promote products
      derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND 
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED 
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 
DISCLAIMED. IN NO EVENT SHALL Mathias Kuhring BE LIABLE FOR ANY DIRECT, 
INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, 
BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, 
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF 
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE 
OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF 
ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.