Home modified by Thomas Weinmaier

Thomas Weinmaier — Wed, 12 Aug 2015 16:44:26 -0000

--- v2
+++ v3
@@ -1,4 +1,3 @@
-
 ConsPred 1.22
 ========

@@ -187,7 +186,8 @@
 Since ConsPred has multiple dependencies on external programs and requires downloading
 several large databases we prepared an Amazon Machine Image (AMI) that contains all the
 software and resources that are needed for a ConsPred annotation run. The AMI is called
-'ConsPred annotation framework' (AMI-ID ami-e3353fd3) and is available as a Community AMI.
+'ConsPred annotation framework' (AMI-ID ami-5fc5cf6f) and is available as a Community AMI
+in region US-West-2 (Oregon).

 To use the AMI one has to create an account at Amazon Web Services. The account can be
 created here: https://aws.amazon.com/?nc2=h_lg

Home modified by Thomas Weinmaier

Thomas Weinmaier — Wed, 12 Aug 2015 13:27:23 -0000

--- v1
+++ v2
@@ -1,8 +1,284 @@
-Welcome to your wiki!
-
-This is the default page, edit it as you see fit. To add a new page simply reference it within brackets, e.g.: [SamplePage].
-
-The wiki uses [Markdown](/p/conspred/wiki/markdown_syntax/) syntax.
-
-[[members limit=20]]
-[[download_button]]
+
+ConsPred 1.22
+========
+
+ConsPred is a prokaryotic genome annotation framework that performs various intrinsic
+gene predictions, homology searches, predictions of non-coding genes, and complex
+features and integrates all evidence into a consensus annotation. ConsPred achieves
+high-quality and comprehensive annotations based on rules and priorities, similar to
+decision making in manual curation. Parameters controlling the annotation process are
+configurable by the user and the framework can be easily extended and adapted to specific
+needs. The genome annotations are produced in formats ready for submission to public
+sequence archives.
+
+For questions and feedback please contact:
+        Thomas Weinmaier  ( thomas.weinmaier@univie.ac.at )
+        Alexander Platzer ( alexander.platzer@univie.ac.at )
+
+
+
+INTRODUCTION
+=====
+
+
+ConsPred implements the complete workflow for the annotation of a prokaryotic genome.
+It includes:
+   - prediction of protein-coding genes
+   - prediction of non-protein-coding elements
+   - functional annotation of proteins
+   - prediction of conserved motifs
+   - mapping of protein-coding genes onto eggNOG functional categories
+   - mapping of protein-coding genes onto KEGG metabolic pathways
+   - integration of the various types of annotations
+
+The prediction of coding sequences (CDS) consists of two parts, the ab initio CDS
+prediction and the homology-based prediction.
+For the ab initio CDS prediction, multiple ab initio gene prediction tools are
+applied (currently Prodigal, GeneMark, Critica, Glimmer) and all predictions are
+collected. In the homology-based prediction, all open reading frames (ORFs) from
+the first possible start codon to stop codon are extracted and compared to the NCBI nr
+database of protein sequences. In order to avoid false positive CDS predictions based
+on spurious annotations derived from synteny in closely related genomes, an internal
+taxonomy filter is applied. The filter excludes all hits to closely related taxa
+(by default, taxa within the genome's own genus) from the homology search. The remaining
+alignments are used to trim the ORFs to the first start upstream of the alignment start.
+Overlaps between ORFs are resolved based on the number of database hits per ORF, removing
+so-called "shadow ORFs", spurious ORFs without homology. All filtered ORFs are searched
+for putative pseudogenes: neighboring ORFs that have a hit to the same database
+sequence are reported as putative pseudogenes.
+All ab initio predictions and the filtered conserved ORFs are grouped by stop coordinate
+and for every stop a consensus start position is selected following these rules:
+1) if only a homology-based prediction exists, the ORF start is selected
+2) if only ab initio predictions exist, the start that is supported by the highest number
+   of methods is selected; if there is a draw, the method with the highest priority wins
+   (default priority ranking from high to low: Prodigal, GeneMark, Critica, Glimmer)
+3) if homology-based and ab initio predictions exist, first, all ab initio predictions
+   that are shorter than the homology-based prediction are discarded; if there are
+   remaining ab initio predictions, start selection occurs according to rule 2; if no
+   ab initio prediction remains, the ORF start is selected
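The three start-selection rules above can be sketched in a few lines of Python. This is only an illustration, not ConsPred's own code: the function name `select_start`, the dictionary layout and the length computation from start and stop coordinates are assumptions made for the example.

```python
from collections import Counter

# Default priority from the text, highest first.
PRIORITY = ["Prodigal", "GeneMark", "Critica", "Glimmer"]


def select_start(stop, ab_initio, homology_start=None, priority=PRIORITY):
    """Pick a consensus start for all predictions sharing one stop coordinate.

    ab_initio maps a method name to its predicted start (hypothetical layout);
    homology_start is the start of the trimmed conserved ORF, or None.
    """
    def length(start):
        # Valid on either strand as long as start/stop are the gene's own ends.
        return abs(stop - start) + 1

    candidates = dict(ab_initio)
    if homology_start is not None:
        # Rule 3: discard ab initio calls shorter than the homology-based ORF.
        candidates = {m: s for m, s in candidates.items()
                      if length(s) >= length(homology_start)}
    if not candidates:
        # Rule 1, or rule 3 with nothing left: use the ORF start.
        return homology_start

    # Rule 2: the start with most supporting methods wins;
    # a draw is decided by the highest-priority method.
    votes = Counter(candidates.values())
    best = max(votes.values())
    tied = {s for s, n in votes.items() if n == best}
    for method in priority:
        if candidates.get(method) in tied:
            return candidates[method]
    # Only methods outside the priority list remain: pick deterministically.
    return min(tied)
```

For example, `select_start(1000, {"Prodigal": 130, "Glimmer": 100})` is a draw between two starts and is decided in favour of Prodigal, whereas adding a GeneMark call at 100 would give that start the majority.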
+
+Non-protein-coding elements (NCEs) are predicted using the tools RNAmmer, tRNAscan-SE,
+PilerCR and Infernal. Elements that are known not to overlap with CDS are considered
+blocking (rRNAs, tRNAs, CRISPRs), all others non-blocking (ncRNAs).
+The coordinates of all blocking NCEs are used as a "blacklist", and all consensus CDS that
+overlap blacklisted regions are discarded. All consensus CDS that pass this filter are
+used for functional annotation. To annotate the gene name, protein product and EC number,
+the filtered consensus CDS are first compared to the manually curated UniProt/SwissProt
+database; in a second step, all CDS without a significant hit are compared to the
+UniProt/UniRef90 database. Hits with an e-value of 10^-5 or better and an alignment that
+covers at least 70% of both query and subject are used for annotation transfer. This
+hierarchical approach prioritizes annotations from the small but high-quality
+SwissProt database while also exploiting the much higher coverage of the UniRef90 database.
+For every transferred annotation, the accession of the corresponding database hit is
+documented. Conserved domains are annotated by comparing the filtered consensus CDS to the
+profiles in the InterPro signature databases using InterProScan.
+Additionally, the filtered consensus CDS are compared to the KEGG and eggNOG databases,
+and assignments to KEGG KOs, KEGG EC numbers, KEGG pathways and eggNOG functional categories
+are exported.
+Finally, all predicted elements and annotations are integrated to facilitate further analyses.
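The hit filter and the SwissProt-then-UniRef90 fallback described above could look roughly like the following sketch. The dictionary keys, the function names and the choice of the lowest e-value as the "best" hit are assumptions for illustration; only the two thresholds and the database order come from the text.

```python
def passes_transfer_filter(hit, max_evalue=1e-5, min_coverage=0.70):
    """True if a hit is good enough for annotation transfer.

    hit is a dict (hypothetical layout) with the e-value, the aligned
    span on query and subject, and both sequence lengths.
    """
    q_cov = (hit["q_end"] - hit["q_start"] + 1) / hit["q_len"]
    s_cov = (hit["s_end"] - hit["s_start"] + 1) / hit["s_len"]
    return (hit["evalue"] <= max_evalue
            and q_cov >= min_coverage
            and s_cov >= min_coverage)


def transfer_annotation(swissprot_hits, uniref90_hits):
    """Return (database, accession) of the hit to transfer from, or None.

    SwissProt is consulted first; UniRef90 only if SwissProt yields no
    significant hit. Taking the lowest e-value is an assumption.
    """
    for source, hits in (("SwissProt", swissprot_hits),
                         ("UniRef90", uniref90_hits)):
        good = [h for h in hits if passes_transfer_filter(h)]
        if good:
            best = min(good, key=lambda h: h["evalue"])
            return source, best["accession"]
    return None
```

Returning the database name together with the accession mirrors the requirement that the source of every transferred annotation is documented.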
+
+
+
+PREREQUISITES
+====
+
+ConsPred produces a consensus prediction based on predictions from different input sources
+and performs comparisons against a number of databases. Therefore, the third-party prediction
+programs, their dependencies and all databases have to be available for ConsPred to
+work properly.
+All software and resources needed by ConsPred are freely available for academic use.
+
+The required disk space for databases and software is about 150 GB for the current version.
+
+
+
+PREREQUISITES - THIRD-PARTY SOFTWARE
+------
+
+ConsPred needs the following external software to run:
+NCBI Blastall
+NCBI Blast+
+EMBOSS
+BioPerl
+additional Perl packages: DBI, DBD::SQLite, XML::Simple
+Java
+hmmer2
+hmmer3
+
+
+The following software is optional but highly recommended, since a consensus built from
+a single source makes little sense:
+
+Glimmer
+GeneMark
+Prodigal
+Critica
+tRNAscan-SE
+RNAmmer
+PilerCR
+Infernal/Rfam
+InterProScan
+SignalP (used by InterProScan)
+TMHMM_2.0c (used by InterProScan)
+Phobius (used by InterProScan)
+
+
+For help with these programs, see the suggestions in the file INSTALL in this package or
+follow the original documentation of each tool.
+
+The license information for these tools is collected in one place, the subfolder
+'licenses'. Licenses may change with newer versions of the tools.
+
+
+
+PREREQUISITES - DATABASES
+-------
+
+KEGG
+eggNOG
+Rfam
+NCBI Taxonomy database
+NCBI nr
+NCBI nt
+SwissProt
+UniRef90
+InterPro (comes with the software)
+
+A recent version of all databases and additional files necessary for ConsPred is
+available as a tarball from
+http://fileshare.csb.univie.ac.at/conspred_data/conspred_resources_LATEST.tar.gz
+
+Please download and extract the file in the desired download location using the commands:
+
+    wget http://fileshare.csb.univie.ac.at/conspred_data/conspred_resources_LATEST.tar.gz
+    tar xvfz conspred_resources_LATEST.tar.gz
+
+
+The download size is about 57 GB, so it will take some time. After extraction the databases
+take about 160 GB of disk space.
+
+
+
+INSTALLATION
+====
+
+The latest version of ConsPred can be downloaded from SourceForge:
+
+https://sourceforge.net/projects/conspred/files/
+
+
+Save the download in the installation folder.
+Therein, execute the install script 'install.sh'.
+To install ConsPred using the local computer for calculations:
+    `./install.sh local`
+To install ConsPred using a Sun Grid Engine on a server cluster for calculations:
+    `./install.sh SGE`
+
+After that, the location and the platform of this installation are fixed.
+
+Then go through the config file ( config/conspredProperties.txt ) and adapt it to
+your system. This config file will be the template for each annotation run later.
+
+
+
+AMAZON CLOUD COMPUTING
+====
+
+Since ConsPred has multiple dependencies on external programs and requires downloading
+several large databases we prepared an Amazon Machine Image (AMI) that contains all the
+software and resources that are needed for a ConsPred annotation run. The AMI is called
+'ConsPred annotation framework' (AMI-ID ami-e3353fd3) and is available as a Community AMI.
+
+To use the AMI one has to create an account at Amazon Web Services. The account can be
+created here: https://aws.amazon.com/?nc2=h_lg
+
+ConsPred is installed as the 'local' version in the AMI, which means that only the computer
+running the AMI is used for calculations. Therefore, an instance type with many CPUs (16
+or 32) should be used.
+
+
+
+APPLICATION
+====
+
+Create a new folder in which the annotation will run. Decide which taxonomy branch should
+be filtered out in the homology search (this branch must contain the genome being annotated;
+for details on choosing the taxonomy branch see the INTRODUCTION).
+
+Prepare the annotation with:
+
+    <absolute path to ConsPred>/conspred.sh -i <handle> <taxonomy> <filename>
+
+handle     A short name for the organism (only letters and numbers are allowed), which
+            is used throughout the annotation
+taxonomy   The NCBI Taxonomy-ID or the name of the genus for internal taxonomy filtering.
+            If set too high in the classification (e.g. 2 for Bacteria), it is ignored,
+            which means that no taxonomy filtering takes place. Taxonomy filtering can
+            also be disabled by setting this parameter to '0'.
+filename   The absolute path to the DNA multi-FASTA file to be annotated
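When the preparation step is driven from a script, the constraints on the three arguments can be checked before conspred.sh is called. The helper below is a hypothetical convenience wrapper, not part of ConsPred; it assumes a POSIX system, and the installation path in the usage example is a placeholder.

```python
import os
import re


def build_prepare_command(conspred_dir, handle, taxonomy, fasta_path):
    """Validate the arguments and return the argv list for 'conspred.sh -i'."""
    if not os.path.isabs(conspred_dir):
        raise ValueError("the ConsPred directory must be an absolute path")
    if not re.fullmatch(r"[A-Za-z0-9]+", handle):
        raise ValueError("handle may contain only letters and numbers")
    if not os.path.isabs(fasta_path):
        raise ValueError("the FASTA file must be given as an absolute path")
    # taxonomy may be an NCBI Taxonomy-ID, a genus name, or 0 (no filtering).
    return [os.path.join(conspred_dir, "conspred.sh"), "-i",
            handle, str(taxonomy), fasta_path]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from within the annotation folder, for example `build_prepare_command("/opt/conspred", "ecoli1", 561, "/data/genome.fasta")`.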
+
+
+Check the config file config/conspredProperties.txt to make sure everything is set as
+intended (the template of this file is generated during installation).
+Besides the config file, a specification file is also created: conspred_input_specification.txt.
+This file contains some details for the following annotation run. Be aware that these details
+include the thresholds for the plausibility checks (minimal numbers of CDS, rRNAs and tRNAs). If
+one of the checks fails, the annotation run is terminated; if the input is not a real or complete
+sequence, consider setting all these thresholds to 0.
+
+Start the annotation with 
+
+    <absolute path to ConsPred>/conspred.sh -a
+
+Depending on the genome size and the computational resources, the annotation takes from a few
+hours to several days without user intervention. It is therefore recommended to run it in a
+separate shell session (e.g. with the screen command).
+The main script also blocks when a grid engine is used, which is useful to see when the run
+is finished. If you use a grid engine, especially when the main script itself is submitted as a
+job, check carefully that job synchronization works on your cluster system (primarily, that
+the value/command of 'CLUSTERcommandWAITforALLjobs' in the config file does what is expected).
+Be aware that, although the main script blocks, terminating it (e.g. with CTRL-C in a
+terminal) does not terminate the cluster jobs; these have to be terminated separately
+after terminating the main script.
+
+
+The run leaves many intermediate files that may be of interest, but the
+main output is in analyses/17genbankFile/ -> *consensus.gff, *csv, *gbk, *.embl; these four files
+represent the consensus annotation in different formats.
+
+If there are any problems, or to be sure the run went well, look into the log files at specificSteps/*out
+These can also be scanned with 'egrep -wi --color 'warning|error|problem|found|critical' *.out'.
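For automated runs, the same keyword scan can be done from Python. The sketch below mirrors the egrep call above (whole words, case-insensitive); the function name `flag_lines` is made up for this example, and `\b` is used as an approximation of egrep's `-w` option.

```python
import re

# Same keywords as the egrep call in the text.
KEYWORDS = re.compile(r"\b(warning|error|problem|found|critical)\b",
                      re.IGNORECASE)


def flag_lines(lines):
    """Return (line_number, text) for every line that mentions a keyword."""
    return [(n, line.rstrip("\n"))
            for n, line in enumerate(lines, start=1)
            if KEYWORDS.search(line)]
```

Calling `flag_lines(open(path))` for each log file under specificSteps/ yields the suspicious lines with their line numbers, which is convenient for collecting a summary after a long run.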
+
+
+
+EXAMPLE DATA
+====
+
+An example sequence for annotation is in the subfolder testing/inputSequence/
+It is highly recommended to run ConsPred on this sequence first to check that all paths are set
+and all dependencies are met; this demo run should not take much more than 4 hours.
+
+
+
+TODO
+====
+- prepare a tutorial for use with Amazon cloud computing
+- evaluate RAPSearch and Diamond as a replacement for blast
+- automatically adjust protein products for submission to NCBI  
+
+
+
+REFERENCE
+====
+
+Thomas Weinmaier, Alexander Platzer, Hans-Jörg Hellinger, Patrick Tischler and
+Thomas Rattei. ConsPred – a rule-based (re-)annotation framework for prokaryotic genomes (submitted)
+
+
+
+LICENSE
+====
+
+https://creativecommons.org/licenses/by/4.0/
+
+

Home modified by Thomas Weinmaier

Thomas Weinmaier — Sat, 16 May 2015 11:31:31 -0000

Welcome to your wiki!

This is the default page, edit it as you see fit. To add a new page simply reference it within brackets, e.g.: [SamplePage].

The wiki uses Markdown syntax.

Project Members:

Thomas Weinmaier (admin)
