Read Me
ConsPred 1.34
ConsPred is a prokaryotic genome annotation framework that performs various intrinsic
gene predictions, homology searches, and predictions of non-coding genes and complex
features, and integrates all evidence into a consensus annotation. ConsPred achieves
high-quality and comprehensive annotations based on rules and priorities, similar to
decision making in manual curation. Parameters controlling the annotation process are
configurable by the user and the framework can be easily extended and adapted to specific
needs. The genome annotations are produced in formats ready for submission to public
sequence archives.
For questions and feedback please contact:
Thomas Weinmaier ( thomas.weinmaier@univie.ac.at )
Alexander Platzer ( alexander.platzer@univie.ac.at )
INTRODUCTION
ConsPred implements the complete workflow for the annotation of a prokaryotic genome.
It includes:
- prediction of protein-coding genes,
- prediction of non-protein-coding elements,
- functional annotation of proteins,
- prediction of conserved motifs,
- mapping of protein-coding genes onto eggNOG functional categories,
- mapping of protein-coding genes onto KEGG metabolic pathways,
- integration of the various types of annotations.
The prediction of coding sequences (CDS) consists of two parts, the ab initio CDS
prediction and the homology-based prediction.
For the ab initio CDS prediction, multiple ab initio gene prediction tools are
applied (currently Prodigal, GeneMark, Critica, Glimmer) and all predictions are
collected. In the homology-based prediction, all open reading frames (ORFs) from
the first possible start codon to the stop codon are extracted and compared to the NCBI nr
database of protein sequences. To avoid false-positive CDS predictions based
on spurious annotations derived from synteny in closely related genomes, an internal
taxonomy filter is applied. The filter excludes all hits to closely related taxa
(by default, to the genome's own genus) from the homology search. The remaining alignments are
used to trim the ORFs to the first start upstream of the alignment start. Overlaps
between ORFs are resolved based on the number of database hits per ORF, removing
so-called "shadow ORFs", spurious ORFs without homology. All filtered ORFs are searched
for putative pseudogenes: neighboring ORFs that have a hit to the same database
sequence are reported as putative pseudogenes.
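Two of the steps above, the taxonomy filter and the pseudogene check, can be illustrated
with a minimal Python sketch. It is not the ConsPred implementation: the classes Hit and Orf,
their fields and the function names are hypothetical, strand is ignored, and "neighboring"
is simply taken to mean adjacent after sorting by start coordinate.

```python
from dataclasses import dataclass, field


@dataclass
class Hit:
    accession: str  # identifier of the database sequence that was hit
    genus: str      # genus of the organism the database sequence comes from


@dataclass
class Orf:
    name: str
    start: int      # leftmost genome coordinate
    end: int        # rightmost genome coordinate
    hits: list = field(default_factory=list)


def apply_taxonomy_filter(orf, excluded_genus):
    """Drop all hits that come from the excluded genus (assumed: the genome's own genus)."""
    orf.hits = [h for h in orf.hits if h.genus != excluded_genus]
    return orf


def putative_pseudogenes(orfs):
    """Return pairs of adjacent ORFs that share a hit to the same database sequence."""
    ordered = sorted(orfs, key=lambda o: o.start)
    pairs = []
    for left, right in zip(ordered, ordered[1:]):
        shared = {h.accession for h in left.hits} & {h.accession for h in right.hits}
        if shared:
            pairs.append((left.name, right.name, sorted(shared)))
    return pairs


# Made-up example data: three ORFs of a genome from the genus Escherichia.
orfs = [
    Orf("orf1", 100, 400, [Hit("P1", "Escherichia"), Hit("P2", "Bacillus")]),
    Orf("orf2", 420, 800, [Hit("P2", "Bacillus")]),
    Orf("orf3", 900, 1500, [Hit("P3", "Escherichia")]),
]
filtered = [apply_taxonomy_filter(o, "Escherichia") for o in orfs]
print(putative_pseudogenes(filtered))
```

In this example orf3 loses its only hit to the taxonomy filter, while orf1 and orf2 both
retain a hit to P2 and are therefore reported together as a putative pseudogene.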
All ab initio predictions and the filtered conserved ORFs are grouped by stop coordinate
and for every stop a consensus start position is selected following these rules:
1) if only a homology-based prediction exists, the start of the homology-based ORF is selected
2) if only ab initio predictions exist, the start that is supported by the highest number
of methods is selected; in case of a tie, the method with the highest priority wins
(default priority ranking from high to low: Prodigal, GeneMark, Critica, Glimmer)
3) if both homology-based and ab initio predictions exist, all ab initio predictions
that are shorter than the homology-based prediction are discarded first; if any
ab initio predictions remain, the start is selected according to rule 2; if no
ab initio prediction remains, the start of the homology-based ORF is selected
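The three rules can be sketched in Python as follows. This is only an illustration under
stated assumptions, not the ConsPred code: the function names and the data layout (a dict
mapping each method name to its predicted start for one stop coordinate) are hypothetical,
a tie is resolved in favor of the start supported by the highest-ranked method, and
prediction length is taken as the distance between start and stop.

```python
from collections import defaultdict

# Default priority ranking from high to low, as listed in rule 2.
PRIORITY = ["Prodigal", "GeneMark", "Critica", "Glimmer"]


def select_by_votes(ab_initio):
    """Rule 2: pick the start supported by most methods; ties go to the best-ranked method."""
    supporters = defaultdict(list)
    for method, start in ab_initio.items():
        supporters[start].append(method)

    def rank(start):
        votes = len(supporters[start])
        best_priority = min(PRIORITY.index(m) for m in supporters[start])
        return (-votes, best_priority)  # more votes first, then higher priority

    return min(supporters, key=rank)


def consensus_start(stop, ab_initio, homology_start=None):
    """Select the consensus start for all predictions sharing one stop coordinate."""
    if not ab_initio:
        return homology_start                 # rule 1 (None if nothing was predicted)
    if homology_start is None:
        return select_by_votes(ab_initio)     # rule 2
    # Rule 3: discard ab initio predictions shorter than the homology-based one.
    min_length = abs(stop - homology_start)
    kept = {m: s for m, s in ab_initio.items() if abs(stop - s) >= min_length}
    return select_by_votes(kept) if kept else homology_start


# Made-up example: homology-based ORF starts at 100, stop at 1000.
print(consensus_start(1000, {"Prodigal": 130, "Glimmer": 70}, homology_start=100))
```

In the example the Prodigal prediction (start 130) is shorter than the homology-based ORF
and is discarded, so the remaining Glimmer start (70) is selected.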
Non-protein-coding elements (NCEs) are predicted using the tools RNAmmer,
tRNAscan-SE, PilerCR and Infernal. Elements that are known not to overlap
with CDS are considered blocking (rRNAs, tRNAs, CRISPR), all others non-blocking (ncRNAs).
The coordinates of all blocking NCEs are used as a "blacklist", and all consensus CDS that
overlap blacklisted regions are discarded. All consensus CDS that pass this filtering are
used for functional annotation. To annotate the gene name, protein product and EC number,
the filtered consensus CDS are first compared to the manually curated UniProt/SwissProt
database; in a second step, all CDS without a significant hit are compared to the
UniProt/UniRef90 database. Hits with an e-value of at most 1e-5 and an alignment coverage
of at least 70% of both query and subject are used for annotation transfer. This
hierarchical approach prioritizes annotations from the small but high-quality
SwissProt database while also exploiting the much higher coverage of the UniRef90 database.
For every transferred annotation, the accession of the corresponding database hit is
documented. Conserved domains are annotated by comparing the filtered consensus CDS to the
profiles in the InterPro signature databases using InterProScan.
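The hierarchical annotation transfer described above can be sketched in Python. The
thresholds are those stated in the text; everything else is an assumption made for
illustration: the Alignment class, its field names, the function names, and in particular
the choice of the lowest e-value among several significant hits, which the text does not
specify.

```python
from dataclasses import dataclass

MAX_EVALUE = 1e-5     # e-value must be at most this value
MIN_COVERAGE = 0.70   # minimal coverage of both query and subject


@dataclass
class Alignment:
    accession: str
    evalue: float
    query_coverage: float    # fraction of the query covered by the alignment
    subject_coverage: float  # fraction of the subject covered by the alignment


def is_significant(hit):
    """Apply the e-value and two-sided coverage thresholds to one hit."""
    return (hit.evalue <= MAX_EVALUE
            and hit.query_coverage >= MIN_COVERAGE
            and hit.subject_coverage >= MIN_COVERAGE)


def pick_annotation_source(swissprot_hits, uniref90_hits):
    """Return (database, hit), trying SwissProt first and UniRef90 second; None if no hit passes."""
    for database, hits in (("SwissProt", swissprot_hits), ("UniRef90", uniref90_hits)):
        significant = [h for h in hits if is_significant(h)]
        if significant:
            # Assumption: among significant hits, prefer the lowest e-value.
            return database, min(significant, key=lambda h: h.evalue)
    return None


# Made-up example: the SwissProt hit fails the subject coverage, so UniRef90 is used.
swissprot = [Alignment("SP_EXAMPLE", 1e-30, 0.95, 0.40)]
uniref90 = [Alignment("UR_EXAMPLE", 1e-12, 0.90, 0.85)]
print(pick_annotation_source(swissprot, uniref90))
```

Returning the database name together with the hit mirrors the requirement that the
accession of the source hit is documented for every transferred annotation.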
Additionally, the filtered consensus CDS are compared to the KEGG and eggNOG databases,
and assignments to KEGG KOs, KEGG EC numbers, KEGG pathways and eggNOG functional categories
are exported.
Finally, all predicted elements and annotations are integrated to facilitate further analyses.
ConsPred is a pipeline that performs all the described tasks in a single run. To do so, it uses
several external tools, which are described in the PREREQUISITES section. One core module,
called ConsPred2, is included only as a binary; its source code is available as a separate project
of the same name. The '2' in 'ConsPred2' does not denote a version; it reflects the broader
scope of that module, of which the ConsPred pipeline uses only a part.
PREREQUISITES
ConsPred produces a consensus prediction based on predictions from different input sources
and performs comparisons against a number of databases. Therefore, the third-party prediction
programs, their dependencies and all databases have to be available for ConsPred to
work properly.
All software and resources needed by ConsPred are freely available for academic use.
The required disk space for databases and software is about 150 GB for the current version.
PREREQUISITES - THIRD-PARTY SOFTWARE
ConsPred needs the following external software to run:
NCBI Blastall
NCBI Blast+
EMBOSS
BioPerl
additional perl packages: DBI, DBD::SQLite, XML::Simple
java
hmmer2
hmmer3
The following software is optional but highly recommended (a consensus built from only a
single source makes little sense):
Glimmer
GeneMark
Prodigal
Critica
tRNAscan-SE
RNAmmer
PilerCR
Infernal/Rfam
InterProScan
SignalP (used by InterProScan)
TMHMM_2.0c (used by InterProScan)
Phobius (used by InterProScan)
For help with installing these programs, see the suggestions in the file INSTALL in this package
or follow the original documentation of each tool.
To keep the license information of these tools in one place, it is collected in the subfolder
'licenses'. Note that licenses may change with newer versions of the tools.
PREREQUISITES - DATABASES
KEGG
eggNOG
Rfam
NCBI Taxonomy database
NCBI nr
NCBI nt
SwissProt
UniRef90
InterPro (comes with the software)
A recent version of all databases and additional files necessary for ConsPred is
available as a tarball from
http://fileshare.csb.univie.ac.at/conspred_data/conspred_resources_LATEST.tar.gz
Please download and extract the file in the desired download location using the commands:
wget http://fileshare.csb.univie.ac.at/conspred_data/conspred_resources_LATEST.tar.gz
tar xvfz conspred_resources_LATEST.tar.gz
The download size is about 57 GB, so it will take some time. After extraction the databases
take about 160 GB of disk space.
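Because of these sizes it can be worth checking the free disk space before starting the
download. The following Python sketch is only a suggestion and not part of ConsPred; the
function name is hypothetical, and the figure of 217 GB assumes that the tarball (about
57 GB) is kept next to the extracted databases (about 160 GB).

```python
import shutil


def has_free_space(path, required_gb):
    """Return True if the file system containing 'path' has at least required_gb GB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1000**3


# Assumption: tarball and extracted data are stored on the same file system.
REQUIRED_GB = 57 + 160

if has_free_space(".", REQUIRED_GB):
    print("Enough free space for download and extraction.")
else:
    print("Not enough free space in the current directory.")
```

Replace "." with the intended download location.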
INSTALLATION
The latest version of ConsPred can be downloaded from SourceForge:
https://sourceforge.net/projects/conspred/files/
Download it to the installation folder.
There, execute the install script 'install.sh':
For installing ConsPred using the local computer for calculations:
./install.sh local
For installing ConsPred using a Sun Grid Engine on a server-cluster for calculations:
./install.sh SGE
After that, the location and the platform of this installation are fixed.
Then go through the config file ( config/conspredProperties.txt ) and adapt it to
your system. This config file serves as the template for every later annotation run.
AMAZON CLOUD COMPUTING
Since ConsPred has multiple dependencies on external programs and requires downloading
several large databases we prepared an Amazon Machine Image (AMI) that contains all the
software and resources that are needed for a ConsPred annotation run. The AMI is called
'ConsPred annotation framework' and is available as a Community AMI in region US-West-2
(Oregon). The most recent AMI-ID can be found on the ConsPred SourceForge Wiki page
http://sourceforge.net/p/conspred/wiki/Home/
To use the AMI one has to create an account at Amazon Web Services. The account can be
created here: https://aws.amazon.com/?nc2=h_lg
ConsPred is installed as the 'local' version in the AMI, which means that only the computer
running the AMI is used for calculations. Therefore, an instance type with many CPUs (16
or 32) should be used.
APPLICATION
Create a new folder in which the annotation should take place. Decide which taxonomy branch should
be filtered in the homology search (this branch should of course contain the genome being annotated;
for details on the taxonomy filter see the INTRODUCTION).
Prepare the annotation with:
<absolute path to ConsPred>/conspred.sh -i <handle> <taxonomy> <filename>
handle A short name for the organism (only letters and numbers are allowed), which is
used throughout the annotation
taxonomy The NCBI Taxonomy ID or the name of the genus for internal taxonomy filtering.
If set too high in the classification (e.g. 2 for Bacteria), it is ignored,
which means that no taxonomy filtering takes place. If set too low, it is elevated to genus.
You can also disable taxonomy filtering by setting this parameter to '0'.
filename The absolute path to the DNA multi-FASTA file to be annotated
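The constraints on these three arguments can be checked before calling conspred.sh, for
example with the following Python sketch. It is an illustration only: the function name is
hypothetical, and the installation path, handle, genus and FASTA path in the example are
made-up values that must be replaced with your own.

```python
import os
import re


def build_prepare_command(conspred_dir, handle, taxonomy, fasta_path):
    """Validate the arguments and return the 'conspred.sh -i' call as an argument list."""
    if not re.fullmatch(r"[A-Za-z0-9]+", handle):
        raise ValueError("handle may contain only letters and numbers")
    if not os.path.isabs(conspred_dir):
        raise ValueError("the path to ConsPred must be absolute")
    if not os.path.isabs(fasta_path):
        raise ValueError("the path to the FASTA file must be absolute")
    script = os.path.join(conspred_dir, "conspred.sh")
    # taxonomy may be an NCBI Taxonomy ID, a genus name, or 0 to disable filtering
    return [script, "-i", handle, str(taxonomy), fasta_path]


# Hypothetical example values.
command = build_prepare_command(
    "/opt/conspred", "ecoliK12", "Escherichia", "/data/genomes/ecoli_k12.fasta"
)
print(" ".join(command))
```

The returned list can be passed to subprocess.run from within the annotation folder, or
the printed line can be pasted into a shell.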
Check in the config file config/conspredProperties.txt that everything is set as the run
requires (the template of this file is generated during installation).
Besides the config file, a specification file is also created: conspred_input_specification.txt.
This file contains some details for the following annotation run. Be aware that these details
include the thresholds for the plausibility checks (minimal numbers of CDS, rRNAs and tRNAs). If one
of the checks fails, the annotation run is terminated, so if the sequence you annotate is not a real
or complete genome, you may want to set all these thresholds to 0.
Start the annotation with
<absolute path to ConsPred>/conspred.sh -a
Depending on the genome size and the computational resources, the annotation takes from a few hours
to several days and needs no user intervention. It is therefore recommended to run it in a separate
shell session (e.g. with the screen command).
The main script also blocks when a grid engine is used, which is useful for seeing when the run
is finished. If you use a grid engine, especially when the main script itself is submitted as a job,
check carefully whether the synchronization of jobs works with your cluster system (primarily
whether the value/command of 'CLUSTERcommandWAITforALLjobs' in the config file does what is expected).
Be aware that, although the main script blocks, terminating it (e.g. with
CTRL-C in a terminal) does not terminate the cluster jobs it has started; you need
to terminate those cluster jobs separately after terminating the main script.
At the end there are many intermediate files that may be of interest, but the
main output is in analyses/17genbankFile/ -> *consensus.gff, *csv, *gbk, *.embl; those four files
represent the consensus annotation in different formats.
If there are any problems, or if you want to be sure, look into the log files at specificSteps/*out
These can also be scanned with 'egrep -wi --color 'warning|error|problem|found|critical' *.out'.
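If the scan result should be processed further, the same search can be approximated in
Python. The sketch below is not part of ConsPred; the function name is hypothetical, and the
regular expression uses word boundaries to imitate the whole-word, case-insensitive
matching of 'egrep -wi'.

```python
import re
from pathlib import Path

# Same keywords as in the egrep call, matched as whole words and case-insensitively.
KEYWORDS = re.compile(r"\b(?:warning|error|problem|found|critical)\b", re.IGNORECASE)


def scan_logs(log_dir):
    """Return (file name, line number, line) for every matching line in the *.out files."""
    findings = []
    for log_file in sorted(Path(log_dir).glob("*.out")):
        with open(log_file, errors="replace") as handle:
            for line_number, line in enumerate(handle, start=1):
                if KEYWORDS.search(line):
                    findings.append((log_file.name, line_number, line.rstrip("\n")))
    return findings


# Run from within the annotation folder; yields nothing if the folder does not exist.
for name, number, text in scan_logs("specificSteps"):
    print(f"{name}:{number}: {text}")
```

Note that the keyword 'found' also matches harmless messages, so a hit is a pointer to
look at the log rather than proof of a problem.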
EXAMPLE DATA
An example sequence for annotation is in the subfolder testing/inputSequence/
It is highly recommended to perform a ConsPred run on this sequence first to check that all paths
are set and all dependencies are met; this demo run should not take much more than 4 hours.
REFERENCE
Thomas Weinmaier, Alexander Platzer, Jeroen Frank, Hans-Jörg Hellinger, Patrick Tischler and
Thomas Rattei. ConsPred – a rule-based (re-)annotation framework for prokaryotic genomes (submitted)
LICENSE
https://creativecommons.org/licenses/by/4.0/