SuRankCo: Supervised Ranking of Contigs in de novo Assemblies
Author: Mathias Kuhring
Contact: KuhringM@rki.de
UPDATE
------
Jun2015:
* Minor changes to enable BAM support, too.
Feb2015:
* Added support for FASTA/SAM assemblies in addition to ACE/FASTQ(QUAL).
NOTE: features of FASTA/SAM assemblies will not include BaseCount,
BaseSeqmentCount and ContigQualities, yet.
INTRODUCTION
------------
SuRankCo is a machine-learning-based tool that scores and ranks contigs from de
novo assemblies of next-generation sequencing data. It is trained on alignments
of contigs with known reference genomes and predicts scores and rankings for
contigs that have no related reference genome yet.
For more details about SuRankCo and its functioning, please see
"SuRankCo: Supervised Ranking of Contigs in de novo Assemblies"
Mathias Kuhring, Piotr Wojtek Dabrowski, Andreas Nitsche and Bernhard Y. Renard
(Submitted manuscript)
PLEASE NOTE, it is recommended to read the paper and this readme.txt file before
using SuRankCo.
INSTRUCTIONS
------------
SuRankCo consists of four modules:
- surankco-feature: feature generation from contigs (ACE or FASTA) and
corresponding reads (QUAL, QUA or FASTQ for ACE; SAM or BAM for FASTA)
- surankco-score: alignment of unpadded(!) contigs (FASTA format) and
reference genomes (FASTA format) and score calculation
- surankco-training: training of random forest using contig features (from
surankco-feature) and contig scores (surankco-score)
- surankco-prediction: prediction of scores and ranking of contigs using
trained random forests (from surankco-training) and
contig features (from surankco-feature)
Different combinations of these modules allow for two workflows, training and
prediction. For the training, use surankco-feature and surankco-score with the
same contigs followed by surankco-training. For the prediction, use
surankco-feature with new contigs followed by surankco-prediction.
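The two workflows can be sketched as the following call sequences. This is only
an illustration: the function names are made up for this sketch, the calls are
printed rather than executed so that it runs without SuRankCo installed, and
only flags listed in the execution examples of this readme are used. The
feature file name is derived on the assumption that the assembly is an ACE file.

```shell
# Hypothetical helpers: print the module calls of each workflow in order.

# Training: features and scores for the same contigs, then the training.
training_workflow() {
    dir=$1
    echo "surankco-feature -d $dir"
    echo "surankco-score -d $dir"
    echo "surankco-training -d $dir"
}

# Prediction: features for new contigs, then prediction with trained forests.
prediction_workflow() {
    assembly=$1                              # e.g. assembly_new.ace
    forests=$2                               # e.g. surankco_rfs.RData
    features="${assembly%.ace}.features.txt" # assumes an ACE assembly
    echo "surankco-feature -a $assembly"
    echo "surankco-prediction -f $features -r $forests"
}

training_workflow /home/user/mydata/
prediction_workflow assembly_new.ace surankco_rfs.RData
```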
SYSTEM REQUIREMENTS
-------------------
The following software and libraries are required to run SuRankCo:
- Java 7
(Source: http://www.java.com)
- GNU R 3 including Rscript
(Source: http://www.r-project.org/)
- additional GNU R Packages: optparse, MASS, randomForest
(Please refer to the R manuals on how to install packages)
- Blat and accompanying pslPretty
(Source: http://hgdownload.cse.ucsc.edu/admin/exe/)
All required executables (java, Rscript, blat and pslPretty) have to be
available in the PATH environment variable. Please refer to the operating
system's manual on how to set up the PATH variable, if manual setup is necessary.
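A quick way to verify the PATH setup is to look up each executable with the
shell built-in "command -v". The following is a minimal sketch; the function
name missing_tools is made up for this example and is not part of SuRankCo.

```shell
# Hypothetical helper: print the names of all given executables that
# cannot be found in PATH, separated by single spaces (empty if none).
missing_tools() {
    missing=""
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            missing="${missing:+$missing }$tool"
        fi
    done
    echo "$missing"
}

# Check the four executables required by SuRankCo.
result=$(missing_tools java Rscript blat pslPretty)
if [ -n "$result" ]; then
    echo "Missing from PATH: $result"
else
    echo "All required executables found."
fi
```

Note that this only checks the executables. The R packages optparse, MASS and
randomForest have to be checked from within R.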
NOTE: On Mac/OSX, R version 3.0.3 is recommended. The required package
randomForest is not yet compatible with the latest version 3.1
(please refer to the troubleshooting section for more details).
USER INTERFACE & EXECUTION
--------------------------
SuRankCo is a command-line based software with four executables corresponding to
the four modules. Use the -h flag for more parameter details. surankco-feature,
surankco-score and surankco-training can process the contigs of several
assemblies at once. Accordingly, either a list of files or a directory can be
given. The directory will be scanned for all files that have an applicable
suffix. Input file examples are provided below.
surankco-feature
- Expects pairs of either ACE and FASTQ/QUAL/QUA or FASTA and SAM/BAM files.
You provide either a list of ACE or FASTA files or a directory with such.
NOTE: features of FASTA/SAM/BAM assemblies will not include BaseCount,
BaseSeqmentCount and ContigQualities, yet.
- The expected suffix for the contig FASTAs is ".contigs.fasta" to prevent
confusion with reference genome fasta files.
- FASTQ/QUAL/QUA or SAM/BAM files are automatically detected and expected to
share the same prefix as their corresponding ACE or FASTA files, respectively.
- For ACE: QUAL (default), QUA or FASTQ format can be selected via a parameter,
as well as the FASTQ version (default: Illumina1.8+).
- The FASTQ version can alternatively be auto-detected, but the identification
is not (!) 100% reliable. If the result is uncertain, the process stops and
suggestions for manual settings are provided.
- Some de novo assemblers extend the read names with extra information. A
Java-compatible regular expression can be given to cut off the extensions
and enable matching between ACE and QUAL/FASTQ reads. E.g., a Newbler read
"myread.000089727.26-146.fm1165.to1077" may be trimmed to the original name
"myread.000089727" by using the regex "\\\\.\\\\d+-".
NOTE: if a backslash "\" is needed, use "\\\\"!
- Some feature generation steps can be executed in parallel if more than one
assembly (resp. ACE or FASTA file) is processed. Thus, a number of threads
can be indicated.
- The memory used by the Java virtual machine can be adjusted, e.g. if the
default (32 GB) is too high for the system or too low for big assemblies.
- The feature "Genome Relation" requires an expected genome size. It can be
indicated or roughly estimated as the sum of contig lengths.
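The rough estimate mentioned for the expected genome size can be computed
outside of SuRankCo by summing the sequence lengths of a contig FASTA file.
The following is a minimal sketch under the assumption that the contigs are
available as an unpadded FASTA file; the function name is made up for this
example.

```shell
# Hypothetical helper: sum the lengths of all sequences in a FASTA file.
# Header lines (starting with ">") are skipped, whitespace is ignored.
estimate_genome_size() {
    awk '!/^>/ { gsub(/[ \t\r]/, ""); total += length($0) }
         END   { print total + 0 }' "$1"
}

# Usage (file name taken from the input file examples):
#   estimate_genome_size assembly1.contigs.fasta
```

The printed number can then be passed to surankco-feature as the expected
genome size.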
-> The output is a set of TXT files (one per input ACE or FASTA file)
containing the calculated features in a table-like format. The files are
named with the same paths and prefixes as the corresponding assembly files and
the suffix ".features.txt".
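To preview what a read name regex would do before running surankco-feature,
the Newbler example from the list above can be reproduced with sed. This is
only an approximation: sed does not know the Java shorthand "\d", so "[0-9]" is
used instead, and the sketch assumes that everything from the first match
onwards is discarded. The function name is made up for this example.

```shell
# Hypothetical helper: cut a read name at the first match of "\.[0-9]+-",
# the sed equivalent of the Java regex \.\d+- used in the example above.
trim_read_name() {
    printf '%s\n' "$1" | sed -E 's/\.[0-9]+-.*$//'
}

trim_read_name "myread.000089727.26-146.fm1165.to1077"
# prints: myread.000089727
```

The first numeric block ".000089727" is kept because it is followed by a dot
and not by a hyphen, so the pattern first matches at ".26-".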
surankco-score
- Expects a list of contig FASTA files or a directory with FASTA files.
- The expected suffix for the contig FASTAs is ".contigs.fasta" (variable).
- Reference genome FASTAs are automatically detected and expected to share the
same prefix as corresponding contig FASTAs followed by the suffix
".ref.fasta" (variable).
- The memory used by the Java virtual machine can be adjusted, e.g. if the
default (32 GB) is too high for the system or too low for big assemblies.
-> The output is a set of TXT files (one per input contig FASTA file)
containing the calculated scores in a table-like format. The files are named
with the same paths and prefixes as the corresponding FASTA files and the suffix
".scores.txt". Additionally, a PDF file with histograms per score calculated
over all processed contigs is exported to support the class separation
threshold selection in surankco-training. The default filename
"surankco_score_histograms.pdf" can be customized.
(Please refer to the indicated paper for details about the thresholds)
surankco-training
- Expects a list of SuRankCo feature files or a directory with feature files.
- The expected suffix for the feature files is ".features.txt".
- Corresponding SuRankCo score files are automatically detected and expected to
share the same prefix as the corresponding feature files followed by the
suffix ".scores.txt".
- For the class separation, either manual thresholds per score or a quantile for
automatic threshold selection with exponential fittings can be given.
Manual thresholds must lie in the range [0,1], except for thresholds 5 and 7
(MaxRegionError and MaxEndErrorCount), which must lie in the range [0,100]. Most
scores are separated into <= and > the threshold, except for NormedMatchCount1,
NormedMatchCount2 and NormedContigLength1, which use >= and <.
The exponential quantile must lie in the range [0,1] and uses <= and > in all
cases, since NormedMatchCount1, NormedMatchCount2 and NormedContigLength1 have to
be reversed anyway to enable exponential fittings.
-> The output is an R data file containing the trained random forests.
The default file name is "surankco_rfs.RData" but can be changed.
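The range rules for manual thresholds can be checked before starting a
training run. The following is a minimal sketch, not part of SuRankCo: the
function name is made up, and it only encodes the ranges stated above ([0,100]
for positions 5 and 7, [0,1] for all other positions).

```shell
# Hypothetical helper: validate a comma-separated list of manual thresholds.
# Prints one message per invalid value and returns non-zero if any is found.
validate_thresholds() {
    echo "$1" | awk -F, '
    BEGIN { bad = 0 }
    {
        for (i = 1; i <= NF; i++) {
            # Positions 5 and 7 (MaxRegionError, MaxEndErrorCount) allow up to 100.
            max = (i == 5 || i == 7) ? 100 : 1
            # Accept only non-negative decimal numbers up to the maximum.
            if ($i !~ /^[0-9]*\.?[0-9]+$/ || $i + 0 > max) {
                printf "threshold %d not in [0,%d]: %s\n", i, max, $i
                bad = 1
            }
        }
        exit bad
    }'
}

# The example values from this readme pass the check without any message.
validate_thresholds "0.95,0.95,0.05,0.05,5,0.05,5,0.95"
```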
surankco-prediction
- Expects one SuRankCo feature file with suffix ".features.txt".
- Expects an R data file with a random forest trained by surankco-training and
the suffix ".RData".
-> The output is a table formatted text file (named "surankco_results.txt" by
default, but can be changed), sorted by decreasing scores.
The file contains four columns:
"Assembly" : the filename of the corresponding assembly file
"ReadQuality" : the filename of the corresponding read quality file
"ContigID" : the contig id as in the corresponding assembly and
feature file
"SurankcoContigScore" : the final SuRankCo contig score
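Since the results are sorted by decreasing scores, the best-ranked contigs are
simply the first rows of the file. The following sketch prints the contig id
and score of the top rows. It rests on assumptions that this readme does not
confirm: a single header line, whitespace-separated columns in the order
listed above, and no whitespace inside the file names. The function name is
made up for this example.

```shell
# Hypothetical helper: print "ContigID SurankcoContigScore" for the first
# n data rows of a results file (default: 10 rows).
top_contigs() {
    file=$1
    n=${2:-10}
    # Skip the header line, then print columns 3 and 4 of the next n rows.
    awk -v n="$n" 'NR == 1 { next } NR - 1 <= n { print $3, $4 }' "$file"
}

# Usage:
#   top_contigs surankco_results.txt 5
```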
INPUT FILE & EXECUTION EXAMPLES
-------------------------------
The following file lists illustrate the required file combinations and suffixes.
Training
surankco-feature:
assembly1.ace, assembly2.ace, assemblyXYZ.ace
assembly1.qual, assembly2.qual, assemblyXYZ.qual
surankco-score:
assembly1.contigs.fasta, assembly2.contigs.fasta, assemblyXYZ.contigs.fasta
assembly1.ref.fasta, assembly2.ref.fasta, assemblyXYZ.ref.fasta
surankco-training:
assembly1.features.txt, assembly2.features.txt, assemblyXYZ.features.txt
assembly1.scores.txt, assembly2.scores.txt, assemblyXYZ.scores.txt
Prediction
surankco-feature:
assembly_new.ace
assembly_new.qual
surankco-prediction:
assembly_new.features.txt
surankco_rfs.RData
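Because matching files are detected by shared prefixes, a missing or misnamed
partner file is an easy mistake to make. The following sketch checks the
surankco-score inputs of a directory using the default suffixes
".contigs.fasta" and ".ref.fasta". The function name is made up for this
example, and the same pattern could be adapted to the other file pairs.

```shell
# Hypothetical helper: report every contig FASTA in a directory that has
# no reference FASTA with the same prefix. Returns non-zero if any is missing.
check_reference_pairs() {
    dir=$1
    rc=0
    for contigs in "$dir"/*.contigs.fasta; do
        [ -e "$contigs" ] || continue   # no contig files at all
        ref="${contigs%.contigs.fasta}.ref.fasta"
        if [ ! -e "$ref" ]; then
            echo "missing reference: $ref"
            rc=1
        fi
    done
    return $rc
}

# Usage:
#   check_reference_pairs /home/user/mydata
```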
The following calls show example parameter uses (short flags and long flags).
The example calls assume all files to be in the execution directory (-a) or in
the indicated directory (-d). Full pathnames can be given as well.
Note that the file list and directory parameters are mutually exclusive, but
one of them is mandatory. All other parameters are optional and have default
values. If given, manual.thresholds takes priority over exponential.quantile.
surankco-feature
-a assembly1.ace,assembly2.ace,assemblyXYZ.ace
-d /home/user/mydata/
-r qual
-q illumina18
-s \\\\.\\\\d+-
-t 4
-m 32
-k
-g 1234567,2000000,111111
--assemblies=assembly1.ace,assembly2.ace,assemblyXYZ.ace
--directory=/home/user/mydata/
--read.quality.format=qual
--fastq.version=illumina18
--split.regex=\\\\.\\\\d+-
--threads=4
--memory=32
--kmer.features
--expected.genome.size=1234567,2000000,111111
surankco-score
-a assembly1.contigs.fasta,assembly2.contigs.fasta,assemblyXYZ.contigs.fasta
-d /home/user/mydata/
-f contigs.fasta
-r ref.fasta
-p /home/user/mydata/surankco_score_histograms.pdf
-m 32
--assemblies=assembly1.contigs.fasta,assembly2.contigs.fasta,assemblyXYZ.contigs.fasta
--directory=/home/user/mydata/
--assembly.suffix=contigs.fasta
--reference.suffix=ref.fasta
--pdf.histograms=/home/user/mydata/surankco_score_histograms.pdf
--memory=32
surankco-training
-f assembly1.features.txt,assembly2.features.txt,assemblyXYZ.features.txt
-d /home/user/mydata/
-o /home/user/mydata/mydata_rfs.RData
-e 0.5
-m 0.95,0.95,0.05,0.05,5,0.05,5,0.95
--features=assembly1.features.txt,assembly2.features.txt,assemblyXYZ.features.txt
--directory=/home/user/mydata/
--output.filename=/home/user/mydata/mydata_rfs.RData
--exponential.quantile=0.5
--manual.thresholds=0.95,0.95,0.05,0.05,5,0.05,5,0.95
surankco-prediction
-f assembly_new.features.txt
-r surankco_rfs.RData
-o my_results.txt
--features=assembly_new.features.txt
--random.forests=surankco_rfs.RData
--output.filename=my_results.txt
TROUBLESHOOTING
---------------
P1: [OSX] Error loading package "randomForest" in R 3.1
Error in dyn.load(file, DLLpath = DLLpath, ...) :
kann shared object '/Library/Frameworks/R.framework/Versions/3.1/Resources/
library/randomForest/libs/randomForest.so' nicht laden:
dlopen(/Library/Frameworks/R.framework/Versions/3.1/Resources/library/
randomForest/libs/randomForest.so, 6): Library not loaded: /Library/
Frameworks/R.framework/Versions/3.0/Resources/lib/libgfortran.2.dylib
Referenced from: /Library/Frameworks/R.framework/Versions/3.1/Resources/
library/randomForest/libs/randomForest.so
Reason: image not found
Fehler: Laden von Paket oder Namensraum für ‘randomForest’ fehlgeschlagen
randomForest tries to load external libraries of the older R version 3.0.*.
However, providing them by installing R 3.0.* in parallel leads to
incompatibilities (see next problem P2). Therefore, using only R 3.0.3 is
recommended for OSX.
P2: [OSX] Memory error running surankco-training with R 3.1
*** caught segfault ***
address 0x18, cause 'memory not mapped'
R 3.1 doesn't provide the required versions of external libraries for the
randomForest package (see previous problem P1). Providing the libraries of
older R versions leads to incompatibilities with R 3.1 and seems to produce a
segfault. Therefore, using only R 3.0.3 is recommended for OSX.
P3: Permission denied when executing surankco, e.g.:
-bash: /home/kuhringm/workspace/surankco/surankco-score: Permission denied
The surankco scripts might have lost their execute permission. In a terminal,
go to the surankco directory and run "chmod +x surankco-feature surankco-score
surankco-training surankco-prediction".
P4: surankco-score terminates with the following Permission denied error:
sh: 1: /home/kuhringm/workspace/surankco/r/pslMatchFilter: Permission denied
Error: pslMatchFilter not successfully executed: 126
Execution halted
pslMatchFilter needs execute permission. In a terminal, go to the surankco
directory and run "chmod +x r/pslMatchFilter".
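Both permission problems (P3 and P4) can be detected in one pass by testing
the execute bit of the affected files. The following is a minimal sketch. The
function name is made up for this example, and the file list contains only
the scripts named in P3 and P4.

```shell
# Hypothetical helper: list all SuRankCo scripts below the given directory
# that are missing or not executable.
non_executable() {
    dir=$1
    for script in surankco-feature surankco-score surankco-training \
                  surankco-prediction r/pslMatchFilter; do
        [ -x "$dir/$script" ] || echo "$dir/$script"
    done
}

# Usage, run from the surankco directory:
#   non_executable .
```

Each listed file can then be fixed with "chmod +x" as described above.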
P5: surankco-score prints dexp/NaN warning messages:
Warning messages:
1: In dexp(x, estimate, log = TRUE) : NaNs produced
2: In dexp(x, rate = fit$estimate) : NaNs produced
Some scores don't have enough variance to fit an exponential distribution. This
is not critical in surankco-score, but might be a problem for the training
(see problem P7).
P6: surankco-training prints dexp/NaN warning messages:
Warning messages:
1: In dexp(x, estimate, log = TRUE) : NaNs produced
Some scores don't have enough variance to fit an exponential distribution.
The training will not be possible (see problem P7).
P7: surankco-training terminates with the following training error:
Error in randomForest.default(x = input[index, ], y = targets[index, i], :
Need at least two classes to do classification.
Calls: rfTraining -> randomForest -> randomForest.default
Execution halted
Some scores don't provide two classes. Either the variance of these scores is
indeed too low or the thresholds for class separation are poorly chosen. The
variance might be increased by providing more contigs for the training.
--------------------------------------------------------------------------------
Copyright (c) 2014,
Mathias Kuhring, KuhringM@rki.de, Robert Koch Institute, Germany,
All rights reserved.
Redistribution and use in source and binary forms, with or without modification,
are permitted provided that the following conditions are met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* The name of the author may not be used to endorse or promote products
derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL Mathias Kuhring BE LIABLE FOR ANY DIRECT,
INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE
OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.