SuRankCo: Supervised Ranking of Contigs in de novo Assemblies

Author:  Mathias Kuhring
Contact: KuhringM@rki.de


UPDATE
------
Jun2015:
* Minor changes to add BAM support as well.

Feb2015:
* Added support for FASTA/SAM assemblies in addition to ACE/FASTQ(QUAL).
  NOTE: features of FASTA/SAM assemblies do not yet include BaseCount, 
  BaseSeqmentCount and ContigQualities.


INTRODUCTION
------------

SuRankCo is a machine learning based software to score and rank contigs from de
novo assemblies of next generation sequencing data. It is trained on alignments
of contigs to known reference genomes and predicts scores and rankings for
contigs that have no related reference genome yet.

For more details about SuRankCo and its functioning, please see

 "SuRankCo: Supervised Ranking of Contigs in de novo Assemblies"
 Mathias Kuhring, Piotr Wojtek Dabrowski, Andreas Nitsche and Bernhard Y. Renard
 (Submitted manuscript)

PLEASE NOTE, it is recommended to read the paper and this readme.txt file before 
using SuRankCo.

	
INSTRUCTIONS
------------
SuRankCo consists of four modules: 
- surankco-feature:     feature generation from contigs (ACE or FASTA) and 
                        corresponding reads (QUAL, QUA or FASTQ for ACE; 
                        SAM or BAM for FASTA)
- surankco-score:       alignment of unpadded(!) contigs (FASTA format) and 
                        reference genomes (FASTA format) and score calculation
- surankco-training:    training of random forest using contig features (from 
                        surankco-feature) and contig scores (surankco-score)
- surankco-prediction:  prediction of scores and ranking of contigs using 
                        trained random forests (from surankco-training) and 
                        contig features (from surankco-feature)

Different combinations of these modules allow for two workflows, training and 
prediction. For the training, use surankco-feature and surankco-score with the 
same contigs followed by surankco-training. For the prediction, use 
surankco-feature with new contigs followed by surankco-prediction.


SYSTEM REQUIREMENTS
-------------------
The following software and libraries are required to run SuRankCo:
- Java 7 
  (Source: http://www.java.com)
- GNU R 3 including Rscript 
  (Source: http://www.r-project.org/)
- additional GNU R Packages: optparse, MASS, randomForest 
  (Please refer to the R manuals on how to install packages)
- Blat and accompanying pslPretty
  (Source: http://hgdownload.cse.ucsc.edu/admin/exe/)

All required executables (java, Rscript, blat and pslPretty) have to be 
available in the PATH environment variable. Please refer to the operating 
system's manual on how to set up the PATH variable, if manual setup is necessary.
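
The PATH requirement above can be checked before the first run. The following 
is a minimal Python sketch and not part of SuRankCo; the function name 
missing_tools is made up for illustration, and it only tests whether the four 
executables can be found, not whether their versions are suitable.

```python
import shutil

# Executables the readme lists as required on PATH.
REQUIRED_TOOLS = ["java", "Rscript", "blat", "pslPretty"]


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All required executables were found on PATH.")
```

The lookup function is passed as a parameter only so that the check can be 
tried out without touching the real environment.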

NOTE: For Mac/OSX, R version 3.0.3 is recommended. The necessary package 
randomForest is not yet compatible with the latest version 3.1
(please refer to the troubleshooting section for more details).


USER INTERFACE & EXECUTION
--------------------------
SuRankCo is a command-line based software with four executables corresponding 
to the four modules. Use the -h flag for details on the parameters. 
surankco-feature, surankco-score and surankco-training can process the contigs 
of several assemblies at once, so either a list of applicable files or a 
directory can be given. A directory is scanned for all files that have an 
applicable suffix. Input file examples are provided below.

surankco-feature
 - Expects pairs of either ACE and FASTQ/QUAL/QUA or FASTA and SAM/BAM files. 
   Provide either a list of ACE or FASTA files or a directory containing them.
   NOTE: features of FASTA/SAM/BAM assemblies do not yet include BaseCount, 
   BaseSeqmentCount and ContigQualities.
 - The expected suffix for the contig FASTAs is ".contigs.fasta" to prevent
   confusion with reference genome FASTA files.
 - FASTQ/QUAL/QUA or SAM/BAM files are automatically detected and expected to 
   share the same prefix as their corresponding ACE or FASTA files, respectively.
 - For ACE: the QUAL (default), QUA or FASTQ format can be selected via a 
   parameter, as can the FASTQ version (default: Illumina1.8+).
 - Alternatively, the FASTQ version can be detected automatically, but the 
   identification is not (!) 100% reliable. In case of uncertainty the process 
   stops and suggestions for manual settings are provided.
 - Some de novo assemblers may extend the read names with extra information. A 
   Java compatible regular expression can be given to cut off the extensions 
   and enable matching between ACE and QUAL/FASTQ reads. E.g., a Newbler read 
   "myread.000089727.26-146.fm1165.to1077" may be trimmed to the original name 
   "myread.000089727" by using the regex "\\\\.\\\\d+-".
   NOTE: if a backslash "\" is needed, use "\\\\"!
 - Some feature generation steps can be executed in parallel if more than one
   assembly (i.e. ACE or FASTA file) is processed. For this purpose, a number 
   of threads can be given.
 - The memory used by the Java virtual machine can be adjusted, e.g. if the
   default (32 GB) is too high for the system or too low for big assemblies.
 - The feature "Genome Relation" requires an expected genome size. It can be 
   indicated or roughly estimated as the sum of contig lengths.
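
   The rough estimate mentioned in the last item, the sum of contig lengths, 
   can be computed from a contig FASTA file. This is a minimal Python sketch 
   and not SuRankCo code; the function name sum_contig_lengths is made up for 
   illustration, and every character on a sequence line is counted, so gap or 
   pad characters would be included.

```python
def sum_contig_lengths(fasta_lines):
    """Sum the sequence lengths of all records in an iterable of FASTA lines."""
    total = 0
    for line in fasta_lines:
        line = line.strip()
        # Header lines start with ">" and do not contribute to the length.
        if line and not line.startswith(">"):
            total += len(line)
    return total


# Two small made-up contigs: 6 + 2 bases and 4 bases.
example = [">contig1", "ACGTAC", "GT", ">contig2", "TTGA"]
print(sum_contig_lengths(example))  # prints 12
```

   An open file handle can be passed in place of the list, since the function 
   only iterates over lines.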
 
 -> The output is a set of TXT files (one per input ACE or FASTA file) 
    containing the calculated features in a table-like format. The files are 
    named with the same paths and prefixes as the corresponding assembly files 
    and the suffix ".features.txt".
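
 The read name trimming described in the list above can be tried out before a 
 run. The sketch below uses Python's re module instead of Java's regex engine 
 (the pattern used here behaves the same in both) and assumes that the part of 
 the name before the first match is kept. That assumption is taken from the 
 Newbler example, not from the SuRankCo source. The pattern is written without 
 the extra backslash escaping that the command line needs.

```python
import re

# Pattern from the readme example, without command-line escaping.
SPLIT_PATTERN = re.compile(r"\.\d+-")


def trim_read_name(name, pattern=SPLIT_PATTERN):
    """Keep only the part of `name` before the first match of `pattern`."""
    return pattern.split(name, maxsplit=1)[0]


print(trim_read_name("myread.000089727.26-146.fm1165.to1077"))
# prints: myread.000089727
```

 A name without a match is returned unchanged, which makes it easy to test a 
 pattern against a handful of read names from both the ACE and the quality file.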

surankco-score
 - Expects a list of contig FASTA files or a directory with FASTA files. 
 - The expected suffix for the contig FASTAs is ".contigs.fasta" (variable).
 - Reference genome FASTAs are automatically detected and expected to share the 
   same prefix as corresponding contig FASTAs followed by the suffix 
   ".ref.fasta" (variable).
 - The memory used by the Java virtual machine can be adjusted, e.g. if the
   default (32 GB) is too high for the system or too low for big assemblies.

 -> The output is a set of TXT files (one per input contig FASTA file) 
    containing the calculated scores in a table-like format. The files are named 
    with the same paths and prefixes as the corresponding FASTA files and the 
    suffix ".scores.txt". Additionally, a PDF file with histograms per score 
    calculated over all processed contigs is exported to support the class 
    separation threshold selection in surankco-training. The default filename 
    "surankco_score_histograms.pdf" can be customized.
    (Please refer to the indicated paper for details about the thresholds)
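
 The pairing of contig and reference FASTA files by prefix and suffix can be 
 checked in advance. The following Python sketch only mirrors the naming rule 
 stated above with the default suffixes; it is not how SuRankCo itself detects 
 the files, and the function names are made up for illustration.

```python
def reference_for(contig_fasta, assembly_suffix="contigs.fasta",
                  reference_suffix="ref.fasta"):
    """Return the reference FASTA name expected for a contig FASTA name."""
    if not contig_fasta.endswith("." + assembly_suffix):
        raise ValueError(contig_fasta + " does not end with ." + assembly_suffix)
    # Keep the shared prefix (including the dot) and swap the suffix.
    prefix = contig_fasta[: -len(assembly_suffix)]
    return prefix + reference_suffix


def unmatched_contigs(filenames):
    """List contig FASTAs in `filenames` whose reference FASTA is missing."""
    names = set(filenames)
    return [name for name in sorted(names)
            if name.endswith(".contigs.fasta")
            and reference_for(name) not in names]


files = ["assembly1.contigs.fasta", "assembly1.ref.fasta",
         "assembly2.contigs.fasta"]
print(unmatched_contigs(files))  # prints ['assembly2.contigs.fasta']
```

 The same idea applies to the other modules, for example ".features.txt" and 
 ".scores.txt" in surankco-training.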

surankco-training
 - Expects a list of SuRankCo feature files or a directory with feature files.
 - The expected suffix for the feature files is ".features.txt".
 - Corresponding SuRankCo score files are automatically detected and expected to 
   share the same prefix as the corresponding feature files followed by the 
   suffix ".scores.txt".
 - For the class separation, either manual thresholds per score or a quantile 
   for automatic threshold selection with exponential fittings can be given.
   Manual thresholds must lie in the range [0,1], except for thresholds 5 and 7 
   (MaxRegionError and MaxEndErrorCount), which must lie in the range [0,100]. 
   Most scores are separated into <= and > the threshold, except for 
   NormedMatchCount1, NormedMatchCount2 and NormedContigLength1, which use >= 
   and <. The exponential quantile must lie in the range [0,1] and uses <= and 
   > in all cases, since NormedMatchCount1, NormedMatchCount2 and 
   NormedContigLength1 have to be reversed anyway to enable exponential 
   fittings.
 
 -> The output is an R data file containing the trained random forests. 
    The default file name is "surankco_rfs.RData" but can be changed.
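
 The range rules for manual thresholds and the direction of the comparison can 
 be expressed as a small check. This is a minimal Python sketch of the rules as 
 stated above, not the SuRankCo implementation; the function names are made up 
 for illustration, and the sketch does not say which side of a split counts as 
 the better class.

```python
# 1-based positions of the thresholds that use the range [0, 100]
# (MaxRegionError and MaxEndErrorCount according to the readme).
PERCENT_POSITIONS = {5, 7}

# Scores the readme says are separated with >= / < instead of <= / >.
REVERSED_SCORES = {"NormedMatchCount1", "NormedMatchCount2",
                   "NormedContigLength1"}


def parse_manual_thresholds(text):
    """Parse a comma-separated threshold string and check the ranges."""
    values = [float(item) for item in text.split(",")]
    for position, value in enumerate(values, start=1):
        upper = 100.0 if position in PERCENT_POSITIONS else 1.0
        if not 0.0 <= value <= upper:
            raise ValueError(
                f"threshold {position} = {value} is outside [0, {upper:g}]")
    return values


def on_inclusive_side(score_name, value, threshold):
    """True if `value` is on the side of the split that includes the threshold."""
    if score_name in REVERSED_SCORES:
        return value >= threshold
    return value <= threshold


print(parse_manual_thresholds("0.95,0.95,0.05,0.05,5,0.05,5,0.95"))
```

 Checking the string before calling surankco-training catches a value such as 
 5 in a position that only allows [0,1].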
 
surankco-prediction
 - Expects one SuRankCo feature file with suffix ".features.txt".
 - Expects an R data file with a random forest trained by surankco-training and 
   the suffix ".RData".

 -> The output is a table formatted text file (named "surankco_results.txt" by 
    default, but this can be changed), sorted by decreasing scores.
	
    The file contains four columns:
    "Assembly"            : the filename of the corresponding assembly file
    "ReadQuality"         : the filename of the corresponding read quality file
    "ContigID"            : the contig ID as in the corresponding assembly and 
                            feature file
    "SurankcoContigScore" : the final SuRankCo contig score

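The results table of surankco-prediction can be post-processed with a few lines 
of code. The Python sketch below ASSUMES that the file is tab-separated and 
starts with a header row holding the four column names listed above; the readme 
does not state the delimiter, so it may need adjusting. The function name 
top_contigs and the sample rows are made up for illustration.

```python
import csv
import io


def top_contigs(handle, count=10):
    """Return (ContigID, score) pairs for the `count` highest scores."""
    # Assumption: tab-separated text with a header row.
    rows = list(csv.DictReader(handle, delimiter="\t"))
    rows.sort(key=lambda row: float(row["SurankcoContigScore"]), reverse=True)
    return [(row["ContigID"], float(row["SurankcoContigScore"]))
            for row in rows[:count]]


# Made-up rows, only to show the expected layout.
sample = (
    "Assembly\tReadQuality\tContigID\tSurankcoContigScore\n"
    "assembly_new.ace\tassembly_new.qual\tcontig00002\t0.40\n"
    "assembly_new.ace\tassembly_new.qual\tcontig00001\t0.90\n"
)
print(top_contigs(io.StringIO(sample), count=1))
# prints: [('contig00001', 0.9)]
```

For a real run, an opened "surankco_results.txt" would be passed in place of 
the in-memory sample.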

INPUT FILE & EXECUTION EXAMPLES
-------------------------------
The following file lists illustrate the required file combinations and suffixes.

Training
surankco-feature:
	assembly1.ace, assembly2.ace, assemblyXYZ.ace
	assembly1.qual, assembly2.qual, assemblyXYZ.qual
surankco-score:
	assembly1.contigs.fasta, assembly2.contigs.fasta, assemblyXYZ.contigs.fasta
	assembly1.ref.fasta, assembly2.ref.fasta, assemblyXYZ.ref.fasta
surankco-training:
	assembly1.features.txt, assembly2.features.txt, assemblyXYZ.features.txt
	assembly1.scores.txt, assembly2.scores.txt, assemblyXYZ.scores.txt

Prediction
surankco-feature:
	assembly_new.ace 
	assembly_new.qual
surankco-prediction:
	assembly_new.features.txt
	surankco_rfs.RData

The following calls show example parameter uses (short flags and long flags).
The example calls assume all files to be in the execution directory (file list 
parameter) or in the given directory (-d). Full pathnames can be used as well.
Note that the file list and directory parameters are mutually exclusive, but 
giving one of them is mandatory. Other parameters are optional due to default 
values. If both are given, manual.thresholds takes priority over 
exponential.quantile.

surankco-feature 
    -a assembly1.ace,assembly2.ace,assemblyXYZ.ace
    -d /home/user/mydata/
    -r qual
    -q illumina18
    -s \\\\.\\\\d+-
    -t 4
    -m 32
    -k
    -g 1234567,2000000,111111
	
    --assemblies=assembly1.ace,assembly2.ace,assemblyXYZ.ace
    --directory=/home/user/mydata/
    --read.quality.format=qual
    --fastq.version=illumina18
    --split.regex=\\\\.\\\\d+-
    --threads=4
    --memory=32
    --kmer.features
    --expected.genome.size=1234567,2000000,111111
	
surankco-score 
    -a assembly1.contigs.fasta,assembly2.contigs.fasta,assemblyXYZ.contigs.fasta
    -d /home/user/mydata/
    -f contigs.fasta
    -r ref.fasta
    -p /home/user/mydata/surankco_score_histograms.pdf
    -m 32
	
    --assemblies=assembly1.contigs.fasta,assembly2.contigs.fasta,assemblyXYZ.contigs.fasta
    --directory=/home/user/mydata/
    --assembly.suffix=contigs.fasta
    --reference.suffix=ref.fasta
    --pdf.histograms=/home/user/mydata/surankco_score_histograms.pdf
    --memory=32
	
surankco-training 
    -f assembly1.features.txt,assembly2.features.txt,assemblyXYZ.features.txt
    -d /home/user/mydata/
    -o /home/user/mydata/mydata_rfs.RData
    -e 0.5
    -m 0.95,0.95,0.05,0.05,5,0.05,5,0.95

    --features=assembly1.features.txt,assembly2.features.txt,assemblyXYZ.features.txt
    --directory=/home/user/mydata/
    --output.filename=/home/user/mydata/mydata_rfs.RData
    --exponential.quantile=0.5
    --manual.thresholds=0.95,0.95,0.05,0.05,5,0.05,5,0.95
	
surankco-prediction 
    -f assembly_new.features.txt
    -r surankco_rfs.RData
    -o my_results.txt

    --features=assembly_new.features.txt
    --random.forests=surankco_rfs.RData
    --output.filename=my_results.txt
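
The two workflows (training and prediction) can be chained in a small driver 
script. The following Python sketch only assembles the command lines from the 
short flags shown above and runs them in order; the function names and the way 
paths are passed in are made up for illustration, and all optional parameters 
are left at their defaults.

```python
import subprocess


def training_commands(data_dir, model_file):
    """Command lines for the training workflow, in execution order."""
    return [
        ["surankco-feature", "-d", data_dir],
        ["surankco-score", "-d", data_dir],
        ["surankco-training", "-d", data_dir, "-o", model_file],
    ]


def prediction_commands(assembly_file, feature_file, model_file, result_file):
    """Command lines for the prediction workflow, in execution order."""
    return [
        ["surankco-feature", "-a", assembly_file],
        ["surankco-prediction", "-f", feature_file, "-r", model_file,
         "-o", result_file],
    ]


def run_all(commands):
    """Run each command in turn and stop at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


for command in training_commands("/home/user/mydata/", "mydata_rfs.RData"):
    print(" ".join(command))
```

Calling run_all(training_commands(...)) would execute the three modules one 
after another, provided the SuRankCo executables are on the PATH.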


TROUBLESHOOTING
---------------
P1: [OSX] Error loading package "randomForest" in R 3.1
 Error in dyn.load(file, DLLpath = DLLpath, ...) : 
  unable to load shared object '/Library/Frameworks/R.framework/Versions/3.1/
  Resources/library/randomForest/libs/randomForest.so':
  dlopen(/Library/Frameworks/R.framework/Versions/3.1/Resources/library/
  randomForest/libs/randomForest.so, 6): Library not loaded: /Library/
  Frameworks/R.framework/Versions/3.0/Resources/lib/libgfortran.2.dylib
  Referenced from: /Library/Frameworks/R.framework/Versions/3.1/Resources/
  library/randomForest/libs/randomForest.so
  Reason: image not found
 Error: package or namespace load failed for ‘randomForest’

 randomForest tries to load external libraries of the older R version 3.0.*. 
 However, providing them by installing R 3.0.* in parallel leads to 
 incompatibilities (see next problem P2). Therefore, using only R 3.0.3 is 
 recommended for OSX.

P2: [OSX] Memory error running surankco-training with R 3.1
 *** caught segfault ***
 address 0x18, cause 'memory not mapped'

 R 3.1 doesn't provide the required versions of external libraries for the
 randomForest package (see previous problem P1). Providing the libraries of 
 older R versions leads to incompatibilities with R 3.1 and seems to produce a 
 segfault. Therefore, using only R 3.0.3 is recommended for OSX.

P3: Permission denied when executing surankco, e.g.:
 -bash: /home/kuhringm/workspace/surankco/surankco-score: Permission denied
 
 The surankco scripts might have lost their execution permission. In a terminal, 
 go to the surankco directory and run "chmod +x surankco-feature surankco-score 
 surankco-training surankco-prediction".
 
P4: surankco-score terminates with following Permission denied error:
 sh: 1: /home/kuhringm/workspace/surankco/r/pslMatchFilter: Permission denied
 Error: pslMatchFilter not successfully executed: 126
 Execution halted

 pslMatchFilter needs execution permissions. In a terminal, go to the surankco
 directory and run "chmod +x r/pslMatchFilter".

P5: surankco-score prints dexp/NaN warning messages:
 Warning messages:
 1: In dexp(x, estimate, log = TRUE) : NaNs produced
 2: In dexp(x, rate = fit$estimate) : NaNs produced

 Some scores don't have enough variance to fit an exponential distribution. This
 is not critical in surankco-score, but might be a problem for the training
 (see problem P7).

P6: surankco-training prints dexp/NaN warning messages:
 Warning messages:
 1: In dexp(x, estimate, log = TRUE) : NaNs produced
 
 Some scores don't have enough variance to fit an exponential distribution.
 The training will not be possible (see problem P7).

P7: surankco-training terminates with following training error
 Error in randomForest.default(x = input[index, ], y = targets[index, i],  :
  Need at least two classes to do classification.
 Calls: rfTraining -> randomForest -> randomForest.default
 Execution halted
 
 Some scores don't provide two classes. Either the variance of these scores is
 indeed too low or the thresholds for class separation are poorly chosen. The
 variance might be increased by providing more contigs for the training.
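
 Whether a threshold splits a score into two non-empty classes can be checked 
 before the training. This is a minimal Python sketch that assumes the default 
 <= / > separation and a plain list of score values (e.g. one column read from 
 the ".scores.txt" files); the function names are made up for illustration and 
 the check is not part of SuRankCo.

```python
def class_sizes(values, threshold):
    """Count values at or below `threshold` and values above it."""
    low = sum(1 for value in values if value <= threshold)
    return low, len(values) - low


def has_two_classes(values, threshold):
    """True if both sides of the threshold contain at least one value."""
    low, high = class_sizes(values, threshold)
    return low > 0 and high > 0


print(class_sizes([0.1, 0.2, 0.9], 0.5))   # prints (2, 1)
print(has_two_classes([0.1, 0.2], 0.5))    # prints False
```

 A result of False for any score points to the situation described above: 
 either a different threshold or more contigs are needed.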
 
	
--------------------------------------------------------------------------------
Copyright (c) 2014, 
Mathias Kuhring, KuhringM@rki.de, Robert Koch Institute, Germany, 
All rights reserved.

Redistribution and use in source and binary forms, with or without modification, 
are permitted provided that the following conditions are met:
    * Redistributions of source code must retain the above copyright
      notice, this list of conditions and the following disclaimer.
    * Redistributions in binary form must reproduce the above copyright
      notice, this list of conditions and the following disclaimer in the
      documentation and/or other materials provided with the distribution.
    * The name of the author may not be used to endorse or promote products
      derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND 
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED 
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 
DISCLAIMED. IN NO EVENT SHALL Mathias Kuhring BE LIABLE FOR ANY DIRECT, 
INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, 
BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, 
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF 
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE 
OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF 
ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Source: readme.txt, updated 2015-06-08