SFA-SPA: a suffix array based short peptide assembler for metagenomic data

SFA-SPA [1] is written in C++ (g++ 4.8.2) and has been tested on a
64-bit Linux box. The Boost library [2] (version 1.54.0 or higher) is
required during compilation.
The memory requirement largely depends on the size of the input read set.
Perl is used for scripting the various SPA components.
See INSTALL for the Perl library dependencies.

The main script is called spa_suit.pl and it takes as input nucleotide
fasta file(s) together with a parameter file. The parameter file can
be configured to run on either Illumina sequence data or (fragment)
read sequence data (for instance, generated by 454's technology). The
current implementation of SPA however mainly targets (and has been
tested on) Illumina's technology (fragment or paired-end). We will
extend our implementation in a future release to fully incorporate
other sequencing technologies.

Paired-end read convention used for Illumina data: Please make sure that
members of a read pair can be identified by /1 and /2. We will make
it work for other identifiers and formats in future updates. SFA-SPA
handles both paired and unpaired data. In fact, when the gene finder
is applied to interleaved paired-end data, it can happen that only
one read of a pair has a gene called on it. Thus, the actual input
to the SFA-SPA program is an interleaved file (with some read ids
missing). SFA-SPA can handle multiple input files, and therefore can also
handle a paired-end read dataset together with a separate fragment
read dataset.
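
To check the /1 and /2 convention before a run, something like the
following sketch can be used. It is not part of SFA-SPA; the function
name pair_summary is hypothetical, and it assumes an uncompressed FASTA
file in which the read id is the first token of each header line.

```shell
# pair_summary FILE: print three counts for a FASTA file:
#   complete pairs, reads whose mate is missing, headers without /1 or /2
pair_summary() {
    awk '
    /^>/ {
        id = substr($1, 2)                        # header token without ">"
        if (id ~ /\/[12]$/) {
            base = substr(id, 1, length(id) - 2)  # strip the "/1" or "/2" tag
            seen[base] = 1
            if (id ~ /\/1$/) first[base] = 1
            else second[base] = 1
        } else {
            untagged++
        }
    }
    END {
        for (base in seen) {
            if ((base in first) && (base in second)) pairs++
            else orphans++
        }
        printf "%d %d %d\n", pairs, orphans, untagged
    }' "$1"
}
```

A non-zero third number means some headers do not follow the
convention. Reads with a missing mate are expected after gene finding,
as described above.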

==================================

Prerequisites:
1. gcc-4.8.2 or newer
2. boost-1.54.0 or newer
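
The minimum versions above can be compared with a small helper such as
the sketch below. The function name version_ge is hypothetical, and the
sketch assumes GNU sort (for the -V version-sort option), which is
normally available on Linux.

```shell
# version_ge HAVE NEED: succeed if version HAVE is at least version NEED
version_ge() {
    # With version sort, the smaller of the two versions comes first;
    # HAVE >= NEED exactly when that first line is NEED.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}
```

For example, version_ge "$(g++ -dumpversion)" 4.8.2 && echo "g++ is recent enough"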

==================================

Program Installation:

1. Unpack the tarball file.
   $ tar xzvf sfaspa-0.2.0.tar.gz
2. Install Perl libraries, third-party software, and SFA-SPA
   $ cd sfaspa-0.2.0
   $ ./install.sh
3. Export paths.
   $ export SPA_HOME=`pwd`

Alternatively, edit your shell configuration file (e.g., .bashrc or
.bash_profile) to make the change permanent.
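
One way to make the SPA_HOME setting permanent is sketched below. The
function name persist_spa_home is hypothetical; it appends the export
line to the given configuration file only if that exact line is not
already there, so running it twice does not duplicate the entry.

```shell
# persist_spa_home DIR RCFILE: record SPA_HOME=DIR in RCFILE (e.g. ~/.bashrc)
persist_spa_home() {
    spa_dir=$1
    rc_file=$2
    line="export SPA_HOME=\"$spa_dir\""
    # Append only when the exact line is missing from the file
    grep -qxF -- "$line" "$rc_file" 2>/dev/null || printf '%s\n' "$line" >> "$rc_file"
}
```

For example, run persist_spa_home "$(pwd)" "$HOME/.bashrc" from inside
the unpacked SFA-SPA directory.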

Please note that all third-party programs [3,4,5,6] are included
with the SPA tarball for convenience only. We recommend using the
latest versions of these programs. Since we are not the developers of
these programs, we are not responsible for their correctness,
maintenance, or updates. Please follow any guidelines or updates from
their developers' web sites.

==================================

Set up environment variables:
$ source $SPA_HOME/sfaspa_env.sh

Run the above configuration script for each SFA-SPA assembly run.
Alternatively, copy the entire contents of sfaspa_env.sh into
your shell configuration file (e.g., .bashrc or .bash_profile) to make the changes permanent.

==================================

To run the peptide assembler:
The overall assembly process has three stages - (Stage 1) GF: Gene
finding stage, (Stage 2) SPA: peptide assembly stage, and (Stage 3)
PP: post-processing stage. Please see the SPA paper for details of
each stage. All of these stages (GF+SPA+PP) can be run together using
the main script spa_suit.pl; it takes DNA reads in FASTA format and
performs gene calling, peptide assembly, and post-processing in a
single batch.


Usage: spa_suit.pl [-o <output directory>] -p <parameter file> -i <FASTA files>
      -p, --parameters : [required] string 	  Program parameter file
      -i, --input      : [required] string 	  FASTA file(s) of DNA reads
      -o, --output     : [optional] string	  Output directory (default:.)
      -h, --help                           	  Print this message

Multiple input files can be provided by listing the file names after the -i option.
e.g., spa_suit.pl -o . -p parameter.illumina -i file1.fasta file2.fasta file3.fasta

The parameter file specifies the parameters for the programs in each of
the three stages. See parameter.illumina and parameter.generic for two
examples of parameter files that we have included (the former is for
running on Illumina data using FragGeneScan and the latter is for a
generic input nucleotide fasta file). The parameter file can (and
should) be edited depending on the quality and depth of coverage of
your input nucleotide sequence data.


Parameters for Stage 2 that will affect run time and output quality
are given below (along with suggested default values for Illumina
data). As noted above, these may need to be changed depending on the
input data (Lines starting with # are comments).

## Size of kmer
kmer    6

## Seed coverage
## Minimum depth of seed
seed-coverage	5
## When a dataset is large, 50% of seed coverage may be a good starting point. 
## When overall k-mer coverages are high, seed coverage 5 is another good setting. 
## When the data is small or k-mer coverages are low, lowering seed coverage to 2 seems reasonable.  

## Minimum supporting reads in a neighboring node.
## Links between the current node and a neighboring node that are
## supported by fewer reads than this value are ignored.
read-support	5
## When k-mer coverages are high, setting minimum read support to 5 is 
## practical. In case of low k-mer coverages, setting it to 2 could be good. 


## Minimum overlap length of suffix array search
overlap-length	15
## With a longer length (e.g., 25), the initial path identification step
## becomes faster but may generate shorter paths due to failures in
## suffix array searches for finding supporting reads.
## On the other hand, with a shorter length (e.g., 15), the same step
## may generate longer sequences because more reads are found by the
## suffix array search, but it takes more time because it needs more
## rounds of suffix array searches. In our experiments, a minimum
## overlap length of 15 gave the best accuracy for datasets with 100
## base-pair nucleotide reads (i.e., length 33 peptide reads).

## Reusing seed kmers
seed-reuse 0
## To avoid extra computation from repeatedly traversing the same
## sub-graph, seed k-mers found in previously assembled paths can be
## ignored when seeding new paths.
## This parameter is also useful in combination with seed coverage.
## When a small number of seeds is used (e.g., 50% of k-mers), it is better to reuse
## k-mers found in previously assembled paths for seeding paths.
## On the other hand, when the minimum seed coverage is low, it is practical not to reuse
## k-mers found in previously assembled paths.
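
When tuning these values, it can help to read them back from the
parameter file and compare overlap-length with the expected peptide
read length. The sketch below is not part of SFA-SPA. The function
names get_param and peptide_len are hypothetical, and it assumes only
the "name value" layout and "#" comment lines shown above.

```shell
# get_param FILE KEY: print the value of KEY from a parameter file,
# skipping comment lines that start with "#"
get_param() {
    awk -v key="$2" '!/^#/ && $1 == key { print $2; exit }' "$1"
}

# peptide_len NT_LEN: number of whole codons in a read of NT_LEN bases
# (an upper bound on the peptide read length, e.g. 100 bases -> 33)
peptide_len() {
    echo $(( $1 / 3 ))
}
```

For example, with 100 base-pair reads:
overlap=$(get_param param/parameter.illumina overlap-length)
[ "$overlap" -lt "$(peptide_len 100)" ] || echo "overlap-length is not shorter than the peptide reads"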

==================================

Example: An example simulated Illumina data set is provided in the
example directory.  Run SFA-SPA with a file of pre-configured program
options.
$ spa_suit.pl -p param/parameter.illumina -o example -i example/reads.fasta.gz &> example/spa.suit.log

In this example, SFA-SPA uses the preset program parameters in the parameter.illumina file.

==================================

The script spa_suit.pl calls three scripts: call_orfs.pl (GF
stage), run_spa.pl (SPA stage), and clean_spa.pl (PP stage). These
scripts can also be run separately, as shown below.

(GF stage) Call genes.
The following script takes DNA reads in FASTA format, performs gene
prediction by calling a gene predictor, and processes the gene products
to make the peptide reads used in SFA-SPA.

Usage: call_orfs.pl [options]
Examples:
---------
1. Call ORFs with FGS
$ call_orfs.pl -i example/reads.fasta.gz -o example/orf -n 2 -t complete &> example/orf.log

2. Call ORFs with MGA
$ call_orfs.pl -i example/reads.fasta.gz -o example/orf -g 1 -n 2 &> example/orf.log

(Get the size of suffix array partitions)
One of the required options for running SFA-SPA is the number of suffix arrays.
To get the suffix array partitions, run the following.
Usage: part [options]
Example:
part -i example/orf/fgs.noindel.faa

(SPA stage) Run SFA-SPA. 
The following script takes FASTA formatted amino acid reads. It
consists of two steps. First, it prepares the peptide assembler
inputs, and then it calls the assembly program.

Usage: run_spa.pl [options]
Example:
--------
$ run_spa.pl -s 1 -n 8 -M -i example/orf/fgs.noindel.faa -o example/spa -P -A &> example/spa.run.log

The script above is equivalent to the following two successive commands.
[1]. Make SFA-SPA input only.
Usage: prespa [options]
Example:
--------
$ prespa -s 1 -k 6 -i example/orf/fgs.noindel.faa -o example/spa &> example/spa/pre.log

[2]. Run SFA-SPA using the graph input and suffix array generated in the previous step.
Usage: spa [options]
Example:
-------
$ spa -s 1 -k 6 -n 8 -o example/spa --profile --alignment --pair-end -i example/orf/fgs.noindel.faa &> example/spa/spa.log
As in the example, pass the "--pair-end" flag to assemble paired-end reads.

(PP stage) Post processing: 
The following script performs post-processing of SFA-SPA assembly by
re-calling genes and disregarding short paths. Output: post.fasta

Usage: clean_spa.pl [options]
Example:
-------
$ clean_spa.pl -n 8 -a example/orf/fgs.noindel.faa -d example/orf/fgs.noindel.ffn -s example/spa/spa.fasta -r example/spa/spa.place.bin -o example/post -T 0 &> example/post/post.log
In this example, FGS is used for re-calling genes. 
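
The three stage examples above can be chained in one shell function, as
in the sketch below. The function name run_stages is hypothetical, and
the options are copied from the examples rather than chosen for any
particular dataset. Each stage runs only if the previous one succeeded.
The directory for post.log is created first because the shell opens a
redirection target before the command starts, so the redirect fails if
that directory does not exist yet.

```shell
# run_stages READS OUT: run the GF, SPA, and PP stages in sequence
#   READS  nucleotide FASTA file, e.g. example/reads.fasta.gz
#   OUT    base output directory, e.g. example
run_stages() {
    reads=$1
    out=$2
    # The log redirects below need these directories to exist already
    mkdir -p "$out/post" || return 1
    call_orfs.pl -i "$reads" -o "$out/orf" -n 2 -t complete > "$out/orf.log" 2>&1 &&
    run_spa.pl -s 1 -n 8 -M -i "$out/orf/fgs.noindel.faa" -o "$out/spa" -P -A > "$out/spa.run.log" 2>&1 &&
    clean_spa.pl -n 8 -a "$out/orf/fgs.noindel.faa" -d "$out/orf/fgs.noindel.ffn" \
        -s "$out/spa/spa.fasta" -r "$out/spa/spa.place.bin" -o "$out/post" -T 0 > "$out/post/post.log" 2>&1
}
```

For example: run_stages example/reads.fasta.gz example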

==================================

SPA citation
If you use SPA in your work, please cite the associated publication.
SPA: a short peptide assembler for metagenomic data, Youngik Yang and Shibu Yooseph, Nucl. Acids Res. (2013) 41 (8): e91. doi: 10.1093/nar/gkt118


Contact information:
Youngik Yang (yyang at jcvi.org) and Shibu Yooseph (syooseph at jcvi.org)

==================================

References:
1. SFA-SPA: a suffix array based short peptide assembler for metagenomics, Youngik Yang, CunCong Zhong, and Shibu Yooseph, 2014, Submitted.
2. Boost C++ libraries, http://www.boost.org.
3. SeqAn an efficient, generic C++ library for sequence analysis, Andreas Döring, David Weese, Tobias Rausch and Knut Reinert, BMC Bioinformatics, 9:11, 2008.
4. libdivsufsort - A lightweight suffix-sorting library, Yuta Mori, 2008, http://code.google.com/p/libdivsufsort/.
5. FragGeneScan: predicting genes in short and error-prone reads, Mina Rho, Haixu Tang, and Yuzhen Ye, Nucleic Acids Res. Nov 2010; 38(20): e191. 
6. MetaGeneAnnotator: detecting species-specific patterns of ribosomal binding site for precise gene prediction in anonymous prokaryotic and phage genomes, Hideki Noguchi, Takeaki Taniguchi, and Takehiko Itoh, DNA Res. Dec 2008; 15(6): 387–396. 
Source: README, updated 2014-11-06