============================================================
Spines version 1.12 (October 2012)
Software for analysis of large genomic data sets
Spines copyright (c) Vertebrate Genome Biology Group, Broad Institute, 7 Cambridge Center, Cambridge, MA 02142
FFTReal copyright (c) Laurent de Soras

============================================================
Licensing
Spines is free software: you can redistribute it and/or modify it under the terms of the Lesser GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the Lesser GNU General Public License for more details.
You should have received a copy of the Lesser GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.

  1. Contents
    IMPORTANT: the executables provided with the package require the gcc 4.4.3 runtime libraries. For any other gcc version, cleanly recompile all executables on your system by running

make clean
make

  1. Supported Platforms
    Spines runs exclusively on 64-bit Linux and has been tested on the SUSE and Ubuntu (8.04) distributions. While not actively supported or tested, the code also compiles and runs on MacOS X 10.4.11 (Intel) with gcc 4.0.1 when built with ‘make clean UNSUPPORTED=yes’ followed by ‘make UNSUPPORTED=yes’. Parallelization was tested on a server farm running LSF (Load Sharing Facility) on nodes that are fully accessible for communication via TCP/IP.
    NOTE: the makefile system requires csh to be installed.

  2. Modules

  3. Satsuma: high-sensitivity alignments through cross-correlation.
  4. SatsumaSynteny: Satsuma in a battleship-style search framework.

  5. References and credits
    For Satsuma and SatsumaSynteny, please reference:
    Grabherr MG, Russell P, Meyer M, Mauceli E, Alfoldi J, Di Palma F, Lindblad-Toh K. Genome-wide synteny through highly sensitive sequence alignment: Satsuma. Bioinformatics. 2010 May 1;26(9):1145-51. Epub 2010 Mar 5.

  6. Satsuma
    Satsuma aligns two fasta sequences exhaustively. For a small example, see the script ./test_Satsuma which runs on small sequences provided with the distribution for testing purposes.

Command line arguments (and defaults):
-q<string> : query fasta sequence
-t<string> : target fasta sequence
-o<string> : output directory
-l<int> : minimum alignment length (def=0)
-t_chunk<int> : target chunk size (def=4096)
-q_chunk<int> : query chunk size (def=4096)
-n<int> : number of blocks (def=1)
-lsf<bool> : submit jobs to LSF (def=0)
-nosubmit<bool> : do not run jobs (def=0)
-nowait<bool> : do not wait for jobs (def=0)
-chain_only<bool> : only chain the matches (def=0)
-refine_only<bool> : only refine the matches (def=0)
-min_prob<double> : minimum probability to keep match (def=0.99999)
-proteins<bool> : align in protein space (def=0)
-cutoff<double> : signal cutoff (def=1.8)
-same_only<bool> : only align sequences that have the same name. (def=0)
-self<bool> : ignore self-matches. (def=0)

Note that Satsuma calls other executables (HomologyByXCorr, MergeXCorrMatches), and thus has to be invoked by either supplying the full path of the executable, or “./Satsuma” (see test_Satsuma).
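
As a minimal sketch of the invocation rule above, the following Python helper builds a Satsuma command line that always starts with the absolute path of the executable. The function name, the install directory argument, and the file names are hypothetical; only the options -q, -t, -o and -n come from the argument list above, and the sketch assumes each option is followed by its value as a separate argument.

```python
import os


def build_satsuma_command(install_dir, query, target, out_dir, n=1):
    """Return an argument list that invokes Satsuma by its full path.

    install_dir is the (hypothetical) directory holding the Satsuma
    executable and the helper executables it calls.
    """
    # Satsuma calls other executables, so it must be started with a
    # full path (or ./Satsuma) rather than a bare command name.
    exe = os.path.join(os.path.abspath(install_dir), "Satsuma")
    return [
        exe,
        "-q", query,    # query fasta sequence
        "-t", target,   # target fasta sequence
        "-o", out_dir,  # output directory
        "-n", str(n),   # see -n in the argument list
    ]
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the Python standard library to start the alignment.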
Notes:

  • If the output directory is not empty, Satsuma will not overwrite any files but
    exit with an error message.
  • The option “-n” specifies the number of processes, each of which takes
    chunks of the target sequence of size -t_chunk * 3/4. If the number of processes exceeds what the available target sequence can supply, the number is adjusted down.
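
The arithmetic in the last note can be illustrated with a short Python sketch. The exact rule Satsuma applies when adjusting the process count is not stated here, so the function below is an assumption: it supposes that every process needs at least one chunk of size -t_chunk * 3/4 and caps the requested count accordingly. The function name is hypothetical.

```python
import math


def effective_processes(target_length, n_requested, t_chunk=4096):
    """Estimate how many processes can be kept busy on a target sequence.

    Assumption (not confirmed by the Satsuma source): each process takes
    chunks of t_chunk * 3/4 bases, and a process without a chunk is dropped.
    """
    step = t_chunk * 3 // 4  # 3072 bases with the default t_chunk of 4096
    max_useful = max(1, math.ceil(target_length / step))
    return min(n_requested, max_useful)
```

Under this assumption, a 10,000-base target with the default chunk size supplies four chunks, so a request for eight processes would be reduced to four.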

  7. SatsumaSynteny
    SatsumaSynteny aligns two fasta sequences syntenically, using a battleship-style search. For a small example, see the script ./test_SatsumaSynteny, which runs on sequences provided with the distribution for testing purposes.

Command line arguments (and defaults):
-q<string> : query fasta sequence
-t<string> : target fasta sequence
-o<string> : output directory
-l<int> : minimum alignment length (def=0)
-t_chunk<int> : target chunk size (def=4096)
-q_chunk<int> : query chunk size (def=4096)
-t_chunk_seed<int> : target chunk size (seed) (def=8192)
-q_chunk_seed<int> : query chunk size (seed) (def=8192)
-n<int> : number of processes (def=1)
-ni<int> : number of initial search blocks (def=-1)
-lsf<bool> : submit jobs to LSF (def=0)
-lsf_ini<bool> : submit jobs to LSF (def=0)
-nosubmit<bool> : do not run jobs (def=0)
-nowait<bool> : do not wait for jobs (def=0)
-chain_only<bool> : only chain the matches (def=0)
-refine_only<bool> : only refine the matches (def=0)
-do_refine<bool> : refinement steps (def=0)
-min_prob<double> : minimum probability to keep match (def=0.99999)
-proteins<bool> : align in protein space (def=0)
-cutoff<double> : signal cutoff (def=1.8)
-cutoff_seed<double> : signal cutoff (seed) (def=2)
-m<int> : number of jobs per block (def=32)
-resume<string> : resumes w/ the output of a previous run (xcorrdata) (def=)
-seed<string> : loads seeds and runs from there (xcorrdata) (def=)
-pixel<int> : number of blocks per pixel (def=24)
-nofilter<bool> : do not pre-filter seeds (slower runtime) (def=0)
-seeddist<string> : distance between pre-filter seeds (increase for close genomes) (def=1)
-dups<bool> : allow for duplications in the query sequence (def=0)
-filterwidth<string> : width of the seed filter (def=2)

Note that SatsumaSynteny calls other executables (FilterGridSeeds, HomologyByXCorr, HomologyByXCorrSlave, MergeXCorrMatches), and thus has to be invoked by either supplying the full path of the executable, or “./SatsumaSynteny” (see test_SatsumaSynteny).
Notes:

  • If the output directory is not empty, SatsumaSynteny will not overwrite any
    files but exit with an error message.
  • Idling processes self-terminate after two minutes. The overall alignments
    will still complete, but using fewer processes.
  • If alignment runs locally but not on the server farm, check whether
    processes on the farm can communicate via TCP/IP.
  • Currently, each process loads the entire sequences into RAM. When
    comparing large genomes, we strongly recommend making sure that the CPUs have enough RAM available (roughly the size of both genomes in bytes).
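
The RAM guideline in the last note can be turned into a rough pre-flight estimate. The Python sketch below is only an approximation under stated assumptions: it uses the on-disk size of each fasta file as a stand-in for the size of the genome in bytes (headers and line breaks make this a slight overestimate), and it multiplies by the number of processes on the assumption that they all run on the same machine. The function name is hypothetical.

```python
import os


def estimate_ram_bytes(query_fasta, target_fasta, n_processes=1):
    """Return (bytes per process, bytes for all processes on one machine).

    Assumption: file size approximates genome size, and every process
    holds both sequences in memory at the same time.
    """
    per_process = os.path.getsize(query_fasta) + os.path.getsize(target_fasta)
    return per_process, per_process * n_processes
```

Comparing the second value against the memory of a node gives a quick indication of how many processes that node can reasonably host.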

Parameter choice, execution and data preparation:

  • The default parameters should work well for most genomes.
  • SatsumaSynteny runs most efficiently on either multi-processor machines
    or on clusters that are tightly coupled (fast access to files shared by the
    control process and the slaves).
  • Especially for larger genomes, we recommend leaving one CPU dedicated
    to the control process SatsumaSynteny.
  • For larger genomes (>1Gb), we recommend using one chromosome of
    one genome as the target sequence and the entire other genome as the query sequence, and processing alignments one query chromosome at a time. We tested this strategy successfully on a mammalian genome pair.
  • To include large-scale duplications in the query sequence (in addition to the target sequence), use the option -dups.
  • If using the option -nofilter, the number of initial searches (-ni) should be higher than the number of processes (-n) to ensure that subsequent processes have sufficient seeds. Note that initial searches will be queued to a number of processes specified by -n.
  • When many processes search a tight space, the number of pixels per CPU (-m) should be small (e.g. ‘-m 1’ as in the sample script/data set) to avoid unbalanced load (i.e. some processes get all the pixels while others are starved, since they overlap). However, a small value for -m increases inter-process communication, which should be a consideration when deploying hundreds of processes.
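
For the per-chromosome strategy recommended above for larger genomes, a genome fasta file first has to be split into one file per sequence. The Python sketch below shows one way to do this with the standard library only. It is not part of the Spines package; the function names and the output file naming scheme (record_0.fasta, record_1.fasta, ...) are hypothetical, and each sequence is written on a single line.

```python
import os


def read_fasta(path):
    """Yield (header, sequence) for each record in a fasta file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            if line.startswith(">"):
                # A new header closes the previous record, if any.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_fasta(path, out_dir):
    """Write each record to its own file; return [(header, file path), ...]."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for index, (header, sequence) in enumerate(read_fasta(path)):
        out_path = os.path.join(out_dir, f"record_{index}.fasta")
        with open(out_path, "w") as out:
            out.write(f">{header}\n{sequence}\n")
        written.append((header, out_path))
    return written
```

Each file returned by split_fasta can then be used as one of the two inputs of a separate run, each with its own empty output directory.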

  8. Output files
    Alignment coordinates:
    <outdir>/satsuma_summary.out: all alignment coordinates (Satsuma only)
    <outdir>/satsuma_summary.chained.out: final coordinates (Satsuma and SatsumaSynteny)
    Contents:
  • Target sequence name (provided by fasta)
  • First target base
  • Last target base
  • Query sequence name (provided by fasta)
  • First query base
  • Last query base
  • Identity
  • Orientation

EXAMPLE:
chrX:0-800000 2001 2287 chrX:0-1000000 3258 3560 0.622378 +
chrX:0-800000 2321 2565 chrX:0-1000000 3610 3853 0.590164 +
chrX:0-800000 2607 2768 chrX:0-1000000 3935 4096 0.614907 +

Note: ‘space’ in fasta names is permissible for alignment, but all spaces will be replaced with “_” in the output files.
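
The coordinate format described above can be read with a few lines of Python. This is a minimal sketch, not a tool shipped with Spines: the class and function names are hypothetical, and it assumes the eight columns are separated by whitespace, which is safe for names because spaces in fasta names are replaced with “_” in the output files.

```python
from typing import NamedTuple


class Alignment(NamedTuple):
    target_name: str
    target_first: int
    target_last: int
    query_name: str
    query_first: int
    query_last: int
    identity: float
    orientation: str


def parse_summary_line(line):
    """Parse one line of a satsuma_summary file into an Alignment."""
    fields = line.split()
    if len(fields) != 8:
        raise ValueError(f"expected 8 columns, found {len(fields)}: {line!r}")
    t_name, t_first, t_last, q_name, q_first, q_last, identity, orient = fields
    return Alignment(
        t_name, int(t_first), int(t_last),
        q_name, int(q_first), int(q_last),
        float(identity), orient,
    )
```

Applied to the first example line above, the parser returns a record whose target_first is 2001, whose identity is 0.622378, and whose orientation is “+”.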
Other output:
<outdir>/MergeXCorrMatches.out: readable alignments (Satsuma only)
<outdir>/MergeXCorrMatches.refined.out: final readable alignments (Satsuma and SatsumaSynteny)

  9. Conversion to MizBee format
    Run BlockDisplaySatsuma by supplying the query and target genome fasta files and the Satsuma summary file.

Options:
-i<string> : satsuma summary file
-t<string> : target fasta file
-q<string> : query fasta file
-min<int> : minimum block size (def=3)
-s<int> : minimum scaffold size (def=100000)
-transpose<bool> : switch query and target (def=0)