============================================================
Spines version 1.12 (October 2012)
Software for analysis of large genomic data sets
Spines copyright (c) Vertebrate Genome Biology Group, Broad Institute, 7 Cambridge Center, Cambridge, MA 02142
FFTReal copyright (c) Laurent de Soras
============================================================
Licensing
Spines is free software: you can redistribute it and/or modify it under the terms of the Lesser GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the Lesser GNU General Public License for more details.
You should have received a copy of the Lesser GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.
IMPORTANT: the executables provided with the package require the gcc 4.4.3 runtime libraries. For all other gcc versions, you need to cleanly recompile all executables on your system via
make clean
make
Supported Platforms
Spines is supported only on 64-bit Linux and has been tested on the SUSE and Ubuntu (8.04) distributions. Note: while not actively supported or tested, the code also compiles and runs on Mac OS X 10.4.11 (Intel) with gcc 4.0.1 when built with ‘make clean UNSUPPORTED=yes’ followed by ‘make UNSUPPORTED=yes’. Parallelization was tested on a server farm running LSF (Load Sharing Facility) on nodes that are fully accessible for communication via TCP/IP.
NOTE: the makefile system requires csh to be installed.
Modules
Satsuma: high-sensitivity alignments through cross-correlation.
SatsumaSynteny: Satsuma in a battleship-style search framework.
References and credits
For Satsuma and SatsumaSynteny, please reference:
Grabherr MG, Russell P, Meyer M, Mauceli E, Alfoldi J, Di Palma F, Lindblad-Toh K. Genome-wide synteny through highly sensitive sequence alignment: Satsuma. Bioinformatics. 2010 May 1;26(9):1145-51. Epub 2010 Mar 5.
Satsuma
Satsuma aligns two fasta sequences exhaustively. For a small example, see the script ./test_Satsuma which runs on small sequences provided with the distribution for testing purposes.
Command line arguments (and defaults):
-q<string> : query fasta sequence
-t<string> : target fasta sequence
-o<string> : output directory
-l<int> : minimum alignment length (def=0)
-t_chunk<int> : target chunk size (def=4096)
-q_chunk<int> : query chunk size (def=4096)
-n<int> : number of blocks (def=1)
-lsf<bool> : submit jobs to LSF (def=0)
-nosubmit<bool> : do not run jobs (def=0)
-nowait<bool> : do not wait for jobs (def=0)
-chain_only<bool> : only chain the matches (def=0)
-refine_only<bool> : only refine the matches (def=0)
-min_prob<double> : minimum probability to keep match (def=0.99999)
-proteins<bool> : align in protein space (def=0)
-cutoff<double> : signal cutoff (def=1.8)
-same_only<bool> : only align sequences that have the same name. (def=0)
-self<bool> : ignore self-matches. (def=0)
Note that Satsuma calls other executables (HomologyByXCorr, MergeXCorrMatches), and thus has to be invoked by either supplying the full path of the executable, or “./Satsuma” (see test_Satsuma).
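The invocation rule above can be sketched as a small helper that always calls Satsuma by its full path. This is only an illustration: the install directory /opt/spines and the file names are hypothetical, and passing each flag and its value as separate arguments is an assumption (check test_Satsuma for the exact form used in the distribution).

```python
import os
import shlex

def satsuma_command(install_dir, query, target, outdir, n=1):
    """Build an argument list that calls Satsuma via its full path.

    Only flags listed in the argument table are used (-q, -t, -o, -n).
    """
    exe = os.path.join(os.path.abspath(install_dir), "Satsuma")
    return [exe, "-q", query, "-t", target, "-o", outdir, "-n", str(n)]

# Hypothetical install location and file names, for illustration only.
cmd = satsuma_command("/opt/spines", "query.fasta", "target.fasta", "out_dir", n=4)
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The resulting list can be handed to subprocess.run(cmd) unchanged; building it as a list avoids shell quoting problems with unusual file names.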
Notes:
If the output directory is not empty, Satsuma will not overwrite any files but will exit with an error message.
The option “-n” specifies the number of processes, each of which takes chunks of the target sequence of size -t_chunk * 3/4. If the number of processes exceeds what the available target sequence can fill, the number is adjusted down.
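As a rough illustration of the arithmetic in the note on “-n”, the sketch below computes the per-process chunk size and caps the process count for a short target. The capping rule (one process per chunk at most) is an assumption made for this example; the exact adjustment Satsuma applies is not described here.

```python
import math

def effective_chunk(t_chunk=4096):
    """Chunk of target sequence taken per process: -t_chunk * 3/4."""
    return t_chunk * 3 // 4

def useful_processes(target_length, n_requested, t_chunk=4096):
    """Cap the process count at the number of chunks the target can fill.

    Assumed rule for illustration, not taken from the Satsuma source.
    """
    max_useful = max(1, math.ceil(target_length / effective_chunk(t_chunk)))
    return min(n_requested, max_useful)

print(effective_chunk())                # 3072 with the default -t_chunk
print(useful_processes(10_000, 8))      # short target: fewer than 8 processes
print(useful_processes(1_000_000, 8))   # long target: all 8 processes usable
```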
SatsumaSynteny
SatsumaSynteny aligns two fasta sequences syntenically, using a battleship-style search. For a small example, see the script ./test_SatsumaSynteny, which runs on sequences provided with the distribution for testing purposes.
Command line arguments (and defaults):
-q<string> : query fasta sequence
-t<string> : target fasta sequence
-o<string> : output directory
-l<int> : minimum alignment length (def=0)
-t_chunk<int> : target chunk size (def=4096)
-q_chunk<int> : query chunk size (def=4096)
-t_chunk_seed<int> : target chunk size (seed) (def=8192)
-q_chunk_seed<int> : query chunk size (seed) (def=8192)
-n<int> : number of processes (def=1)
-ni<int> : number of initial search blocks (def=-1)
-lsf<bool> : submit jobs to LSF (def=0)
-lsf_ini<bool> : submit jobs to LSF (def=0)
-nosubmit<bool> : do not run jobs (def=0)
-nowait<bool> : do not wait for jobs (def=0)
-chain_only<bool> : only chain the matches (def=0)
-refine_only<bool> : only refine the matches (def=0)
-do_refine<bool> : refinement steps (def=0)
-min_prob<double> : minimum probability to keep match (def=0.99999)
-proteins<bool> : align in protein space (def=0)
-cutoff<double> : signal cutoff (def=1.8)
-cutoff_seed<double> : signal cutoff (seed) (def=2)
-m<int> : number of jobs per block (def=32)
-resume<string> : resumes w/ the output of a previous run (xcorrdata) (def=)
-seed<string> : loads seeds and runs from there (xcorrdata) (def=)
-pixel<int> : number of blocks per pixel (def=24)
-nofilter<bool> : do not pre-filter seeds (slower runtime) (def=0)
-seeddist<string> : distance between pre-filter seeds (increase for close genomes) (def=1)
-dups<bool> : allow for duplications in the query sequence (def=0)
-filterwidth<string> : width of the seed filter (def=2)
Note that SatsumaSynteny calls other executables (FilterGridSeeds, HomologyByXCorr, HomologyByXCorrSlave, MergeXCorrMatches), and thus has to be invoked by either supplying the full path of the executable, or “./SatsumaSynteny” (see test_SatsumaSynteny).
Notes:
If the output directory is not empty, SatsumaSynteny will not overwrite any files but will exit with an error message.
Idling processes self-terminate after two minutes. The overall alignment will still complete, but with fewer processes.
If the alignment runs locally but not on the server farm, check whether processes on the farm can communicate via TCP/IP.
Currently, each process loads the entire sequences into RAM. For comparisons of large genomes, we strongly recommend making sure that the CPUs have enough RAM available (approximately the size of both genomes in bytes).
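The memory note above suggests a simple back-of-the-envelope estimate. The sketch below assumes that the combined genome size in bytes approximates the footprint of one process, and that processes sharing a machine each hold their own copy; both are assumptions made for illustration, and the genome sizes used are hypothetical.

```python
def estimated_ram_bytes(query_bytes, target_bytes, n_processes):
    """Return (per-process, total) RAM estimates in bytes.

    Per process: roughly the size of both genomes.
    Total: per-process estimate times the processes on one machine.
    """
    per_process = query_bytes + target_bytes
    return per_process, per_process * n_processes

# Hypothetical genome sizes (bytes) and process count.
per_proc, total = estimated_ram_bytes(3_000_000_000, 2_700_000_000, 8)
print(f"per process: {per_proc / 1e9:.1f} GB, 8 processes on one node: {total / 1e9:.1f} GB")
```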
Parameter choice, execution and data preparation:
The default parameters should work well for most genomes.
SatsumaSynteny runs most efficiently on either multi-processor machines or tightly coupled clusters (fast access to files shared by the control process and the slaves).
Especially for larger genomes, we recommend leaving one CPU dedicated to the control process SatsumaSynteny.
For larger genomes (>1Gb), we recommend using one chromosome of one genome as the target sequence and the entire other genome as the query sequence, and processing alignments one query chromosome at a time. We tested this strategy successfully on a mammalian genome pair.
To include large-scale duplications in the query sequence (in addition to the target sequence), use the option -dups.
If using the option -nofilter, the number of initial searches (-ni) should be higher than the number of processes (-n) to ensure that subsequent processes have sufficient seeds. Note that initial searches will be queued to the number of processes specified by -n.
When many processes search a tight space, the number of pixels per CPU (-m) should be small (e.g. ‘-m 1’ as in the sample script/data set) to avoid unbalanced load (i.e. some processes get all the pixels while others are starved, since they overlap). However, a small value for -m increases inter-process communication, which should be considered when deploying hundreds of processes.
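The recommendations on -nofilter, -ni, -n and -m can be turned into a small pre-flight check. The function below is a hypothetical helper, not part of Spines. The threshold of 100 processes is an arbitrary stand-in for "hundreds of processes", and the default -ni of -1 is flagged as well because its meaning is not documented in this file.

```python
def check_synteny_settings(n, ni=-1, nofilter=False, m=32):
    """Return a list of warnings for a planned SatsumaSynteny run."""
    warnings = []
    # With -nofilter, initial searches (-ni) should exceed processes (-n).
    if nofilter and ni <= n:
        warnings.append(f"-nofilter is set: -ni ({ni}) should be higher than -n ({n})")
    # A small -m balances load but raises inter-process communication.
    if n >= 100 and m <= 1:  # 100 is an illustrative threshold only
        warnings.append(f"-m {m} with {n} processes may cause heavy inter-process communication")
    return warnings

for message in check_synteny_settings(n=8, ni=4, nofilter=True):
    print("warning:", message)
```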
Output files
Alignment coordinates:
<outdir>/satsuma_summary.out: all alignment coordinates (Satsuma only)
<outdir>/satsuma_summary.chained.out: final coordinates (Satsuma and SatsumaSynteny)
Contents (columns, in order):
Target sequence name (provided by fasta)
First target base
Last target base
Query sequence name (provided by fasta)
First query base
Last query base
Identity
Orientation
EXAMPLE:
chrX:0-800000 2001 2287 chrX:0-1000000 3258 3560 0.622378 +
chrX:0-800000 2321 2565 chrX:0-1000000 3610 3853 0.590164 +
chrX:0-800000 2607 2768 chrX:0-1000000 3935 4096 0.614907 +
Note: ‘space’ in fasta names is permissible for alignment, but all spaces will be replaced with “_” in the output files.
Other output:
<outdir>/MergeXCorrMatches.out: readable alignments (Satsuma only)
<outdir>/MergeXCorrMatches.refined.out: final readable alignments (Satsuma and SatsumaSynteny)
Conversion to MizBee format
Run BlockDisplaySatsuma by supplying the query and target genome fasta files and
Options:
-i<string> : satsuma summary file
-t<string> : target fasta file
-q<string> : query fasta file
-min<int> : minimum block size (def=3)
-s<int> : minimum scaffold size (def=100000)
-transpose<bool> : switch query and target (def=0)
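For downstream processing of the satsuma summary file passed via -i, the eight columns listed under "Output files" can be read with a few lines of Python. This is a minimal sketch that assumes whitespace-separated fields (which is safe because spaces in fasta names are replaced with “_”); the Block type and its field names are chosen here for illustration.

```python
from typing import NamedTuple

class Block(NamedTuple):
    target: str        # target sequence name
    t_first: int       # first target base
    t_last: int        # last target base
    query: str         # query sequence name
    q_first: int       # first query base
    q_last: int        # last query base
    identity: float
    orientation: str   # "+" in the example lines of this file

def parse_summary_line(line):
    """Parse one line of a satsuma summary file into a Block."""
    fields = line.split()
    if len(fields) != 8:
        raise ValueError(f"expected 8 columns, got {len(fields)}: {line!r}")
    return Block(fields[0], int(fields[1]), int(fields[2]),
                 fields[3], int(fields[4]), int(fields[5]),
                 float(fields[6]), fields[7])

block = parse_summary_line("chrX:0-800000 2001 2287 chrX:0-1000000 3258 3560 0.622378 +")
print(block)
```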