spinner 1.0
A scaffolding tool for de novo assembly.
spinner -f <files file name = files.txt > -s <settings file name = settings.txt>
The files file contains details of the smalt output, the fasta/fastq files and the output names.
The default names are files.txt and settings.txt.
General Instructions:
(1)
You should start with a fasta or fastq file containing contigs/scaffolds from a
previous stage of assembly, and one or more libraries with pairs of reads.
(2)
It is recommended that the contigs are renamed using rename-contigs.pl
rename-contigs <in> <out> <file to store old names against new names>
Keeping the old names is possible but slows spinner (see below).
(3)
Align the pairs against the renamed fastq/fasta file using smalt.
For each library, concatenate all results into one file.
This only needs to be done once to scaffold the same data many times.
(4) Write/edit "files file" and "settings file" as described below
(5) Run spinner eg
<dir>/spinner
if your files have the default names or
<dir>/spinner -f name1.txt -s name2.txt
(6) For better results, repeat from stage (4) with altered parameters.
Look at spinner output for lines starting DoScaffolding and make sure that each
stage is reducing the number of scaffolds. Otherwise remove or alter that stage
in the settings file.
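A minimal Python sketch for step (6), assuming spinner's output has been saved to a text file. The layout of the DoScaffolding lines is not described here, so the helper (a hypothetical name, not part of spinner) only pulls those lines out for inspection and does not try to parse them.

```python
def doscaffolding_lines(log_text):
    """Return the lines of a saved spinner log that start with 'DoScaffolding'."""
    return [line for line in log_text.splitlines()
            if line.startswith("DoScaffolding")]


def doscaffolding_lines_from_file(path):
    """Same as above, reading the log from a file (path chosen by the user)."""
    with open(path) as handle:
        return doscaffolding_lines(handle.read())
```

Comparing these lines stage by stage should show whether the number of scaffolds is still going down.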
+++++++++++ files file +++++++++++++++++++++++++++
The files file is a set of pairs of lines, the first of which specifies the file
or option and the second of which specifies the corresponding files/parameters.
The allowed line types are:
fastq-in or fasta-in (must be 1 and only 1 of these)
smalt (must be >=1 of these)
fasta-out
fastq-out
contigs-out
GDF-out
number-in-contig-name (best to leave this out)
See the example files.txt
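The pair-of-lines layout and the two counting rules above can be checked before a run with a short Python sketch. This is not spinner's own parser: the function name is hypothetical, and it assumes that blank lines can be skipped and that anything after '#' on the first line of a pair is a comment.

```python
from collections import defaultdict

ALLOWED = {"fastq-in", "fasta-in", "smalt", "fasta-out", "fastq-out",
           "contigs-out", "GDF-out", "number-in-contig-name"}


def parse_files_file(text):
    """Group the second line of each pair under the line type of the first."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) % 2:
        raise ValueError("expected pairs of lines")
    entries = defaultdict(list)
    for key_line, value_line in zip(lines[0::2], lines[1::2]):
        key = key_line.split("#", 1)[0].strip()  # drop a trailing comment
        if key not in ALLOWED:
            raise ValueError(f"unknown line type: {key!r}")
        entries[key].append(value_line)
    n_inputs = len(entries.get("fastq-in", [])) + len(entries.get("fasta-in", []))
    if n_inputs != 1:
        raise ValueError("need exactly one of fastq-in or fasta-in")
    if not entries.get("smalt"):
        raise ValueError("need at least one smalt entry")
    return dict(entries)
```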
-------------------------------
fastq-in
or
fasta-in
This should come first. The following line contains the filename, e.g.
fastq-in
renamed.fastq
Make sure that this is the file with the contigs names corresponding to
the smalt results.
-------------------------------
smalt
Next should come lines giving the file and parameters for each smalt input file.
e.g.
smalt # <file_name> <mean_insert> <std_dev> <weight> <read_length> <orientation=in|out>
mates-2k.align 1400 242 100 75 in
smalt
mates-3k.align 2817 396 60 50 in
smalt
mates-4k.align 4161 795 25 76 in
the general format is
smalt # comment
<filename> <mean insert> <insert standard deviation> <weight> <read length>
<orientation= in|out>
If the insert stats are not known, use estimate-insert to find them.
The weight weights edges from this library relative to the others when making
quick estimates of gap sizes. Something roughly inversely proportional to the
standard deviation is a good value unless the library is considered (un)reliable
for some other reason.
Read length is easily found. This affects cross-biotin calling. If length varies
across the library, set this to 1000 as the check is turned off for long reads.
The orientation is the direction that read pairs point, "in" meaning towards one another.
Any number of libraries can be specified.
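As an illustration of the six-field parameter line and of the "roughly inversely proportional" weight advice, here is a Python sketch. The class and function names are hypothetical, and scaling the weights so that the tightest library gets 100 is just one possible convention, not something spinner requires.

```python
from dataclasses import dataclass


@dataclass
class SmaltLibrary:
    filename: str
    mean_insert: float
    std_dev: float
    weight: float
    read_length: int
    orientation: str  # "in" or "out"


def parse_smalt_line(line):
    """Parse '<filename> <mean> <std dev> <weight> <read length> <in|out>'."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError("expected 6 whitespace-separated fields")
    filename, mean, std_dev, weight, read_length, orientation = fields
    if orientation not in ("in", "out"):
        raise ValueError("orientation must be 'in' or 'out'")
    return SmaltLibrary(filename, float(mean), float(std_dev),
                        float(weight), int(read_length), orientation)


def suggest_weights(std_devs, top=100.0):
    """Weights inversely proportional to std dev; the smallest std dev gets `top`."""
    smallest = min(std_devs)
    return [round(top * smallest / sd) for sd in std_devs]
```

For the three example libraries above (standard deviations 242, 396 and 795) this suggests weights close to the 100, 60 and 25 used in the example.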
-------------------------------
fasta-out or fastq-out
Optional.
Specifies the output file name e.g.
fasta-out
spinner.fasta
Obviously, fastq can only be output if it was input. Both fasta and fastq can be output
at once.
-------------------------------
contigs-out
Optional.
Gives a record of the original contigs now in scaffolds, e.g.
contigs-out
contigs.dat
The format of this file is as follows e.g.
supercontig supercontigs_000000567 len 35192822junction 0<-->0
contig supercontigs_000000567 len 81669 reads/bp 0.009697
gap 450 517 81669 1
This last line gives the gap size, the number of supporting pairs, the position
(relative to the start of the scaffold) of the end of the preceding contig, and
the stage at which the join was made (see later).
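A small Python sketch for reading the gap lines of a contigs-out file, following the four fields described above. The record type and function name are hypothetical, and the sketch assumes all four values are integers, as in the example line.

```python
from typing import NamedTuple


class GapRecord(NamedTuple):
    gap_size: int          # estimated gap size
    supporting_pairs: int  # number of read pairs supporting the join
    prev_contig_end: int   # end of the preceding contig, from scaffold start
    stage: int             # stage at which the join was made


def parse_gap_line(line):
    """Parse a line such as 'gap 450 517 81669 1'."""
    fields = line.split()
    if len(fields) != 5 or fields[0] != "gap":
        raise ValueError("not a gap line")
    return GapRecord(*(int(value) for value in fields[1:]))
```

Counting records by their stage field gives a quick view of how many joins each stage contributed.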
-------------------------------
GDF-out
Optional.
Gives graph file output suitable for e.g. gephi. (Might be useful in that it lists all
remaining edges). Ignores scaffolds smaller than 500bp. E.g.
GDF-out
alt.gdf
--------------------------------
number-in-contig-name
Do not use this option unless there is a very good reason to preserve the
original names for contigs. Include it at the head of the file (before even the
fasta/q input). If original contig names are desired, this needs to be set. If
there is a unique number in each contig name at the same position, give that
position. E.g.
myscaf_0001
myscaf_0005
myscaf_0006
etc. would need
number-in-contig-name
8
while f001 etc. would need the number 2. If there is no unique number, -1 can be
specified, but this affects performance and it is recommended that contigs are
renamed with rename-contigs.pl instead, which stores the original names in any case.
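The value to give can be worked out from the contig names with a short Python sketch. It assumes, from the two examples above, that the position is the 1-based index of the first digit of the number; the function name is hypothetical.

```python
import re


def number_position(names):
    """Return the shared 1-based position of the number, or -1 if none fits."""
    positions = set()
    numbers = set()
    for name in names:
        match = re.search(r"\d+", name)
        if match is None:
            return -1                       # a name with no number at all
        positions.add(match.start() + 1)    # convert to 1-based
        numbers.add(int(match.group()))
    # The number must sit at one position and be unique across contigs.
    if len(positions) != 1 or len(numbers) != len(names):
        return -1
    return positions.pop()
```

A result of -1 corresponds to the slower fallback described above, in which case renaming the contigs is the better route.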
+++++++++++ settings file +++++++++++++++++++++++++++
The settings file contains the parameters for the run.
See the example settings.txt
settings.txt controls spinner's parameters.
Each "stage" takes the results so far and runs
spinner's scaffolding routines with the given
parameters (without remapping). The idea is
to progress to more aggressive settings after
more obvious joins have been made, as previous
joins reduce the complexity of the graph.
------------------------------------------------------
The most basic command is
STAGE
Which begins the parameters for each stage (helpful to accompany with
a comment giving the number of stages so far).
All "required" stage parameters are required for first stage and when
changed. "Optional" have a default value but, like required fields,
preserve any value set through the stages unless explicitly changed.
The order has no significance.
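The way values persist from one stage to the next can be sketched in Python as follows. This is an illustration, not spinner's parser: the function name is hypothetical, '#' is treated as starting a comment, values are kept as strings, and built-in defaults for optional parameters are not modelled.

```python
def parse_stages(text):
    """Return one dict per STAGE holding the parameters in effect for it."""
    stages = []
    carried = {}      # values set so far, inherited by later stages
    started = False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # strip comments and blanks
        if not line:
            continue
        if line == "STAGE":
            if started:
                stages.append(dict(carried))  # close the previous stage
            started = True
            continue
        if not started:
            raise ValueError("parameter given before the first STAGE")
        name, value = line.split(None, 1)
        carried[name] = value                 # overrides any earlier value
    if started:
        stages.append(dict(carried))
    return stages
```

Printing the result for a settings file shows the full parameter set each stage will actually run with, which is easy to lose track of when later stages only list what changes.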
The two most important parameters are given here first:
PAIR_THRESHOLD <number>
Required: number of pairs linking two scaffolds needed to
consider the connection ("edge") as a candidate to join (although
weaker links can sometimes still block others from joining).
Smaller numbers are more aggressive. Required for first stage and when
changed.
STRENGTH_TEST_RATIO <number>
Required: when two edges conflict, if the ratio of the
number of supporting pairs is less than this value, the weaker edge
is discarded. Larger values are more aggressive. Required for
first stage and when changed.
EXCLUDE_SMALL_CONTIGS <number | OFF >
Required: ignores scaffolds of length smaller than the parameter for
all scaffolding purposes at this stage. Comment out for no cutoff.
PREVENT_TWO_SMALL_SCAFFOLDS_JOINING <number | OFF >
Required: Joins between two scaffolds both smaller than this value are
stopped (although they may still prevent other joins to the
same scaffolds).
FIND_SIGMAS <number, default = 5>
MIN_UNCERTAINTY <number, default = 100>
Optional: When estimating relative positions of scaffolds,
an overlap is called when relative positions +- an error estimate
multiplied by FIND_SIGMAS, or MIN_UNCERTAINTY, whichever is smallest.
More overlaps mean more conflicts, preventing joins, so higher values
are more aggressive. Higher values may help for bad quality libraries.
Required for first stage and when changed.
JUMP_LENGTH <number | OFF default = OFF>
Optional: when looking for the next scaffold to join to scaffold X,
spinner estimates the position of the scaffold relative to X. With
this option, if no unconflicted "closest" edge can be found the usual way,
scaffolds which *entirely* fall into an interval of this size next to
X are ignored allowing the scaffolding to "jump over" short-range
conflicts. Note that this has a different effect from excluding short
libraries -- a short gap to a long contig would not be ignored.
This is an aggressive option that should not be applied in the
first scaffolding attempt and should be preceded by less aggressive stages.
As with required parameters, it stays set for remaining stages unless changed.
Problems occur when erroneous edges are no longer blocked by a true edge;
this parameter should not be much longer than the longest insert.
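To make the STRENGTH_TEST_RATIO rule described earlier concrete, here is a Python sketch. It assumes the ratio is the weaker edge's pair count divided by the stronger edge's, which is consistent with 0.0 leaving every conflict unresolved and larger values being more aggressive; the function name is hypothetical and this is not spinner's code.

```python
def strength_test(pairs_a, pairs_b, strength_test_ratio):
    """Return the pair count of the surviving edge, or None if unresolved."""
    weaker, stronger = sorted((pairs_a, pairs_b))
    if weaker == stronger:
        return None                      # no edge is clearly weaker
    if weaker / stronger < strength_test_ratio:
        return stronger                  # weaker edge is discarded
    return None                          # conflict stays, blocking the join
```

With a ratio of 0.5, an edge with 10 pairs loses to one with 30, but an edge with 20 pairs does not, so that conflict remains.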
-----------------------------------------------
Advanced options
The following optional parameters have default values and are unlikely to be
needed by most users.
GOOD_GAP_ESTIMATES <ON|OFF default =ON>
In the scaffolding process, gaps are estimated using a simple average.
Once a join is made, a maximum likelihood estimate against the real distribution
of inserts (calculated from same-contig pairs) or a normal distribution is used.
Setting this to OFF turns off the MLE, which saves time at the cost of good final
gap estimates (this can mildly affect overall length results too).
VERBOSE <ON|OFF default = OFF>
When turned on this produces a long stream of output that "explains"
every link made. Search for a contig name to see when it was involved
in join decisions.
CHIMERA_SIGMAS <number, default=5>
Pairs which imply a very negative gap size are probably chimeras
(or else there is something unusual going on to do with heterozygosity
or misassembly). If the pair implies a gap size more negative than
this number times the library's standard deviation, it is considered a bad
pair and not counted in gap estimation, strength test etc.
MATCH_SCORE_THRESHOLD <number default=15>
Spinner rejects reads with a smalt match score less than this. This gets rid of
matches that could have been in more than one place.
CROSS_BIOTIN_PARAMETER <number default = 20>
Spinner rejects a read pair if one has a near perfect SW score and the other
misses the perfect score by this parameter or more. These can often be
cross-biotin pairs. 20 does not reject many for Illumina data.
DIRECTION_TEST_RATIO <number default =3>
When edges between two scaffolds have conflicting direction properties
they are discarded if the strongest one is this much stronger than the next
strongest.
WEAK_EDGE_STRENGTH_RATIO <number default= max (strength ratio, lowest pair threshold / current threshold)>
Conflicts can involve edges weaker than the threshold strength, as explained
above. The strength ratio applied in this case is set as a function of
the minimum threshold used and the usual ratio, unless overridden.
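Two of the rules above are simple enough to write out as arithmetic. The Python sketch below shows the CHIMERA_SIGMAS cutoff and the default for WEAK_EDGE_STRENGTH_RATIO as stated in its signature line. The function names are hypothetical, and reading "lowest pair threshold" as the smallest PAIR_THRESHOLD used in any stage is an assumption.

```python
def is_chimeric(implied_gap, library_std_dev, chimera_sigmas=5.0):
    """True if the implied gap is more negative than sigmas * std dev."""
    return implied_gap < -chimera_sigmas * library_std_dev


def default_weak_edge_ratio(strength_ratio, lowest_pair_threshold,
                            current_pair_threshold):
    """max(strength ratio, lowest pair threshold / current threshold)."""
    return max(strength_ratio,
               lowest_pair_threshold / current_pair_threshold)
```

For a library with standard deviation 242 and the default of 5 sigmas, pairs implying a gap below -1210 would be treated as bad pairs.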
#################################
STAGE
PAIR_THRESHOLD 30
STRENGTH_TEST_RATIO 0.0
EXCLUDE_SMALL_CONTIGS 3000
PREVENT_TWO_SMALL_SCAFFOLDS_JOINING 5000
# here we choose a high pair threshold, exclude a lot of small contigs, and
# turn off the strength test so that all conflicts are left unresolved.
#################################
STAGE
EXCLUDE_SMALL_CONTIGS 500
# perhaps we missed some joins by excluding small contigs.
#################################
STAGE
PAIR_THRESHOLD 15
EXCLUDE_SMALL_CONTIGS 3000
#################################
STAGE
EXCLUDE_SMALL_CONTIGS 500
#same again with lower threshold.
#################################
STAGE
PAIR_THRESHOLD 30
STRENGTH_TEST_RATIO 0.5
# now turn on strength ratio for some more aggressive stages.
#################################
STAGE
PAIR_THRESHOLD 15
#################################
STAGE
PAIR_THRESHOLD 8
#################################
STAGE
PAIR_THRESHOLD 6
#################################
STAGE
PAIR_THRESHOLD 8
JUMP_LENGTH 2000
# jump length can now move on to favour long-insert joins
# that may be being blocked by erroneous short connections.
#################################
STAGE
JUMP_LENGTH 4000
#################################
STAGE
JUMP_LENGTH 10000
#################################
STAGE
JUMP_LENGTH 20000
#################################
STAGE
JUMP_LENGTH OFF
PREVENT_TWO_SMALL_SCAFFOLDS_JOINING OFF
# add more stages if it seems to help
#################################
STAGE
PAIR_THRESHOLD 6
JUMP_LENGTH 4000
#################################