spinner 1.0

A scaffolding tool for de novo assembly.

spinner -f <files file name = files.txt > -s <settings file name = settings.txt>
The files file contains details of the smalt output, fasta/fastq files and output names.
The default names are files.txt and settings.txt.

General Instructions:

(1)
You should start with a fasta or fastq file containing contigs/scaffolds from a
previous stage of assembly, and one or more libraries with pairs of reads.

(2)
It is recommended that the contigs are renamed using rename-contigs.pl

rename-contigs <in> <out> <file to store old names against new names>

Keeping the old names is possible but slows spinner (see number-in-contig-name below).

(3)
Align the pairs against the renamed fastq/fasta file using smalt.
For each library, concatenate all results into one file.
This only needs to be done once to scaffold the same data many times.

(4) Write/edit "files file" and "settings file" as described below

(5) Run spinner, e.g.

<dir>/spinner

if your files have the default names or

<dir>/spinner -f name1.txt -s name2.txt

(6) For better results, repeat from stage (4) with altered parameters.
Look at spinner output for lines starting with DoScaffolding and make sure that
each stage is reducing the number of scaffolds.  Otherwise remove or alter that
stage in the settings file.
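
A minimal sketch for the check in step (6), assuming spinner's output has been
saved to a text file.  Only the "DoScaffolding" line prefix comes from this
README; the helper name and the log file name are hypothetical, and nothing is
assumed about the rest of each line.

```python
from typing import Iterable, List


def doscaffolding_lines(lines: Iterable[str]) -> List[str]:
    """Return the lines that start with 'DoScaffolding', without trailing newlines."""
    return [line.rstrip("\n") for line in lines if line.startswith("DoScaffolding")]


if __name__ == "__main__":
    import sys

    # Hypothetical usage: python check_stages.py spinner.log
    if len(sys.argv) > 1:
        with open(sys.argv[1]) as handle:
            for line in doscaffolding_lines(handle):
                print(line)
```

Reading the printed lines in order shows whether each stage still reduces the
number of scaffolds.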


+++++++++++ files file +++++++++++++++++++++++++++

The files file is a set of pairs of lines, the first of which specifies the file
or option and the second of which specifies the corresponding files/parameters.

The allowed line types are

fastq-in or fasta-in (must be 1 and only 1 of these)
smalt  (must be >=1 of these)
fasta-out
fastq-out
contigs-out
GDF-out
number-in-contig-name (best to leave this out)

See the example files.txt
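
The rules above can be checked before running spinner.  The following is a
rough sketch, not part of spinner: the function names are hypothetical, and it
assumes that blank lines are ignored and that anything after the line type on
the first line of a pair (such as "# comment") can be skipped.

```python
from typing import List, Tuple

ALLOWED = {
    "fastq-in", "fasta-in", "smalt", "fasta-out", "fastq-out",
    "contigs-out", "GDF-out", "number-in-contig-name",
}


def read_pairs(text: str) -> List[Tuple[str, str]]:
    """Split files-file text into (line type, value line) pairs."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) % 2 != 0:
        raise ValueError("files file must contain pairs of lines")
    pairs = []
    for i in range(0, len(lines), 2):
        # The line type is the first token; the rest of that line is skipped.
        key = lines[i].split()[0]
        if key not in ALLOWED:
            raise ValueError(f"unknown line type: {key}")
        pairs.append((key, lines[i + 1]))
    return pairs


def check_files_file(text: str) -> List[Tuple[str, str]]:
    """Check the 'exactly one input' and 'at least one smalt' rules."""
    pairs = read_pairs(text)
    keys = [key for key, _ in pairs]
    if keys.count("fastq-in") + keys.count("fasta-in") != 1:
        raise ValueError("exactly one of fastq-in / fasta-in is required")
    if keys.count("smalt") < 1:
        raise ValueError("at least one smalt entry is required")
    return pairs
```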

-------------------------------
fastq-in
or 
fasta-in

This should come first.  The following line contains the filename e.g.

fastq-in
renamed.fastq

Make sure that this is the file with the contigs names corresponding to
the smalt results.

-------------------------------
smalt

Next should come lines giving the files and parameters for the smalt input files.

e.g.

smalt # <file_name> <mean_insert> <std_dev> <weight> <read_length> <orientation=in|out>
mates-2k.align 1400 242 100 75 in
smalt
mates-3k.align 2817 396 60 50 in
smalt
mates-4k.align 4161 795 25 76 in

The general format is

smalt # comment
<filename> <mean insert> <insert standard deviation> <weight> <read length> <orientation= in|out>

If the insert stats are not known, use estimate-insert to find them.
The weight weights edges from this library relative to the others when making
quick estimates of gap sizes.  Something roughly inversely proportional to the
standard deviation is a good value unless the library is considered (un)reliable
for some other reason.
Read length is easily found.  This affects cross-biotin calling.  If the length
varies across the library, set this to 1000, as the check is turned off for long
reads.
The orientation is the direction that read pairs point, "in" meaning towards one
another.
Any number of libraries can be specified.
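
As a starting point for the weight column, the "roughly inversely proportional
to the standard deviation" advice can be sketched as below.  The function name
and the choice to scale the tightest library to 100 are assumptions made for
illustration; spinner does not prescribe a scale, and the example weights
above (100, 60, 25) were evidently chosen by hand.

```python
from typing import List


def suggest_weights(std_devs: List[float], top_weight: int = 100) -> List[int]:
    """Weights inversely proportional to the insert standard deviation.

    The library with the smallest standard deviation gets top_weight.
    """
    if not std_devs or min(std_devs) <= 0:
        raise ValueError("standard deviations must be positive")
    smallest = min(std_devs)
    return [round(top_weight * smallest / sd) for sd in std_devs]


# Standard deviations of the three example libraries above.
print(suggest_weights([242, 396, 795]))  # [100, 61, 30]
```

Adjust the suggested values by hand if a library is known to be more or less
reliable than its standard deviation implies.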

-------------------------------
fasta-out or fastq-out

Optional.
Specifies the output file name e.g.
fasta-out
spinner.fasta

Obviously, fastq can only be output if it was input.  Both fasta and fastq can be output 
at once.

-------------------------------
contigs-out

Optional.
Gives a record of the original contigs now in scaffolds, e.g.

contigs-out
contigs.dat

The format of this file is as follows e.g.

supercontig supercontigs_000000567 len 35192822junction  0<-->0
contig supercontigs_000000567 len 81669 reads/bp 0.009697
gap 450 517 81669 1
The gap line gives the gap size, the number of supporting pairs, the position
relative to the start of the scaffold of the end of the preceding contig, and
the stage at which the join was made (see later).
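
A small sketch for reading gap lines from the contigs-out file.  The field
order follows the description above; the record and function names are
hypothetical, and the sketch assumes all four fields are integers.

```python
from typing import NamedTuple


class GapRecord(NamedTuple):
    size: int             # estimated gap size
    pairs: int            # number of supporting read pairs
    prev_contig_end: int  # end of the preceding contig, from the scaffold start
    stage: int            # stage at which the join was made


def parse_gap_line(line: str) -> GapRecord:
    """Parse one 'gap <size> <pairs> <position> <stage>' line."""
    fields = line.split()
    if len(fields) != 5 or fields[0] != "gap":
        raise ValueError(f"not a gap line: {line!r}")
    return GapRecord(*(int(x) for x in fields[1:]))
```

Grouping the parsed records by stage shows how many joins each stage of the
settings file contributed.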


-------------------------------
GDF-out

Optional.
Gives graph file output suitable for e.g. gephi. (Might be useful in that it lists all
remaining edges).  Ignores scaffolds smaller than 500bp. E.g.

GDF-out
alt.gdf

--------------------------------
number-in-contig-name

Do not use this option unless there is a very good reason to preserve original names for contigs.  Include at the head of the file (before even the fasta/q input).  If original contig names are desired, this needs to be set.  If there is a unique number in each contig name at the same position, give that position.  E.g.
myscaf_0001
myscaf_0005
myscaf_0006
 etc. would need

number-in-contig-name
8

while f001 etc. would need the number 2.  If there is no unique number, -1 can be specified, but this affects performance, and it is recommended that contigs are renamed with rename-contigs.pl instead, which stores the original names in any case.
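
The value to use can be worked out from the contig names.  The sketch below
assumes the position is the 1-based index of the first digit in the name, which
is inferred from the two examples above (8 for myscaf_0001, 2 for f001); the
function name is hypothetical and this is not how spinner itself reads names.

```python
from typing import List


def number_position(names: List[str]) -> int:
    """Return the shared 1-based position of the first digit, or -1.

    -1 is returned when a name has no digit, when the position differs
    between names, or when the numbers found are not unique.
    """
    positions = set()
    numbers = set()
    for name in names:
        idx = next((i for i, ch in enumerate(name) if ch.isdigit()), None)
        if idx is None:
            return -1
        end = idx
        while end < len(name) and name[end].isdigit():
            end += 1
        positions.add(idx + 1)
        numbers.add(int(name[idx:end]))
    if len(positions) != 1 or len(numbers) != len(names):
        return -1
    return positions.pop()
```

Names that contain an earlier digit (for example a version number in the
prefix) would defeat this heuristic, so check the result by eye.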


+++++++++++ settings file +++++++++++++++++++++++++++

The settings file (default name settings.txt) contains the parameters for the run.
See the example settings.txt.

Each "stage" takes the results so far and runs
spinner's scaffolding routines with the given
parameters (without remapping).  The idea is
to progress to more aggressive settings after
the more obvious joins have been made, as previous
joins reduce the complexity of the graph.


------------------------------------------------------
The most basic command is

STAGE

Which begins the parameters for each stage (helpful to accompany with
a comment giving the number of stages so far).

All "required" stage parameters are required for first stage and when
changed.  "Optional" have a default value but, like required fields,
preserve any value set through the stages unless explicitly changed.
The order has no significance.
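
Because values carry over from stage to stage, it can be hard to see which
parameters are in force at a given stage.  The following sketch expands a
settings file into the effective parameters per stage.  It is an illustration
only: the function name is hypothetical, "#" is assumed to start a comment
anywhere on a line, and no defaults or validation are applied.

```python
from typing import Dict, List


def effective_stage_settings(text: str) -> List[Dict[str, str]]:
    """Return one dict per STAGE holding every parameter set so far."""
    stages: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    started = False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        if line == "STAGE":
            if started:
                stages.append(dict(current))  # snapshot the finished stage
            started = True
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            raise ValueError(f"expected '<PARAMETER> <value>': {line!r}")
        current[parts[0]] = parts[1]
    if started:
        stages.append(dict(current))
    return stages
```

For a file whose first stage sets PAIR_THRESHOLD 30 and EXCLUDE_SMALL_CONTIGS
3000 and whose second stage sets only EXCLUDE_SMALL_CONTIGS 500, the second
dict still contains PAIR_THRESHOLD 30.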

The two most important parameters are given here first:

PAIR_THRESHOLD <number>

 Required: number of pairs linking two scaffolds needed to
 consider the connection ("edge") as a candidate to join (although
 weaker links can sometimes still block others from joining).
 Smaller numbers are more aggressive.  Required for first stage and when
 changed.

STRENGTH_TEST_RATIO <number>

 Required: when two edges conflict, if the ratio of the
 number of supporting pairs is less than this value, the weaker edge
 is discarded.  Larger values are more aggressive.  Required for
 first stage and when changed.
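
One way to read the strength test is sketched below.  Taking the ratio as
weaker/stronger is an interpretation, chosen because it agrees with a value of
0.0 leaving all conflicts unresolved and with larger values being more
aggressive; the function name is hypothetical and spinner's actual test may
differ in detail.

```python
def weaker_edge_discarded(pairs_a: int, pairs_b: int,
                          strength_test_ratio: float) -> bool:
    """True if the weaker of two conflicting edges would be discarded."""
    stronger, weaker = max(pairs_a, pairs_b), min(pairs_a, pairs_b)
    if stronger <= 0:
        return False  # no supporting pairs at all, nothing to compare
    return weaker / stronger < strength_test_ratio
```

With a ratio of 0.5, an edge supported by 40 pairs loses to one supported by
100 pairs, while an edge with 60 pairs leaves the conflict unresolved.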

EXCLUDE_SMALL_CONTIGS <number | OFF >

 Required: ignores scaffolds of length smaller than the parameter for
 all scaffolding purposes at this stage.  Comment out for no cutoff.

PREVENT_TWO_SMALL_SCAFFOLDS_JOINING <number | OFF >

 Required: Joins between two scaffolds both smaller than this value are 
 stopped (although they may still prevent other joins to the 
 same scaffolds).

FIND_SIGMAS <number, default = 5>
MIN_UNCERTAINTY <number, default = 100>

 Optional: When estimating relative positions of scaffolds,
 an overlap is called when relative positions +- an error estimate
 multiplied by FIND_SIGMAS, or MIN_UNCERTAINTY, whichever is smallest.
 More overlaps mean more conflicts, preventing joins, so higher values
 are more aggressive.  Higher values may help for bad quality libraries.
 Required for first stage and when changed.


JUMP_LENGTH <number | OFF default = OFF>

 Optional:  when looking for the next scaffold to join to scaffold X,
 spinner estimates the position of the scaffold relative to X. With
 this option, if no unconflicted "closest" edge can be found the usual way,
 scaffolds which *entirely* fall into an interval of this size next to
 X are ignored allowing the scaffolding to "jump over" short-range
 conflicts.  Note that this has a different effect from excluding short
 libraries -- a short gap to a long contig would not be ignored.
 This is an aggressive option that should not be applied in the
 first scaffolding attempt and should be preceded by less aggressive stages.
 As with required parameters, it stays set for remaining stages unless changed.
 Problems occur when erroneous edges are no longer blocked by a true edge;
 this parameter should not be much longer than the longest insert.

-----------------------------------------------
Advanced options
 The following optional parameters have default values and are unlikely to be
 needed by most users.

GOOD_GAP_ESTIMATES <ON|OFF default =ON>

 In the scaffolding process, gaps are estimated using a simple average.
 Once a join is made, a maximum likelihood estimate against the real distribution
 of inserts (calculated from same-contig pairs) or a normal distribution is used.
 Setting this to OFF turns off the MLE, which saves time at the cost of good final
 gap estimates (this can mildly affect overall length results too).

VERBOSE <ON|OFF default = OFF>

 When turned on this produces a long stream of output that "explains"
 every link made.  Search for a contig name to see when it was involved
 in join decisions.

CHIMERA_SIGMAS <number, default=5>

 Pairs which imply a very negative gap size are probably chimeras
 (or else there is something unusual going on to do with heterozygosity
 or misassembly). If the pair implies a gap size more negative than
 this number times the library's standard deviation, it is considered a bad
 pair and not counted in gap estimation, strength test etc.
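
The chimera rule amounts to a single comparison, sketched here with a
hypothetical function name.  How spinner derives the implied gap for a pair is
not covered; the value is simply taken as an input.

```python
def is_probable_chimera(implied_gap: float, library_std_dev: float,
                        chimera_sigmas: float = 5.0) -> bool:
    """True if the implied gap is more negative than -sigmas * std dev."""
    return implied_gap < -chimera_sigmas * library_std_dev
```

For a library with a standard deviation of 242 and the default of 5, the cutoff
is -1210, so a pair implying a gap of -1500 is treated as bad while one
implying -1000 is kept.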

MATCH_SCORE_THRESHOLD <number default=15>

 Spinner rejects reads with a smalt match score less than this.  This gets rid of
 matches that could have been in more than one place.


CROSS_BIOTIN_PARAMETER <number default = 20>

 Spinner rejects a read pair if one has a near perfect SW score and the other
 misses the perfect score by this parameter or more.  These can often be
 cross-biotin pairs.  20 does not reject many for Illumina data.
 
DIRECTION_TEST_RATIO <number default =3>

 When edges between two scaffolds have conflicting direction properties
 they are discarded if the strongest one is this much stronger than the next
 strongest.

WEAK_EDGE_STRENGTH_RATIO <number default= max (strength ratio, lowest pair threshold / current threshold)>
 
 Conflicts can involve edges weaker than the threshold strength, as explained
 above.  The strength ratio applied in this case is set as a function of
 the minimum threshold used and the usual ratio, unless overridden.


#################################
STAGE
PAIR_THRESHOLD 30
STRENGTH_TEST_RATIO 0.0
EXCLUDE_SMALL_CONTIGS 3000
PREVENT_TWO_SMALL_SCAFFOLDS_JOINING 5000

# here we choose a high pair threshold, exclude a lot of small contigs, and
# turn off the strength test so that all conflicts are left unresolved.
#################################
STAGE
EXCLUDE_SMALL_CONTIGS 500

# perhaps we missed some joins by excluding small contigs.
#################################
STAGE
PAIR_THRESHOLD 15
EXCLUDE_SMALL_CONTIGS 3000

#################################
STAGE
EXCLUDE_SMALL_CONTIGS 500

#same again with lower threshold.
#################################
STAGE
PAIR_THRESHOLD 30
STRENGTH_TEST_RATIO 0.5

# now turn on strength ratio for some more aggressive stages.
#################################
STAGE
PAIR_THRESHOLD 15

#################################
STAGE
PAIR_THRESHOLD 8

#################################
STAGE
PAIR_THRESHOLD 6

#################################
STAGE
PAIR_THRESHOLD 8
JUMP_LENGTH 2000

# jump length can now move on to favour long-insert joins 
# that may be being blocked by erroneous short connections.
#################################
STAGE
JUMP_LENGTH 4000

#################################
STAGE
JUMP_LENGTH 10000

#################################
STAGE
JUMP_LENGTH 20000

#################################
STAGE
JUMP_LENGTH OFF
PREVENT_TWO_SMALL_SCAFFOLDS_JOINING OFF
# add more stages if it seems to help
#################################
STAGE
PAIR_THRESHOLD 6
JUMP_LENGTH 4000
#################################

Source: spinner.README, updated 2014-08-22