Download Latest Version scaff10x-v4.1.tar.gz (3.0 MB)
Email in envelope

Get an email when there's a new version of Phusion2

Home / scaff10x
Name Modified Size InfoDownloads / Week
Parent folder
README 2019-04-02 10.4 kB
scaff10x-v4.1-update.pptx 2019-04-02 2.1 MB
scaff10x-v4.1.tar.gz 2019-04-02 3.0 MB
scaff10x-v3.1.tar.gz 2018-09-21 1.7 MB
scaff10x-v2.1.tar.gz 2018-03-12 1.6 MB
break10x-methods.pdf 2018-01-29 5.0 MB
scaff10x-v1.1.tar.gz 2017-11-13 1.3 MB
scaff10x-methods.pdf 2017-10-17 3.3 MB
scaff10x.tar.gz 2017-10-17 1.4 MB
Totals: 9 Items   19.6 MB 0
=============
Scaff10X v4.1
=============
Pipeline for scaffolding and breaking a genome assembly using 10x genomics linked-reads.

Pipeline steps:
        
    Scaffolding with scaff10x:
      1 Barcoded tags are extracted from 10Xg raw sequencing reads and appended 
          to read names for further processing
      2 The reads are mapped to the draft assembly using either BWA or SMALT
      3 Barcodes are sorted together with contigs as well as mapping coordinates
      4 A relation matrix is built to record the shared barcodes among the contigs which may be linked
      5 Order and orientation of linked contigs are determined after nearest neighbours are found. 
      
    Breaking scaffolds (or contigs) with break10x:
      Steps 1, 2 and 3 are the same as for scaff10x
      4 The barcode depth is monitored for each scaffold, if this is reduced by more than 15% with 
        respect to the scaffold average barcode depth, the point of minimum depth is considered to 
                be an assembly error and the scaffold is broken at that point. Exception are points close to 
                long N-gaps unless the N-Gaps was added by scaff10x.
      5 The location of breaking points are printed out in a file and the broken scaffolds are 
        printed in a fasta file
      
=====================
Download and Compile:
=====================
Requirements for compiling: gcc gcc-4.9.2 or late:

If you see this message,
cc1: error: unrecognised command line option ‘-std=c11’
make: *** [break10x.o] Error 1

you need a higher version of gcc
CC= /software/gcc-4.9.2/bin/gcc in the makefile


    $ git clone  https://github.com/wtsi-hpag/Scaff10X.git 
    $ cd Scaff10X
    $ ./install.sh
                
If everything compiled successfully you must see the final comment: 
                "Congrats: installation successful!"            

=================
External packages
=================
The genome aligner BWA (http://bio-bwa.sourceforge.net) and SMALT (http://www.sanger.ac.uk/science/tools/smalt-0) are downloaded and compiled by Scaff10X.

=================
Run the pipelines
=================

 Run scaff10x:
           $ /full/path/to/Scaff10X/src/scaff10X -nodes <nodes> -align <aligner> -score <score> \
                 -matrix <matrix_size> -read-s1 <min_reads_s1> -read-s2 <min_reads_s2> -longread <aggressive> -gap <gap_size> \
                 -edge <edge_len> -link-s1 <n_links_s1> -link-s2 <n_links_s2> -block <block>  \
                 [ -data input.dat ] [ -sam input.sam ] [ -bam input.bam ] \
                 draft-asssembly.fasta output_scaffolds.fasta
           

 Parameters:
             nodes:        number of CPUs requested  [ default = 30 ]
             score:        averaged mapping score on each barcode fragment [ default = 20 ]
             aligner:      sequence aligner: bwa or smalt [ default = bwa ]
             matrix_size:  relation matrix size [ default = 2000 ]
             min_reads_s1: step 1: minimum number of reads per barcode [ default = 10 ]
             min_reads_s2: step 2: minimum number of reads per barcode [ default = 10 ]
             edge_len:     length of mapped reads to consider for scaffolding [ default = 50000 ]
             n_links_s1:   step 1: minimum number of shared barcodes [ default = 8 ]
             n_links_s2:   step 2: minimum number of shared barcodes [ default = 8 ]
             aggressive:   1 - aggressively mapping filtering on small PacBio/ONT contigs; 
                           0 - no aggressive for short read assembly  [ default = 1 ]
             block:        length to determine for nearest neighbours [ default = 50000 ]
             gap:          gap size in building scaffold [ default = 100 ]
             
 Files
      ==========
      input.dat:   input a text file to point the locations of the reads in paired files

q1=/lustre/scratch116/vr/projects/Tes1_S1_L008_R1_001.fastq.gz \
q2=/lustre/scratch116/vr/projects/Tes1_S1_L008_R2_001.fastq.gz \
q1=/lustre/scratch116/vr/projects/Tes1_S2_L008_R1_001.fastq.gz \
q2=/lustre/scratch116/vr/projects/Tes1_S2_L008_R2_001.fastq.gz \
q1=/lustre/scratch116/vr/projects/Tes1_S3_L008_R1_001.fastq.gz \
q2=/lustre/scratch116/vr/projects/Tes1_S3_L008_R2_001.fastq.gz \
q1=/lustre/scratch116/vr/projects/Tes1_S4_L008_R1_001.fastq.gz \
q2=/lustre/scratch116/vr/projects/Tes1_S4_L008_R2_001.fastq.gz \
 
 The scaff10x pipeline will read the gzipped files, trim the barcodes and pipe to bwa for alignment 
 There will be no sam file anymore in this way. The input.dat can be local or with full path

      ==========
      input.sam:   input a sam file if it already exists, 
                   and skip the mapping (Optional, please provde full path)

      ==========
      input.bam:   input a bam file which had been prodcued by using lariat in longranger, 
                   and skip the mapping (Optional, please provde full path)
        (a). rename the assembly file (Optional):
           $ /full/path/to/Scaff10X/src/scaff-bin/scaff_rename your_assembly.fa longrang_refasm.fa
        (b). generate reference assembly file using longranger
           $ longranger mkref longrang_refasm.fa 
        (c). align 10x reads using lariat
           $ longranger align --fastq="reads_10x" --sample=fTakRub1 --reference="longrang_refasm" --localcores=50 --id=10x-align 
           Note: please provde full path
        (d). run scaff10x 
           $ scaff10X -bam /lustre/scratch117/possorted_bam.bam draft-asssembly.fasta output_scaffolds.fasta 
             draft-asssembly.fasta:   initial draft assembly to scaffold (full path or local)
            output_scaffolds.fasta:   name for the output scaffolded assembly (local)

===========================
Some notes and suggestions:
===========================
            
        a. SMALT is notably slower than BWA. So we suggest to try BWA first;
        b. The block value is very important. The default value of 2500 is very conservative
           and you may increase this value to say 5000 or 10000 to improve the length of scaffolds; 
        c. The default numbers of -reads and -link are based on 30X read depth. 
           These values should be increased if the read deapth is higher
        d. To get the best results we suggest an iterative approach, 
           where scaff10x scaffolds (output_scaffolds.fasta) are scaffolded again using scaff10x.
        e. Alignments with mapping score < score are filtered out to reduce linking errors;
        f. By using the option of "-longread 1", the pipeline performs an aggressive 
            mapping score filtering on small PacBio/ONT contigs.  


 Remember: you only need to run scaff10x once (previously we suggested two iterations)

=============
Run break10x:
=============
           
        $ /full/path/to/Scaff10X/src/break10x -nodes <nodes>  -score <score> -reads <min_reads> \
          -gap <gap_size> -cover <cover> -ratio <ratio> -data <input.dat>\
          scaffolds.fasta scaffolds-break.fasta scaffolds-break.name           
           

    Parameters:
             input.dat:  input a text file to point the locations of the reads in paired files, see the file format for scaff10x
             nodes:      number of CPUs requested  [ default = 30 ]
             score:      minimum average mapping score on an area covered by reads with 
                         the same barcode [ default = 20 ]
             min_reads:  minimum number of reads per barcode [ default = 5 ]
                 cover: minimum barcode coverage at the breakpoint [ default = 50 ]
                 gap:  scaffold gap size added by scaff10x (if used). 
                   If a breakpoint is close to a gap region, break10x checks if the gap was added by  
                   scaff10x when joining two contigs (using the same 10X data). 
                   If it was, the scaffold is broken. If the scaffolding was done by other 
                   means (different length from scaff10x gap added), it will not be broken as the 
                   gap could be very big and 10X barcodes might not cross over. [ default = 100 ]
        
break10x output

The command break10x will output the assembly after scaffold breaking (scaffolds-break.fasta) and
the list of breaking points.

The list of breaking point  in scaffolds-break.name has the format:

        Break2: Scaffold_name  Scaffold_index  Position Scaffold_Length Minimum_Coverage Average_Coverage  Gap_Length
        
        Scaffold_name   name of scaffold broken by break10x
        Scaffold_index  index of scaffold broken by break10x
        Position        Break-point position
        Scaffold_length  length of scaffold
        Minimum_Coverage  minimum barcode coverage in the breakpoint region
        Average_Coverage  scaffold average barcode coverage
        Gap_Length        gap length ( 0 indicates a contig break)

Example:
        
        Break2: Scaff10x_0 0 1527414 15892392 1 114 100
        Break2: Scaff10x_0 0 7246257 15892392 0 114 100
        Break2: Scaff10x_2 2 17232850 17798733 21 176 100
        Break2: Scaff10x_9 9 3709193 3904076 37 451 100
        Break1: Scaff10x_10 10 3713358 5150262 0 131 0

===================
Other applications:
===================
           
Process barcodes and output paired gzip files
    
     $ /full/path/to/Scaff10X/src/scaff_reads -nodes <nodes>  input.dat genome-BC_1.fastq.gz genome-BC_2.fastq.gz  \
       Here input.dat is a text file to point the locations of the reads in paired files  \
       The paired reads can also be used for other applications such as genomeScope or jellyfish etc
            
Run scaff10x in another way

      $ /full/path/to/Scaff10X/src/scaff10X -nodes <nodes> -align <aligner> -score <score> \
        -matrix <matrix_size> -read-s1 <min_reads_s1> -read-s2 <min_reads_s2> -longread <aggressive> -gap <gap_size> \
        -edge <edge_len> -link-s1 <n_links_s1> -link-s2 <n_links_s2> -block <block>  \
        draft-asssembly.fasta genome-BC_1.fastq.gz genome-BC_2.fastq.gz output_scaffolds.fasta \
            
Run break10x in another way   

      $ /full/path/to/Scaff10X/src/break10x -nodes <nodes>  -score <score> -reads <min_reads> \
        -gap <gap_size> -cover <cover> -ratio <ratio> \
        scaffolds.fasta genome-BC_1.fastq.gz genome-BC_2.fastq.gz scaffolds-break.fasta scaffolds-break.name 

Source: README, updated 2019-04-02