
SFF SOP

From wgs-assembler


Here are Standard Operating Procedures (SOP) for parsing and assembling DNA sequence data stored in SFF files. SFF is the file format generated by software on 454 sequencing platforms such as 454 FLX and 454 XLR. Unfortunately, SFF files do not contain enough meta-data to allow Celera Assembler to infer library or platform information. Thus, each (library,platform) data set must be processed separately.

Procedure

First convert your SFF files to FRG format with the sffToCA utility.

  • You can feed compressed files to sffToCA. Upon seeing the filename extension *.bz2 or *.gz, the sffToCA utility will pipe the file through bunzip2 or gunzip, respectively.
  • Run 'sffToCA' with no options to see the usage instructions.
  • Remove duplicate reads. DO NOT enable the -nodedup option. Duplicate reads are common in SFF files and they shatter Celera Assembler assemblies.
  • For every library, process all its SFF files in one sffToCA run. This will allow the program to recognize and remove duplicate reads across the library, even if those duplicates are spread across SFF files.
  • Pay attention to clear range differences between FLX and XLR Titanium. Since the SFF file does not say whether it came from FLX or XLR, you need to use sffToCA command-line parameters to achieve these recommended behaviors.
    • On FLX data, the reads tend to be short and reliable. The clear range usually includes almost the entire read. We recommend using entire reads, regardless of the 454 clear range. The corresponding option in CA 6.0 is 'sffToCA -trim none'.
    • On FLX data, reads rarely contain an N. The N-positive reads are sometimes problematic. Being cautious, we recommend discarding every read that contains even one N, whether the N is inside or outside the clear range. The corresponding option in CA 6.0 is 'sffToCA -clear discard-n'.
    • On XLR data, the longer reads tend to have trash at the 3' end. We recommend accepting the 454 clear range. The following applies to CA 6.0. For unpaired fragment runs, we recommend 'sffToCA -trim hard', which establishes the 454 clear range as the maximum. For paired-end runs, we recommend 'sffToCA -trim chop', which ERASES the bases outside the 454 clear range. This extra step keeps the linker detection algorithm from noticing linker-like sequence outside the clear range.
    • On XLR data, many reads contain an N. The 454 clear range sometimes includes a few bases beyond the start of N's. We recommend accepting the 454 clear range with 'sffToCA -clear 454'. When high volumes of data allow it, we are more conservative and use 'sffToCA -clear n' or 'sffToCA -clear pair-of-n'. These options reset the end of the clear range to just before the first N or the first pair of N's, respectively.
  • For SFF files from paired-end libraries, pay attention to the linker. The linker is non-genomic sequence present in paired-end reads. The SFF file does not specify the linker explicitly, and sffToCA does not try to discover it. You need to identify the linker. Use the built-in linker behavior ('-linker flx' or '-linker titanium') if possible. If you must specify the DNA sequence, then use the '-linker' option once to provide the forward strand and again to provide the reverse-complement strand. (The FLX linker was a palindrome -- same string forward or reverse-complement. In that special case, one '-linker' option would do.)
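When you must give the linker as DNA sequence, the reverse-complement strand can be derived from the forward strand. The following is a minimal sketch, not part of the Celera Assembler distribution. The helper name revcomp and the sequence in LINKER_FWD are hypothetical placeholders, and the script assumes the common 'rev' utility is installed. It only prints the '-linker' options and does not run sffToCA.

```shell
#!/bin/sh
# Reverse-complement a DNA string: reverse it, then swap A<->T and C<->G.
# 'revcomp' is a hypothetical helper name, not an sffToCA feature.
revcomp() {
    printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}

# HYPOTHETICAL placeholder sequence; substitute the real linker for your library.
LINKER_FWD="GATTACAGGC"
LINKER_REV=$(revcomp "$LINKER_FWD")

if [ "$LINKER_FWD" = "$LINKER_REV" ]; then
    # Palindromic linker: forward and reverse-complement are the same string,
    # so one -linker option is enough.
    LINKER_OPTS="-linker $LINKER_FWD"
else
    # Otherwise give -linker twice, once per strand.
    LINKER_OPTS="-linker $LINKER_FWD -linker $LINKER_REV"
fi

# Print the options so they can be pasted into an sffToCA command line.
echo "$LINKER_OPTS"
```

Prefer '-linker flx' or '-linker titanium' whenever the built-in behavior fits your library; the sketch is only for the case where the sequence must be given explicitly.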

Second, launch the runCA executive against your FRG files.

  • Read the extensive on-line user guide for runCA.
  • Use the mer overlapper, overlapper=mer. Unlike the ovl overlapper, the mer module is aware of homopolymer run length uncertainty. (Unfortunately, it runs slower on some XLR data. [Celera Assembler 5.3 version]).
  • The Best Overlap Graph (BOG) unitigger is enabled (and difficult to disable) for 454 data. Compared to the utg unitigger, the BOG unitigger is less sensitive to read length heterogeneity.
  • Set the runCA unitig error rate parameter. This parameter determines the tolerance for sequencing error. Put another way, it sets the maximum difference allowed between two reads for the overlap to be used in a unitig.
    • The default is 1.5% (0.015), which works well for Sanger, and Sanger + FLX mixes.
    • We've seen better results using 3.0% (0.03) with Titanium data.
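The error rate guidance above can be captured in a small wrapper. This is a sketch only: the function name utg_error_rate and its data-type labels are hypothetical, while the runCA options are the ones used in the Example on this page. The script prints the command instead of launching the assembler.

```shell
#!/bin/sh
# Pick a unitig error rate from the kind of reads being assembled.
# The labels (sanger, sanger+flx, titanium) are hypothetical names for this sketch.
utg_error_rate() {
    case "$1" in
        titanium)          echo 0.03 ;;   # 3.0%, better results seen with Titanium data
        sanger|sanger+flx) echo 0.015 ;;  # 1.5%, the default
        *) echo "unknown data type: $1" >&2; return 1 ;;
    esac
}

RATE=$(utg_error_rate titanium)

# Dry run: print the runCA command rather than executing it.
# The glob stays unexpanded because it is inside double quotes.
echo "runCA -p mygenome -d mydir overlapper=mer unitigger=bog utgErrorRate=$RATE *.frg"
```

Printing the command first makes it easy to check the chosen rate before committing to a long assembly run.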

Tip

Users of Celera Assembler 5.4 should consider using the sffToCA utility from the CVS tip. Starting in July 2009, we are improving the linker screen in that program. This is to repair a phenomenon observed in some 5.4 assemblies: linker sequence in contigs, placed at contig ends where it presumably prevented further contig extension. (Linker at contig middles was reported but not observed by us.) After a round of tuning, the sffToCA program detects and removes significantly more linker sequence.

The new sffToCA has more sensitive linker detection. In our tests on Titanium and FLX linker, the new version deletes reads with multiple linkers, finds all the one-copy high-fidelity linker, trims all the medium-fidelity linker, and marks all the low-fidelity linker so that OBT will treat it as a possible contaminant and demand confirmation by spanning reads.

The new sffToCA is not yet tuned for efficiency. It is twice as slow as the one in the 5.4 release.

While sffToCA is compatible with CA 5.4, the rest of the programs in the CVS tip are not. In particular, the primary data store, the gkpStore, has changed.

Here is how to use sffToCA from the CVS tip.

  1. Do a cvs checkout into a fresh directory. Follow the 'checkout' instructions here.
  2. The code at the CVS tip uses features of the GNU C++ compiler. As an interim measure, you will have to convert to C++ before compiling: run the rename-to-c++.sh shell script in the src directory.
  3. Build the entire CA suite even though you need just one binary. Follow the 'make' instructions here.
  4. Convert your SFF files to FRG files with the newly built sffToCA.
  5. Move your build out of the way or delete it. Assemble your FRG files with the Celera Assembler release 5.4 pipeline.

Example

Suppose we want to assemble the reads in these SFF files:

  • FLX_A01.sff, FLX_A02.sff, FLX_A03.sff.bz2, FLX_A04.sff.gz - Library_A, unpaired, FLX, four files from two runs. Two files are compressed.
  • FLX_B01.sff, FLX_B02.sff - Library_B, paired, mean=3Kbp, stdev=300bp, FLX, two half-plates.
  • XLR_C01.sff, XLR_C02.sff - Library_C, unpaired, XLR, two half-plates.
  • XLR_D01.sff, XLR_D02.sff - Library_D, paired, mean=3Kbp, stdev=310bp, XLR sequence, FLX linker.
  • XLR_E01.sff, XLR_E02.sff - Library_E, paired, mean=20Kbp, stdev=1999bp, XLR sequence, XLR recombi linker.
% sffToCA -trim soft -clear discard-n -libraryname LIB_A -output A.frg FLX_A??.sff*
% sffToCA -trim soft -clear discard-n -linker flx -insertsize 3000 300 -libraryname LIB_B -output B.frg FLX_B*.sff
% sffToCA -trim hard -clear 454 -libraryname LIB_C -output C.frg XLR_C*.sff
% sffToCA -trim chop -clear 454 -linker flx -insertsize 3000 310 -libraryname LIB_D -output D.frg XLR_D*.sff
% sffToCA -trim chop -clear 454 -linker titanium -insertsize 20000 1999 -libraryname LIB_E -output E.frg XLR_E*.sff

% runCA -p mygenome -d mydir doOverlapTrimming=1 overlapper=mer unitigger=bog utgErrorRate=0.03 *.frg

Users may want to experiment with trim parameters on their own data. For FLX standard reads, we have found little bad sequence beyond the 454 clear range, so the trim option makes little difference; '-trim soft' lets CA use 454-trimmed bases if it chooses. For Titanium reads, we find lots of bad data at the read ends, so we choose to trust the 454 clear range. For Titanium unpaired reads, '-trim hard' may be slightly better than '-trim chop'. For Titanium paired reads, there is danger in letting CA see linker sequence outside the 454 clear range, so '-trim chop' seems safer. However, versions of sffToCA after the CA 5.4 release are not confused by linker sequence outside the 454 clear range. Thus '-trim chop' and '-trim hard' will probably give equivalent results for people using the CVS tip or any post-5.4 version.
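The choices made in the Example can be written down as a lookup, which is a convenient starting point for experimenting. This is a sketch under stated assumptions: the function name sff_opts and its argument labels are hypothetical, and the option values simply mirror the Example commands on this page (which use '-trim soft' for FLX). The script prints a command and does not run sffToCA.

```shell
#!/bin/sh
# Print the -trim/-clear options used in this page's Example for each data type.
# Arguments: platform (flx|titanium) and layout (unpaired|paired).
# 'sff_opts' and its labels are hypothetical names for this sketch.
sff_opts() {
    case "$1/$2" in
        flx/unpaired|flx/paired) echo "-trim soft -clear discard-n" ;;
        titanium/unpaired)       echo "-trim hard -clear 454" ;;
        titanium/paired)         echo "-trim chop -clear 454" ;;
        *) echo "unknown combination: $1/$2" >&2; return 1 ;;
    esac
}

# Dry run for Library_E from the Example: print the command, do not execute it.
# The glob stays unexpanded because it is inside double quotes.
echo "sffToCA $(sff_opts titanium paired) -linker titanium -insertsize 20000 1999 -libraryname LIB_E -output E.frg XLR_E*.sff"
```

To compare trim settings on your own data, change one entry in the lookup, regenerate the FRG file for that library, and assemble again.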