Roche 454 Platforms
Celera Assembler supports 454 data. The 454 platform includes a series of DNA sequencing machines and software sold by 454 Life Sciences, a division of Roche Diagnostics. As of 2008, three models have appeared in succession, each making its predecessor obsolete.
454 GS 20
This platform, available since 2005, is also known as the 454 Genome Sequencer. The machine produces 20Mbp per run in reads of ~100bp each. As described in Nature (Margulies 2005), it is capable of deciphering a small bacterial genome in one run. A supplement to the publication described the Newbler assembler for 454 reads. Newbler is capable of mapping reads to a reference. It is also a de novo assembler, like Celera Assembler.
Celera Assembler support for the GS 20 was described in PNAS (Goldberg 2006). The Goldberg method is indirect. The GS 20 reads are first assembled by Newbler. Newbler's large contigs are selected by a size filter. The large contigs are shredded into pseudo-Sanger reads. The Celera Assembler executive script, runCA, accepts the ACE file generated by Newbler, and generates fixed-length (default=600bp) pseudo-reads from the Newbler contigs. These "shreds" are tiled to approximate the read coverage of the contigs. The shreds are processed by Celera Assembler in combination with true Sanger paired-end reads. Celera Assembler exploits the Sanger mates to build longer contigs and scaffolds. The Celera Assembler output may or may not be a superset of the Newbler contigs.
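The shredding step can be illustrated with a minimal Python sketch. This is not runCA's actual code; the function name shred_contig and the coverage parameter are hypothetical, and the only value taken from the text above is the 600bp default shred length. The sketch assumes the contig's read coverage is already known (for example, estimated from the ACE file) and spreads the shred start positions evenly along the contig.

```python
def shred_contig(contig, shred_len=600, coverage=8.0):
    """Tile a contig with fixed-length pseudo-reads ("shreds").

    contig    -- contig consensus sequence as a string
    shred_len -- length of each shred in bp (600 is the default named in the text)
    coverage  -- hypothetical parameter: the depth the shreds should approximate
    """
    n = len(contig)
    if n <= shred_len:
        # A contig no longer than one shred is returned whole.
        return [contig]
    # Number of shreds so that (count * shred_len) is about (coverage * n).
    count = max(2, round(coverage * n / shred_len))
    last_start = n - shred_len
    # Evenly spaced starts: the first shred begins at 0, the last ends at n.
    starts = [round(i * last_start / (count - 1)) for i in range(count)]
    return [contig[s:s + shred_len] for s in starts]
```

For example, a 2,000bp contig shredded at coverage=3 yields ten 600bp shreds, the first starting at the contig's first base and the last ending at its final base.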
As of 2009, Celera Assembler can accept GS 20 reads directly. It is necessary to convert GS 20 data files to Celera's FRG input format using sffToCA. For best results, the GS 20 reads should be combined with longer reads and paired-end reads from other platforms.
454 FLX
This platform, available since early 2007, is also known as 454 GS FLX Standard. This machine produces 100Mbp per run in reads of ~250bp each. When used on circularized or paired-end libraries, some (e.g. 30%) reads contain two mate tags per read. Each mate tag is ~100bp and the tags are separated by a 44bp palindromic linker sequence. With available library protocols, the size of the circle, and thus the mate separation on the genome, is about 3Kbp.
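To make the paired-end read layout concrete, here is a minimal Python sketch that splits one read into its two mate tags at the linker. It is an illustration only, not the method used by any 454 or Celera tool: the function name split_mate_tags and the min_tag_len threshold are hypothetical, the linker is passed in by the caller rather than hard-coded, and the search is an exact string match, whereas real reads contain sequencing errors and would need an approximate match.

```python
def split_mate_tags(read, linker, min_tag_len=20):
    """Split a paired-end read into two mate tags around the linker.

    Returns (left_tag, right_tag), or None if the linker is absent or
    either flanking tag is shorter than min_tag_len (a hypothetical cutoff).
    """
    pos = read.find(linker)  # exact match only; an assumption of this sketch
    if pos < 0:
        return None
    left = read[:pos]
    right = read[pos + len(linker):]
    if len(left) < min_tag_len or len(right) < min_tag_len:
        return None
    return left, right
```

A read without the linker returns None and would be treated as an unpaired read.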
Celera Assembler support for FLX began in version 5.0. Version 5 was called CABOG since it used Celera Assembler's Best Overlap Graph. CABOG was described in Bioinformatics (Miller 2008). On bacterial sets of just FLX unpaired reads, CABOG was equivalent to Newbler. On data sets that included FLX plus paired-end mates from any source (FLX or Sanger), CABOG produced longer contigs and scaffolds than any other software. On very large data sets such as 8X from the human genome, CABOG was the only software capable of running to completion.
Celera Assembler's runCA executive script automatically launches the CABOG pipeline if any of its inputs are FLX reads. The equivalent manual commands to runCA are unitigger=bog and overlapper=mer. (The commands to disable CABOG are unitigger=utg and overlapper=ovl.) These commands can be supplied on the command line or in a spec file.
Celera Assembler can read the SFF files generated by the FLX platform. On seeing SFF files, the runCA executive script invokes a stand-alone pre-processor called sffToCA. The pre-processor converts SFF to Celera's FRG file format. The pre-processor can parse FLX standard paired-end reads. The pre-processor copies the base calls and quality values (QV's) into its FRG output. By default, the pre-processor ignores all reads that include even one ambiguous base call (denoted N). By default, the pre-processor ignores the clear range in the SFF file and uses the entire read. (Celera Assembler's trimming module will determine a better clear range based on confirmation by other reads and not just QV's.)
The pre-processor removes reads that are a perfect prefix of any other read. This overcomes the "perfect duplicates" problem common to FLX runs. For all duplicates to be removed, it is necessary to feed all the SFF files into the same run of the pre-processor. On every FLX read record in the FRG file, the pre-processor adds the tag forceBOGunitigger=1. This forces runCA to use the CABOG pipeline on the entire data set.
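Two of the default filters described above, dropping reads with an ambiguous base call and removing reads that are a perfect prefix of another read, can be sketched in Python as follows. This is not sffToCA's implementation; the function name and the dict-of-reads input are assumptions made for illustration. The sketch relies on one property of sorted strings: if a sequence is a prefix of any other sequence, it is also a prefix of its immediate successor in sorted order. It also shows why all reads must go through a single run, since a duplicate can only be detected against reads that are present in the same input.

```python
def drop_n_and_prefix_duplicates(reads):
    """Filter reads given as a dict of {read_name: sequence}.

    1. Discard any read containing an ambiguous base call (N).
    2. Discard any read whose sequence is a perfect prefix of another read
       (identical reads count as prefixes of each other; one copy is kept).
    Returns a dict of the surviving reads.
    """
    clean = [(seq.upper(), name) for name, seq in reads.items()
             if "N" not in seq.upper()]
    clean.sort()  # a prefix always sorts directly before a read that extends it
    kept = {}
    for i, (seq, name) in enumerate(clean):
        nxt = clean[i + 1][0] if i + 1 < len(clean) else None
        if nxt is not None and nxt.startswith(seq):
            continue  # perfect prefix (or exact duplicate) of the next read
        kept[name] = seq
    return kept
```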
The 454 platform delivers sequence data in SFF files. The files are produced by 454's proprietary signal processing software. Recent versions (since 2008) produce better QV's than early versions. Our SFF parser detects the software version by searching for the XML element "<qualityScoreVersion>1.1.03</qualityScoreVersion>" in the SFF manifest. The parser will complain "WARNING: Fragments not rescored!" if this XML element is not found. In our experience, it would improve the assembly to discard these SFF files and generate new ones by re-processing the raw data with newer 454 signal processing software.
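The version check amounts to a substring search. A minimal Python sketch, assuming the manifest text has already been extracted from the SFF file into a string (the extraction itself is not shown, and the function name rescoring_warning is hypothetical):

```python
RESCORED_TAG = "<qualityScoreVersion>1.1.03</qualityScoreVersion>"

def rescoring_warning(manifest_xml):
    """Return the parser's warning text if the manifest lacks the
    expected quality-score version element, otherwise None."""
    if RESCORED_TAG in manifest_xml:
        return None
    return "WARNING: Fragments not rescored!"
```

A caller could use the returned warning to flag SFF files that are candidates for re-processing with newer signal processing software.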
454 XLR
This platform, available since late 2008, is also known as the 454 GS FLX XLR Titanium. This machine produces 500Mbp per run in reads of ~350bp each. The range of read sizes is about 80bp to 500bp. When used on circularized or paired-end libraries, 30% - 50% of the reads contain two mate tags per read. Each mate tag is ~150bp and the tags are separated by the 42bp "recombi" XLR linker sequence. With available protocols, the genomic mate separations are either 3Kbp or 20Kbp.
Celera Assembler supported Titanium starting with the 5.3 release in February 2009.
For best assemblies, the data should include a high density of paired-end mated reads. Low mate density will lead to fractured scaffolds and lots of sequence in degenerate contigs. We are working on algorithms to repair the fracture given low mate density.
An XLR run on a microbe gives very high coverage. The high coverage can lead to fractured contigs or lots of degenerates. One solution is to assemble a random subset of reads. We are working on coverage-sensitive thresholds and that might remedy this problem.
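The random-subset approach can be sketched in Python. The function below is an illustration, not part of Celera Assembler; its name and parameters (genome_size, mean_read_len, target_coverage) are assumptions. It picks as many reads as are needed to reach a target coverage, using a fixed seed so the subset is reproducible.

```python
import random

def subsample_reads(read_names, genome_size, mean_read_len, target_coverage, seed=0):
    """Return a random subset of read names giving roughly target_coverage.

    The read count is estimated as target_coverage * genome_size / mean_read_len.
    If that many reads are not available, all reads are returned.
    """
    names = list(read_names)
    wanted = int(target_coverage * genome_size / mean_read_len)
    if wanted >= len(names):
        return names
    rng = random.Random(seed)  # fixed seed for a reproducible subset
    return rng.sample(names, wanted)
```

For paired-end libraries, one would presumably sample mate pairs as units rather than individual reads, so that mates are not separated.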
With XLR reads, obey the clear range predicted by 454 software. The sequence beyond the clear range is generally bad. Celera Assembler's overlap-based trimming, OBT, does not recover good sequence beyond the clear range. OBT does sometimes restrict the clear range further to augment overlaps. Note the 454 clear range sequence sometimes includes ambiguous base calls (the letter N).
There is a known problem with Celera Assembler on XLR Titanium data. The mer version of the overlapper is recommended for all 454 data. However, it can run for very long times on XLR data. The behavior is inconsistent and the run time has nothing to do with genome size. Large genomes have run very quickly. We are investigating. For a work-around, we recommend trying the mer overlapper first, but if it runs too long, fall back to the ovl overlapper. One of the commands overlapper=mer or overlapper=ovl can go in your runCA spec file.
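The work-around can be automated with a small wrapper. The Python sketch below is a generic illustration: the function name run_with_fallback and the time limit are hypothetical, and the two command lines are supplied by the caller (for example, one runCA invocation using overlapper=mer and another using overlapper=ovl). How long is "too long", and whether the fallback run needs a fresh output directory, are left to the user.

```python
import subprocess

def run_with_fallback(primary_cmd, fallback_cmd, timeout_s):
    """Run primary_cmd; if it exceeds timeout_s seconds, stop it and run
    fallback_cmd instead. Each command is a list of argument strings.

    Returns "primary" or "fallback" to report which command completed.
    A non-zero exit status raises subprocess.CalledProcessError.
    """
    try:
        subprocess.run(primary_cmd, check=True, timeout=timeout_s)
        return "primary"
    except subprocess.TimeoutExpired:
        # The timed-out child has been killed by subprocess.run.
        subprocess.run(fallback_cmd, check=True)
        return "fallback"
```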
Standard Operating Procedures
See the SFF SOP for converting SFF files to FRG.
