Name Modified Size Downloads / Week Status
Parent folder
Totals: 5 Items   23.8 MB 6
README 2010-04-30 13.7 kB 11 weekly downloads
wgs-6.1-Darwin-i386.tar.bz2 2010-04-30 5.0 MB 11 weekly downloads
wgs-6.1-FreeBSD-amd64.tar.bz2 2010-04-30 6.3 MB 11 weekly downloads
wgs-6.1-Linux-amd64.tar.bz2 2010-04-30 10.7 MB 22 weekly downloads
wgs-6.1.tar.bz2 2010-04-30 1.9 MB 11 weekly downloads
These are Release Notes for Celera Assembler version 6.1, released May 1st, 2010. This distribution package provides a stable, tested, documented version of the software. The distribution is usable on most Unix-like platforms, and some platforms have pre-compiled binary distributions ready for installation. The source code package includes full source code, Makefiles, and scripts. A subset of the kmer package (http://kmer.sourceforge.net/, version r1821), used by some modules of Celera Assembler, is included. This package was prepared by scientists at the J. Craig Venter Institute (http://www.jcvi.org/) with funding provided by the National Institutes of Health (http://www.nih.gov/). Full documentation can be found online at http://wgs-assembly.sourceforge.net/. Citation Please cite Celera Assembler in publications that refer to its algorithm or its output. The standard citation is the original paper [Myers et al. (2000) A Whole-Genome Assembly of Drosophila. Science 287 2196-2204]. More recent papers describe modifications for human genome assembly [Istrail et al. 2004; Levy et al. 2007], metagenomics assembly [Venter et al. 2004; Rusch et al. 2007], haplotype separation [Levy et al. 2007; Denisov et al. 2008], and Sanger+pyrosequencing hybrid assembly [Goldberg et al. 2006; Miller et al. 2008]. There are links to these papers, and more, in the on-line documentation (http://wgs-assembler.sourceforge.net/). Compilation and Installation The source code package needs to be compiled and installed before it can be used. The following commands will work on nearly all Unix-like platforms. % bzip2 -dc wgs-6.1.tar.bz2 | tar -xf - % cd wgs-6.1 % cd kmer % sh configure.sh % gmake install % cd ../src % gmake % cd .. Binary distributions need only be unpacked. % bzip2 -dc wgs-6.1-Linux-amd64.tar.bz2 | tar -xf - Example Here is one example of how to use Celera Assembler. . We download DNA sequence from the NCBI TraceDB hosted by the National Institutes of Health. We request paired-end fragment reads generated by Sanger chemistry on a bacterial genome. We convert the data to Celera Assembler format (a *.frg file). Then we launch the assembler software. Please see the expanded version of this example for more details. % ftp ftp ftp.ncbi.nih.gov ftp> cd pub/TraceDB/porphyromonas_gingivalis_w83 [...] ftp> ls -l 229 Entering Extended Passive Mode (|||50055|) 150 Opening ASCII mode data connection for file list -r--r--r-- 1 ftp anonymous 803288 Feb 19 17:29 anc.porphyromonas_gingivalis_w83.001.gz -r--r--r-- 1 ftp anonymous 226576 Feb 19 17:29 clip.porphyromonas_gingivalis_w83.001.gz -r--r--r-- 1 ftp anonymous 10261474 Feb 19 17:29 fasta.porphyromonas_gingivalis_w83.001.gz -r--r--r-- 1 ftp anonymous 18039942 Feb 19 17:29 qual.porphyromonas_gingivalis_w83.001.gz -r--r--r-- 1 ftp anonymous 1143755 Feb 27 08:01 xml.porphyromonas_gingivalis_w83.001.gz 226 Transfer complete. ftp> bin ftp> prompt ftp> mget fasta* qual* xml* ftp> bye Convert the NCBI files to files suitable as Celera Assembler input. The last trace-to-frg step takes about a minute. The 'ls' command demonstrates the expected output files. % perl wgs/Linux-amd64/bin/tracedb-to-frg.pl -xml xml.porphyromonas_gingivalis_w83.001.gz % perl wgs/Linux-amd64/bin/tracedb-to-frg.pl -lib xml.porphyromonas_gingivalis_w83.001.gz WARNING: creating bogus distance for library 'T0611' WARNING: creating bogus distance for library 'T0612' WARNING: creating bogus distance for library 'T13146' WARNING: creating bogus distance for library 'T24315' porphyromonas_gingivalis_w83.001: frags=40039 links=17113 (85.48%) % perl wgs/Linux-amd64/bin/tracedb-to-frg.pl -frg xml.porphyromonas_gingivalis_w83.001.gz Found 20020 in tafrg-porphyromonas_gingivalis_w83/porphyromonas_gingivalis_w83.001.frglib. % ls -l *.frg* -rw-r--r-- 1 bri bri 550 Jul 15 21:54 porphyromonas_gingivalis_w83.1.lib.frg -rw-r--r-- 1 bri bri 19119148 Jul 15 21:55 porphyromonas_gingivalis_w83.2.001.frg.bz2 -rw-r--r-- 1 bri bri 704432 Jul 15 21:54 porphyromonas_gingivalis_w83.3.lkg.frg Run Celera Assembler with runCA. % time perl wgs/Linux-amd64/bin/runCA -p pging -d testassembly porphyromonas_gingivalis_w83.* [...lots of status output...] [...about four minutes later...] % ls -l testassembly/9-terminator/*fasta -rw-r--r-- 1 bri bri 2373611 Jul 15 22:07 testassembly/9-terminator/pging.ctg.fasta -rw-r--r-- 1 bri bri 39177 Jul 15 22:07 testassembly/9-terminator/pging.deg.fasta -rw-r--r-- 1 bri bri 2392028 Jul 15 22:07 testassembly/9-terminator/pging.scf.fasta -rw-r--r-- 1 bri bri 445876 Jul 15 22:07 testassembly/9-terminator/pging.singleton.fasta -rw-r--r-- 1 bri bri 2692602 Jul 15 22:07 testassembly/9-terminator/pging.utg.fasta Changes Celera Assembler 6 made major changes to the format of the data stores used during an assembly. Because of this, it is not possible to use any previous version of Celera Assembler with data generated by Celera Assembler 6.1. Likewise, Celera Assembler 6.1 is not able to use data generated by any previous version. Specifically, any assemblies started with any previous version cannot be finised with Celera Assembler 6.1; they must be recomputed from scratch. Features Available Starting in 6.1 1. More Reads: Designed for Sanger sequencing technology, Celera Assembler 4 could comfortably handle tens of millions of reads. Celera Assembler 5 eventually handled hundreds of millions of reads. Celera Assembler 6.1 should be able to handle up to a billion reads, though we haven't tested this. 2. Duplicate 454 Reads: In previous versions, 454 reads that were a prefix of some other 454 read were removed, but only if the prefix of the reads were identical. Starting in Celera Assembler 6.1, errors are allowed in the prefix match. Additionally, duplicate mate-pairs derived from the same molecule are detected and removed. 3. Illumina Data: Celera Assembler 6.1 introduces native support for DNA sequencing reads from the Illumina (or Solexa) platform. Celera Assembler reads three varieties of fastq files, though it requires an FRG meta-data file that can be generated with Celera Assembler 's fastqToCA utility. Celera Assembler 6.1 has been validated using 100x of Illumina GA II 100bp reads, paired at 200bp insert size, from escherichia coli. The quality of support for shorter reads or larger genomes is still unknown. The minimum read and overlap lengths remain at 64bp and 40bp, respectively, but these can be adjusted at compile time. Celera Assembler has almost no code specific to the sequencing platform. Therefore, Celera Assembler can assemble mixtures of read types including Sanger, 454, and Illumina. 4. Consensus: Celera Assembler 6.1 experiences many fewer consensus failures than Celera Assembler 5. The consensus code is more reliable due to extensive refactoring, problem discovery, and improvement. 5. Phasing: By user request, there is a new option to control the phasing of variantes. With phasing off, the majority allele at every variant column is promoted to the consensus. With phasing on, the choice of majority allele could reflect a group of neighbor variant columns. As a result, the minority base at some columns could get promoted to the consensus. Phasing was deemed unacceptable for some applications. Phasing was always enabled in Celera Assembler 5; it is disabled by default in Celera Assembler 6. See the cnsPhasing option. 6. Closure/Finishing: Celera Assembler 6.1 offers additional support for finishing. This refers to improving the quality of an assembly by analysis, manipulation, targeted sequencing, and re-assembly. The pipeline now accepts closure reads and their placement constraints. Specifically, the input FRG format has a new PLC placement message type. The assembler considers these constraints, not as absolutes, but on par with other information. It uses them, not only to close gaps, but throughout the pipeline. The constraints could be derived from PCR primer sites and apply to the corresponding PCR reads, for example. 7. Pipeline Engineering: The traditional Celera Assembler used its own message passing interface wherein information passed from module to module in large (text or binary) files. Over the years, Celera Assembler has shifted to the use of its own database structure, called Store. Previous versions of Celera Assembler introduced the read database (gkpStore) and the overlap database (ovlStore). Celera Assembler 6.1 introduces a database for unitigs and contigs (tigStore) to replace the *.cgb message files. This change reduces I/O. It also facilitates debugging and mid-course alterations to problematic assemblies. There are utilities for dumping, querying, and updating the database. Improvements 1. Unitig Toggling: The unitig toggling process is supported directly from runCA. The algorithm now recognizes and toggles an additional case: a false-repeat unitig preventing the join of two scaffolds by being placed on the end of one scaffold and the beginning of the other. 2. Scaffolding: An experimental option was added to eject a contig in a scaffold when it has no overlaps to its neighbors ([[runCA#). 3. Scaffolding: Experimental option to revert current gap in a scaffold to a saved state while processing rocks/stones to see if an overlap now exists. 4. Long 454 Reads: Very long bad reads sometimes generated on 454 platforms are correctly trimmed. 5. Disk Usage: The intermediate output files from the overlap stage are now removed after the corresponding overlap store is created. 6. Automatic Selection of Unitigger: runCA automatically selects the correct unitigger to use based on the input fragments. Previously it was not possible to override this selection in some instances. It is now possible to specify the unitigger to use. 7. Scaffolding: An ancient bug was discovered and fixed. When evaluating if a pair of scaffolds can be merged, the number of satisfied and unsatisfied mate pairs are examined. If the merged scaffolds add too many unsatisfied mate pairs, the merge is not attempted. This examination was flawed, resulting in mate pairs never preventing a scaffold merge. The merge could still be prevented if there is no sequence alignment between the scaffolds. 8. Several existing options were removed: * bogEjectUnhappyContain was removed. It is always enabled now. * cgwOutputIntermediate was removed. The need for this option was removed by allowing terminator to read input directly from the tigStore and a CGW checkpoint file. 9. Several existing options were renamed: * doOverlapTrimming was renamed to doOverlapBasedTrimming. The original option is still allowed, but will be removed in a future release. * bogPromiscuous was renamed to bogBreakAtIntersections. Note that the sense of this option is reversed: bogPromiscuous=1 is the same as bogBreakAtIntersections=0. 10. Several new options were added: * utgErrorLimit was added. * stopBefore was added. * sgeName was added. * doDeDuplication was added. * doMerBasedTrimming was added. * saveOverlaps was added. * kickOutNonOvlContigs was added. * doUnjiggleWhenMerging was added. * cnsPhasing was added. * toggleMaxDistance was added. * toggleDoNotDemote was added. Bug Fixes The full list of changes is available, however many of these changes are due to problems added since Celera Assembler 5.4 was released. Major bug fixes relating to bugs present in previous releases are listed in the "Improvements" section above. Known Problems These may be fixed in future releases. 1. In sffToCA, paired-end read lengths are calculated incorrectly in rare cases when '-trim hard' is used. Until the issue is resolved, users should always use '-trim chop' on 454 paired-end libraries. 2. There is no explicit support for the high-coverage induced by XLR runs on small genomes. Coverage such as 80X induces combinations of sequencing errors that confound Celera Assembler. At best this leads to higher reported rates of allelic variation. At worst this leads to a fractured assembly. Sampling from high-coverage reads can yield a better assembly. 3. There is little support for assembly of data sets with a small ratio of mate pairs to unpaired fragments. This ratio happens when XLR is applied to many fragment libraries but few paired-end libraries. This ratio impedes the assembly of unitigs into contigs and scaffolds. It leads to fewer bases in scaffolds and more bases in degenerate contigs. 4. There is no support for bar-coded 454 data. Users with bar-coded data may use some other utility to remove the bar code sequence and partition reads by bar code into separate SFF files. 5. There is no support for data from ABI SOLiD. 6. There is no support for cDNA, exon-enriched DNA, or DNA amplified with bias of any sort. Legal Copyright 1999-2004 by Applera Corporation. Copyright 2005-2009 by the J. Craig Venter Institute. The Celera Assembler software, also known as the wgs-assembler and CABOG, is open-source and available free of charge subject to the GNU General Public License, version 2.
Source: README, updated 2010-04-30