Celera Assembler: scientific software for biological research. Celera Assembler is a de novo whole-genome shotgun (WGS) DNA sequence assembler. It reconstructs long sequences of genomic DNA from the fragmentary reads produced by whole-genome shotgun sequencing. Celera Assembler has enabled many advances in genomics, including the first whole-genome shotgun sequence of a multicellular organism (Myers 2000) and the first diploid sequence of an individual human (Levy 2007). Development began at Celera Genomics in 1999. The software was released on SourceForge in 2004 as wgs-assembler under the GNU General Public License. The pipeline revised for 454 data was named CABOG (Miller 2008).
Celera Assembler can use any combination of reads from:
- dideoxy (Sanger) sequencing platforms such as the Applied Biosystems 3730 DNA Analyzer and 3730xl DNA Analyzer
- pyrosequencing platforms such as the 454 Life Sciences Genome Sequencer FLX Titanium and GS Junior
(Reads from the discontinued Genome Sequencer 20, and from the Genome Sequencer FLX before Titanium reagents, are supported as well.)
- sequencing by synthesis platforms such as the Illumina HiSeq 2000, Genome Analyzer IIx and Genome Analyzer IIe
(Reads shorter than 75 bp are not supported.)
- single-molecule sequencing platforms such as the Pacific Biosciences PacBio RS (after correction using the pacBioToCA pipeline)
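Because Illumina reads shorter than 75 bp are not supported, it can help to count or drop short reads before conversion. The following is a minimal Python sketch, not part of Celera Assembler: the function names `read_fastq` and `filter_short_reads` are hypothetical, and it assumes a plain four-line-per-record FASTQ file. For paired-end data, both mates of a pair would have to be filtered together so that the two files stay in step, which this sketch does not attempt.

```python
import io
from typing import Iterator, TextIO, Tuple

# Threshold taken from the platform note above; this constant is an assumption
# of the sketch, not a setting read from Celera Assembler itself.
MIN_READ_LENGTH = 75

FastqRecord = Tuple[str, str, str, str]


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield (header, sequence, separator, quality) tuples.

    Assumes exactly four lines per record and stops at end of file
    or at the first empty header line.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        sep = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not sep.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header, seq, sep, qual


def filter_short_reads(src: TextIO, dst: TextIO,
                       min_length: int = MIN_READ_LENGTH) -> Tuple[int, int]:
    """Copy reads of at least min_length bases to dst; return (kept, dropped)."""
    kept = dropped = 0
    for header, seq, sep, qual in read_fastq(src):
        if len(seq) >= min_length:
            dst.write(f"{header}\n{seq}\n{sep}\n{qual}\n")
            kept += 1
        else:
            dropped += 1
    return kept, dropped


if __name__ == "__main__":
    # Tiny in-memory demonstration: one 80-base read and one 50-base read.
    demo = io.StringIO(
        "@long\n" + "A" * 80 + "\n+\n" + "I" * 80 + "\n"
        "@short\n" + "C" * 50 + "\n+\n" + "I" * 50 + "\n"
    )
    out = io.StringIO()
    print(filter_short_reads(demo, out))
```

Real files would be opened with `open(path)` in place of the in-memory streams. Reporting the dropped count makes it easy to see how much of a library falls below the length limit.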
- Requirements for running Celera Assembler.
- List of all released versions with release notes and errata.
- Celera Assembler Terminology and Theory.
- runCA, the main program for running Celera Assembler.
- Spec files, how to configure a Celera Assembler run.
- Best Practices
- RunCA Dissection, a step-by-step explanation of what happens during a run (out of date, but still generally applicable).
- Yersinia pestis KIM D27, using 454 8 Kbp mated reads, with CA8.1 (with CA8.0)
- Yersinia pestis KIM D27, using Illumina paired-end reads, with CA8.1 (with CA8.0)
- Porphyromonas gingivalis W83, using 454 3 Kbp mated reads, with CA8.1 (with CA8.0)
- Escherichia coli K12 MG1655, using corrected PacBio reads with CA8.1
- Escherichia coli K12 MG1655, using uncorrected PacBio reads, with CA8.1 (with CA8.0)
- Homo sapiens, J. Craig Venter, using Sanger reads, with CA8
- Older examples
Celera Assembler expects input fragment data in FRG format. We provide several utilities for converting a variety of data types into this format:
- fastaToCA - converts sequence and quality values in FASTA format into FRG format.
- tracearchiveToCA - converts XML, QUAL and FASTA files from the NCBI TraceDB into FRG format.
- sffToCA - converts 454 SFF files into FRG format, optionally searching each read for 'linker' sequence indicating that the read is a pair of mated reads.
- fastqToCA - generates a FRG file that allows direct loading of Illumina FASTQ files.
- pacBioToCA - a correction pipeline for PacBio RS sequencing data. It generates high-accuracy consensus using either PacBio RS sequences alone or reads from short-read technologies. The output is a FRG file (along with FASTA and QUAL files).
CA 8.1 Release
CA 8.0 Release
CA 7.0 Release
Users of Celera Assembler are encouraged to sign up for the wgs-assembler-users mailing list. The list is intended for discussion of using Celera Assembler. We also announce new releases, new features and bug fixes there. Bug reports should still be submitted to the bug tracker.
User Group Meeting: Jan 2012
The J. Craig Venter Institute will host the CAUG 2012 Celera Assembler User Group Meeting on Thursday and Friday, 12-13 January 2012. Contact us about registration (ATGatJCVIdotORG). The format will be similar to that of CAUG 2010, held 26-27 August 2010. Thanks to all 30 participants from around the world, and to the U.S. National Institute of General Medical Sciences (NIGMS) for funding.
CA 6.1 Release
Celera Assembler 6.1 was released on April 30th, 2010. This is the first version with support for Illumina sequence data. See Releases, fastq support, release notes, the change log, errata, and test results.
The J. Craig Venter Institute will hire summer interns to work on a variety of scientific endeavors including the Celera Assembler software. Students at the graduate, undergraduate, and high school levels should apply through the JCVI Internship Program. Funding for Celera Assembler internships is provided by a grant from the National Institute of General Medical Sciences (NIGMS). The deadline for summer 2011 positions has passed, so please apply for a future term.
- Myers et al. (2000) A Whole-Genome Assembly of Drosophila. Science 287 2196-2204.
- Venter et al. (2001) The Sequence of the Human Genome. Science 291 1304-1351.
- Mural et al. (2002) A Comparison of Whole-Genome Shotgun-Derived Mouse Chromosome 16 and the Human Genome. Science 296 1661-1671.
- Holt et al. (2002) The Genome Sequence of the Malaria Mosquito Anopheles gambiae. Science 298 129-149.
- Zdobnov et al. (2002) Comparative Genome and Proteome Analysis of Anopheles gambiae and Drosophila melanogaster. Science 298 149-159.
- Fasulo et al. (2002) Efficiently detecting polymorphisms during the fragment assembly process. Bioinformatics 18 Supp(1):S294-302
- Istrail et al. (2004) Whole-Genome Shotgun Assembly and Comparison of Human Genome Assemblies. PNAS 101 1916-1921.
- Venter et al. (2004) Environmental genome shotgun sequencing of the Sargasso Sea. Science 304 66-74.
- Goldberg et al. (2006) A Sanger/pyrosequencing hybrid approach for the generation of high-quality draft assemblies of marine microbial genomes. PNAS 103(43):16057
- Rhesus Macaque Consortium (2007) Evolutionary and Biomedical Insights from the Rhesus Macaque Genome. Science 316 222-234.
- Ghedin et al. (2007) Draft Genome of the Filarial Nematode Parasite Brugia malayi. Science, September 21.
- Carlton et al. (2007) Draft Genome Sequence of the Sexually Transmitted Pathogen Trichomonas vaginalis. Science 315 207-212.
- Rusch et al. (2007) The Sorcerer II Global Ocean Sampling Expedition: Northwest Atlantic through Eastern Tropical Pacific. PLoS Biology 1821060.
- Levy et al. (2007) The Diploid Genome Sequence of an Individual Human. PLoS Biology 0050254.
- Denisov et al. (2008) Consensus Generation and Variant Detection by Celera Assembler. Bioinformatics 24(8):1035-40
- Miller et al. (2008) Aggressive Assembly of Pyrosequencing Reads with Mates. Bioinformatics 24(24):2818-2824
- Salzberg et al. (2008) Genome sequence and rapid evolution of the rice pathogen Xanthomonas oryzae. BMC Genomics.
- Zimin et al. (2009) A whole-genome assembly of the domestic cow, Bos taurus. Genome Biology 10:R42
- Miller et al. (2009) Shotgun Assembly of a Repetitive Plant Genome. Cucumber Poster
- Rausch et al. (2009) A consistency-based consensus algorithm for de novo... Bioinformatics 25(9):1118-1124
- Chapman et al. (2010) The dynamic genome of Hydra. Nature, March 14.
- Shulaev et al. (2010) The genome of woodland strawberry. Nature Genetics, December 26.
- Spanu et al. (2010) Genome expansion in powdery mildew fungi. Science, December 10.
- Lorenzi et al. (2010) New assembly of Entamoeba histolytica. PLoS Neglected Tropical Diseases, June 15.
- Miller, Koren, Sutton (2010) Assembly algorithms for next-generation sequencing data. Genomics, March 6.
- Miller et al. (2010) Bonobo genome de novo assembly generated by CABOG. Bonobo Poster ISMB, Boston
- Dalloul et al. (2010) Multi-platform next-generation sequencing of domestic turkey (Meleagris gallopavo). PLoS Biology.
- Kirkness et al. (2010) Genome sequence of the human body louse. Science, July.
- Koren, Miller, Walenz, Sutton (2010) Automated Closure Algorithm. BMC Bioinformatics, September.
- Lawniczak et al. (2010) Widespread Divergence Between Incipient Anopheles gambiae Species Revealed by Whole Genome Sequences. Science, October.
- Nelson, Weinstock et al. (2010) A Catalog of Reference Genomes from the Human Microbiome. Science, May 21.
- Inskeep, Rusch et al. (2010) Metagenomes from High-Temperature Chemotrophic Systems. PLoS ONE, March.
- O'Neal, Dzurisin et al. (2010) Population-level transcriptome sequencing of nonmodel organisms Erynnis propertius and Papilio zelicaon. BMC Genomics.
- Jones et al. (2011) Genomic insights into the physiology and ecology of the marine filamentous cyanobacterium Lyngbya majuscula. PNAS, 10.1073.
- Miller, Hayes et al. (2011) Genetic diversity and population structure of the endangered marsupial Sarcophilus harrisii (Tasmanian devil). PNAS, June.
- Wóycicki, Witkowicz et al. (2011) The Genome Sequence of the North-European Cucumber (Cucumis sativus L.). PLoS ONE, July.
- Star, Nederbragt et al. (2011) The genome sequence of Atlantic cod reveals a unique immune system. Nature, August.
- Wang, Chen et al. (2011) The draft genome of the carcinogenic human liver fluke Clonorchis sinensis. Genome Biology, October.
- Walenz, Sutton, Miller (2011) Pair classification within Illumina mate pair data. Cold Spring Harbor Genome Informatics, November 2-5, 2011.
- Gillespie et al. (2012) A Rickettsia Genome Overrun by Mobile Genetic Elements Provides Insight into the Acquisition of Genes Characteristic of an Obligate Intracellular Lifestyle. Journal of Bacteriology, January.
- Prüfer et al. (2012) The bonobo genome compared with the chimpanzee and human genomes. Nature, June 2012.
- Koren, Schatz, Walenz et al. (2012) Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nature Biotechnology, July 2012.
- Tatti et al. (2013) Draft Genome Sequences of Bordetella holmesii Strains. ASM Genome Announcements.
- The National Institute of General Medical Sciences (NIGMS), NIH; Grant 2R01-GM077117-04A1.
- The J. Craig Venter Institute (JCVI)
- The University of Maryland Center for Bioinformatics and Computational Biology (CBCB)
- The J. Craig Venter Science Foundation (JCVSF)
- The Institute for Genomic Research (TIGR)
- Celera Genomics and Applied Biosystems
- The National Institute of Allergy and Infectious Diseases (NIAID), NIH; Contract N01-AI-30071, "Microbial Genome Centers" (MSC).
- The National Institute of Allergy and Infectious Diseases (NIAID), NIH; Contract HHSN266200400038C, "Bioinformatics Resource Centers for Biodefense and Emerging/Re-emerging Infectious Diseases" (BRC).