I have an assembly in Gap4 (from the latest Staden version, 1.7) consisting of 10,000 fosmid reads and want to complement it with a large number of 454 reads. I have read that Staden can read this new file format, but also that phrap isn't good at assembling such reads. My question is: can I, without too many problems, process the 454 reads in Pregap4 using e.g. RepeatMasker and Phrap, and then append them to the existing Gap4 database?
If there are major difficulties assembling them into an existing Gap4 database, or if other non-Staden software is needed, I need to know this before I order the sequences.
Many thanks in advance!
We acquired a 454 sequencer about a year ago and were faced with the same problem, as most of our projects combine this data with conventional ABI reads. I have not tried assembling the 454 reads directly in Staden because I’m told most assemblers don’t deal with this data very well. The read lengths are short and the quality (i.e. Phred-type values) is inherently low for individual reads. It is only through the high level of redundancy that you get with the 454 that the reliability of the data increases.
So the route we took was to use 454’s assembler first and then incorporate the resulting contigs into our Staden projects. 454’s assembler outputs the contig data (the fna file) and the quality value data (the qual file) separately, so I had our bioinformatics guys put together a simple Perl script that merges these two files and writes out the individual contig files in Staden’s experiment file format. They were also kind enough to have it generate a pregap.passed file. I’m not a programmer, but I understand this wasn’t a difficult task (I talked to them after lunch and had the script the next morning). The end result is that the sequences go into Staden seamlessly and the quality values are also there for editing purposes.
The only disadvantage of the 454 data is that there are no chromatograms to view while editing. The assembler does generate an ace file which shows how the reads have been assembled, but I have yet to figure out a way to easily incorporate this into the editing process.
If you’re interested in the Perl script I’d be happy to pass it along. And if you have any other questions about 454 data in general I’ll do my best to answer them.
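For anyone who wants to try the merge step before getting hold of the script, here is a minimal sketch in Python. It is not the Perl script mentioned above, just an illustration of the idea: read the fna and qual files, pair them up by contig name, and write one experiment file per contig plus a pregap.passed list. The record layout (ID, SQ, `//`, AV) is my recollection of the experiment file format and should be checked against the Staden manual; the function names, the `.exp` suffix and putting all AV values on one line are assumptions made for the example.

```python
import os


def read_records(lines):
    """Yield (name, body_lines) from FASTA-style text (works for .fna and .qual)."""
    name, body = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, body
            fields = line[1:].split()
            name, body = (fields[0] if fields else ""), []
        elif name is not None:
            body.append(line)
    if name is not None:
        yield name, body


def merge_fna_qual(fna_lines, qual_lines):
    """Return {name: (sequence, [quality values])}, checking that lengths agree."""
    seqs = {n: "".join(b) for n, b in read_records(fna_lines)}
    quals = {n: [int(q) for q in " ".join(b).split()]
             for n, b in read_records(qual_lines)}
    merged = {}
    for name, seq in seqs.items():
        q = quals.get(name)
        if q is None or len(q) != len(seq):
            raise ValueError("sequence/quality mismatch for " + name)
        merged[name] = (seq, q)
    return merged


def to_experiment(name, seq, quals, width=60):
    """Format one contig as experiment-file text (layout is an assumption)."""
    out = ["ID   " + name, "SQ"]
    for i in range(0, len(seq), width):
        chunk = seq[i:i + width]
        blocks = [chunk[j:j + 10] for j in range(0, len(chunk), 10)]
        out.append("     " + " ".join(blocks))
    out.append("//")
    # Assumption: all accuracy values on a single AV line.
    out.append("AV   " + " ".join(str(q) for q in quals))
    return "\n".join(out) + "\n"


def write_experiment_files(merged, outdir):
    """Write one file per contig and a pregap.passed file listing them."""
    names = []
    for name, (seq, quals) in merged.items():
        fname = name + ".exp"  # hypothetical naming convention
        with open(os.path.join(outdir, fname), "w") as fh:
            fh.write(to_experiment(name, seq, quals))
        names.append(fname)
    with open(os.path.join(outdir, "pregap.passed"), "w") as fh:
        fh.write("\n".join(names) + "\n")
    return names
```

Typical use would be to open the two assembler output files, pass the file handles to `merge_fna_qual`, and hand the result to `write_experiment_files` together with an existing output directory. The length check is there because a contig whose sequence and quality values do not line up is better caught at this stage than after it is in the database.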
Genomics Core Facility
National Microbiology Laboratory
Public Health Agency of Canada
Just noticed a typo in my email address. It should read email@example.com
You can have a look at:
It is very new, and may still need some tweaking.
Prof Fourie Joubert
Bioinformatics and Computational Biology Unit
Department of Biochemistry
University of Pretoria
I'd recommend investigating the CAF file format and caf2gap as a way to go. CAF has the big advantage of one file for the entire assembly instead of millions of experiment files.
I initially wrote an ace2caf program, but it was a bit crude and buggy. Since then it seems to have been reimplemented in a much better manner. See http://genome.imb-jena.de/software/roche454ace2caf/ for code and a brief tutorial.
PS. You definitely should be using 454's own assembler, as it can make use of information other assemblers will not. For example, in the simplest case of a homozygous/clonal sequence, you know that TCT could plausibly be misread as TAT or TGT through ordinary sequencing errors, but reading TTT would imply 4 entire flows being removed, which realistically only happens with a real DNA change rather than a base-calling error. Assemblers and SNP-calling programs should take note of such things, i.e. be "flowgram aware", but as far as I know only 454's own assembler takes them into consideration. It's also why it's so amazingly fast, as working in flow-space avoids the need for most dynamic programming.
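To make the flow-space argument concrete, here is a minimal Python sketch that converts a short sequence into an idealised flowgram. It assumes a cyclic nucleotide flow order of T, A, C, G (my understanding of the 454 order, so treat it as an assumption) and noise-free signals; `seq_to_flowgram` is a hypothetical helper name, not part of any 454 or Staden tool.

```python
def seq_to_flowgram(seq, flow_order="TACG"):
    """Return [(flowed base, homopolymer length), ...] for an ideal, noise-free read.

    Flows are cycled in flow_order until every base of seq has been consumed;
    a flow that incorporates nothing is recorded with length 0.
    """
    seq = seq.upper()
    unknown = set(seq) - set(flow_order)
    if unknown:
        raise ValueError("bases not in flow order: " + "".join(sorted(unknown)))

    flows = []
    pos = 0       # next unread base in seq
    flow_no = 0   # index of the current flow
    while pos < len(seq):
        base = flow_order[flow_no % len(flow_order)]
        run = 0
        # A single flow incorporates the whole homopolymer run at once.
        while pos < len(seq) and seq[pos] == base:
            run += 1
            pos += 1
        flows.append((base, run))
        flow_no += 1
    return flows


if __name__ == "__main__":
    for s in ("TCT", "TAT", "TGT", "TTT"):
        flowgram = seq_to_flowgram(s)
        print(s, len(flowgram), flowgram)
```

Under these assumptions TCT, TAT and TGT each span five flows and differ only in which of the middle flows carries signal, whereas TTT collapses into a single T flow of length 3. That is the four-flow difference described above, and it is why a flowgram-aware program can treat TCT to TTT as far less likely to be a base-calling error than TCT to TAT.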