From: Walenz, B. <bw...@jc...> - 2012-11-20 12:05:59
Hi, Jens-

First, I’d suggest upgrading to the cvs version (http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=Check_out_and_Compile). It’s a bit more work to install, but you’ll get a ton of fixes and optimizations. Plus, the answers below are for this version.

Answering your questions out of order:

The only requirement for a ‘library’ of reads is that they form a normal insert size distribution. It’s probably better to treat each 454 run as its own library unless you know for sure they came from the same library construction. You can mix paired- and single-ended reads in the same library.

Note that the fastqToCA usage changed slightly from 7.0 to the cvs version. Use ‘-reads X.fastq’ to load SE reads and ‘-mates A1.fastq,B1.fastq’ to load PE reads. Note that ‘-mates Y.fastq’, with a single file, expects the mate pair reads to be interleaved.

The maximum read length is set at compile time, in file AS_global.h, variable AS_READ_MAX_NORMAL_LEN_BITS. The default is 11 (=2047 bases). For PacBio, 13-15 (=8-32kbp) has been used. 16 has been reported to not work.

What happens to reads longer than the maximum depends on the fastqToCA technology. For tech ‘illumina’, the reads are truncated to the maximum ‘packed’ size (160bp by default). The ‘packed’ format is a slightly more efficient storage format designed for lots of short reads. The other technologies will also truncate reads, but to the NORMAL_LEN_BITS size.

‘gatekeeper -dumpinfo X.gkpStore’ will generate a table of the reads loaded, number mated, number deleted, and total bases.

The ‘not a sequence start line’ errors, I think, were caused by the fastq reader only partially reading a sequence line. On the next input, it was expecting to find “@name” but found bases/qvs instead. In any case, it’s fixed in the cvs version.

Just curious - are your reads longer than 2kbp real? I’ve seen these in the past, and they were mostly garbage.
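As a rough illustration of the length limit described above, here is a minimal Python sketch. It assumes the maximum read length is simply the largest value that fits in AS_READ_MAX_NORMAL_LEN_BITS bits (2^bits - 1), which matches the 11 -> 2047 default quoted above. The helper names max_read_length and count_long_reads are hypothetical and not part of the assembler, and the counter assumes plain four-line FASTQ records with no line wrapping.

```python
def max_read_length(len_bits):
    """Largest length representable in len_bits bits (assumed to be 2**bits - 1)."""
    return (1 << len_bits) - 1


def count_long_reads(fastq_lines, max_len):
    """Return (total_reads, reads_longer_than_max_len).

    Assumes unwrapped FASTQ: every record is exactly four lines
    (@name, bases, +, qualities), so the bases are on line 1 of each group.
    """
    total = 0
    too_long = 0
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:  # the sequence line of each four-line record
            total += 1
            if len(line.strip()) > max_len:
                too_long += 1
    return total, too_long


if __name__ == "__main__":
    for bits in (11, 13, 15):
        print(bits, max_read_length(bits))  # 11 -> 2047, 13 -> 8191, 15 -> 32767
```

To check a real file, something like `with open("X.fastq") as fh: print(count_long_reads(fh, max_read_length(11)))` would report how many reads exceed the default limit; X.fastq is a placeholder name here.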
b

On 11/20/12 5:33 AM, "Jens Hooge" <jen...@go...> wrote:

Hi,

I'm relatively new to NGS and its tools, but at the moment I'm trying to run an assembly of about 70 single- and paired-end 454 reads in FASTQ format, using the wgs-7.0-assembler. The version I've been using is the one from http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=Main_Page

I have converted my FASTQ files to FRG files using CABOG's fastqToCA routine, using a different library name for each FASTQ file. When I run the actual assembly with runCA, though, I get an error message in melonAssembly.gkpStore.err:

GKP finished with 11339450 alerts or errors:
11338139 # ILL Error: not a sequence start line.
1292 # ILL Error: not a quality start line.
19 # LIB Alert: suspicious mean and standard deviation; reset stddev to 0.10 * mean.

To me this looked like a problem with the format of my FASTQ files, so I ran a script to validate the format consistency of the files, which reported no errors. Some of my reads are longer than 2047 bp, and I have the feeling that the bug fix listed at http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=Version_7.0_Release_Notes under Bug fixes is not yet included in the version I'm using. Quote:

"Gatekeeper: Numerous problems with reads longer than the maximum allowed (2047bp) and reads of very specific lengths were discovered and fixed. All of these resulted in gatekeeper crashing."

Even though gatekeeper doesn't crash, I would expect about 25 million reads to be processed by CABOG. However, while running the assembly I get a stdout message saying "numFrags = 14499910". To me this looks like not all reads are being used for the assembly. If I add the number of ILL Errors to that, it comes suspiciously close to my expected number of reads, which makes me think that CABOG just gets rid of the reads that are longer than the maximum allowed length of 2047 bp.

My questions would be:

What happens with reads that are longer than the maximum allowed length? Are those reads ignored or clipped to the maximum read length?
Is there a way to adjust the maximum read length, to make CABOG use those reads in the assembly as well?
Does every FASTQ file have to be added to a different gatekeeper library, or is it enough to put single-ended and paired-ended reads into their respective libraries?

I would be very grateful if anyone could help me out.

Ciao,
Jens
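The read-count arithmetic in the question can be reproduced with a few lines of Python. This is only a sketch of the sum, using the numbers quoted from the log and stdout above. The variable names are illustrative, and it does not show that each ILL error corresponds to exactly one dropped read.

```python
# Counts quoted in the message above.
num_frags = 14_499_910         # "numFrags" printed during the assembly
seq_start_errors = 11_338_139  # "not a sequence start line"
qual_start_errors = 1_292      # "not a quality start line"
lib_alerts = 19                # "suspicious mean and standard deviation"

# The three alert/error counts add up to the total gatekeeper reported.
total_alerts = seq_start_errors + qual_start_errors + lib_alerts
print(total_alerts)  # 11339450

# Loaded fragments plus the two ILL error counts, as described in the question.
frags_plus_ill_errors = num_frags + seq_start_errors + qual_start_errors
print(frags_plus_ill_errors)  # 25839341
```

The sum of about 25.8 million is in line with the "about 25 million" reads expected, which is what prompted the question.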