From: Walenz, B. <bw...@jc...> - 2012-05-11 19:00:37
On 5/10/12 2:55 PM, "Ole Kristian Tørresen" <o.k...@bi...> wrote:
> Hi Brian.
>
> Thank you for this, good to know. Our PacBio fastq files were over
> multiple lines (SMRT-Portal 1.3... Thank you a lot PacBio!), and the
> correction pipeline ran for 17 days taking up 48 CPUs and I guess we
> can just kill it now.

Multiple lines aren't nearly as bad as Illumina's new multi-word read
names... ;-)

The paper on the correction pipeline will be appearing in Nature
Biotechnology real soon. I'll send a link once I get one. I'm pretty sure
nobody has tried correcting pacbio with 454 reads.

>
> On 10 May 2012 19:50, Walenz, Brian <bw...@jc...> wrote:
>> Hi, Ole-
>>
>> ovlHashLibrary=2 does mean to load only reads from the second library into
>> the hash table. In this case, it's the pac bio reads. The 'ref' library is
>> what fragments we search against the hash table. ovlRefLibrary=1-1
>> translates to 'starting at library 1 and ending at library 1'. Overlaps
>> will be computed between library 1 and 2, but not in the same library.
>>
>> I should point out that this isn't implemented perfectly. The overlap jobs
>> for computing overlaps within library 1 are still launched, and the hash
>> tables are still built, but no overlaps are output. The 'overlap_partition'
>> command is responsible for setting up the hash and reference ranges for each
>> overlap job, and this isn't aware of the ovlHashLibrary/ovlRefLibrary
>> options.
>>
>> We've been recently disabling OBT (and fragmentCorrection) in runCA, and
>> doing all trimming/correction outside the assembler. In your case, you can
>> run the assembler up through OBT on all your 454 reads, then dump gatekeeper
>> to build a trimmed fragment set. If you're using CVS tip, dumping as fastq
>> will work too. With the pacbio reads, this is mandatory, since the pipeline
>> will split some of the pacbio reads into multiple pieces.
>
> I saw some submissions to the CVS about this, but couldn't figure out
> exactly what it meant.
> This clears that up. I recently started an
> assembly with 454 and Illumina reads (Illumina corrected with Quake), and
> correct-frags have run for several days now.
>
> Should I run OBT on all my 454 reads, dump the trimmed reads, and use
> them in a new assembly with the error corrected Illumina reads? The
> default with the CVS tip will then be to not run correct-frags etc on
> those reads? What will be the effect of using these trimmed 454 reads
> for PacBio error correction?

If you have trimmed / corrected reads then both OBT and the correction
should be disabled:

doOBT=0 doFragmentCorrection=0

The correction process hasn't changed since the Sanger-only days. It
doesn't seem to scale easily to hundreds of millions of reads.

The algorithm: In the first pass (fragment correction) a multiple sequence
alignment is generated for each read. The alignment is formed from all
overlaps to the read. Errors are detected and noted. In the second pass
(overlap correction) these corrections are applied to change the error
rate of overlaps. The bases in the read never change.

My opinion is that correction of the bases in the reads is now good enough
that the reads should be corrected before assembly. The corrections can be
specific to the technology (homopolymer for 454, no indel for Illumina),
something that isn't done in CA and would be tough to do there.

>> The obt overlaps and ovl overlaps used for assembly aren't compatible. The
>> obt overlaps are more like blast matches (align a-b in read 1 to c-d in read
>> 2) while the ovl overlaps are ... overlaps; see
>> http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=Overlaps
>> . Since trimming will change the length of the read, it's impossible to
>> translate the overlaps on untrimmed reads to overlaps on trimmed reads.
>
> I hadn't seen that page. It's a useful reference (as are other
> "hidden" pages at that wiki.)

Thought we had a (one) link to it somewhere.
*sigh*

b

>
> Ole
>
>>
>> b
>>
>>
>>
>> On 5/10/12 4:53 AM, "Ole Kristian Tørresen" <o.k...@bi...> wrote:
>>
>>> Hi,
>>> we have started doing some sequencing on PacBio, and correcting the
>>> reads with the PacBioToCA pipeline. The genome is about 800 Mb, and we're
>>> trying to correct the PacBio reads from two SMRTcells with about 20x
>>> in 454 reads. This translates to 130,389 PacBio reads with 126 Mb
>>> sequence, and 47M 454 reads and 17.6 Gb sequence.
>>>
>>> We see that 0-overlaptrim-overlap uses quite a bit of time, and I fear
>>> that 1-overlapper will use a long time too. Is it possible to compute
>>> the overlaps between the 454 reads ahead of time, and use the overlaps
>>> from that store to only compute the overlaps between 454 reads and
>>> PacBio reads? Since I guess most of the time is spent computing the
>>> overlaps between 454 reads. This could be useful for assembly in
>>> general too, sometimes we only input some data to have a faster
>>> assembly, while later on we input more.
>>>
>>> When I look at the command that's used to run CA in the error
>>> correction step: runCA -s pacbio.spec -p asm -d temppacbio
>>> ovlHashLibrary=2 ovlRefLibrary=1-1 obtHashLibrary=1-1
>>> obtRefLibrary=1-1 sge=" -sync y" sgePropagateHold=corAsm
>>> stopAfter=overlapper, does it actually do what I ask for? It
>>> only loads hash fragments from library 2, but it loads all libraries
>>> in the other *Library options (1-1 = 0)? Could anyone explain to me
>>> what that really means?
>>>
>>> Sincerely,
>>> Ole
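
The library options discussed in this thread can be illustrated with a small Python sketch. It is not code from Celera Assembler: the function names `parse_library_range` and `overlap_library_pairs` are hypothetical, and the sketch assumes only what the thread states, namely that a value such as `1-1` means "starting at library 1 and ending at library 1", that a single number such as `2` selects just that library, and that overlaps are output only between reference-library and hash-library reads.

```python
def parse_library_range(value):
    """Parse an option value such as '2' or '1-1' into (first, last).

    Assumption: a bare number N is treated the same as 'N-N'.
    """
    first_text, _, last_text = value.partition("-")
    first = int(first_text)
    last = int(last_text) if last_text else first
    if first > last:
        raise ValueError("library range must not be descending: " + value)
    return first, last


def overlap_library_pairs(hash_option, ref_option):
    """List the (ref_library, hash_library) pairs that would produce overlaps.

    This only models the selection rule described in the thread; it does not
    model how overlap jobs are partitioned or launched.
    """
    hash_first, hash_last = parse_library_range(hash_option)
    ref_first, ref_last = parse_library_range(ref_option)
    return [
        (ref_lib, hash_lib)
        for ref_lib in range(ref_first, ref_last + 1)
        for hash_lib in range(hash_first, hash_last + 1)
    ]


if __name__ == "__main__":
    # ovlHashLibrary=2 ovlRefLibrary=1-1: 454 reads (library 1) searched
    # against PacBio reads (library 2) only.
    print(overlap_library_pairs("2", "1-1"))  # prints [(1, 2)]
```

With `obtHashLibrary=1-1 obtRefLibrary=1-1` the same sketch yields only the pair (1, 1), which matches the description of overlap-based trimming being restricted to the 454 library.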