From: Brian W. <th...@gm...> - 2014-07-25 23:31:39
Hi, Heiner-

Wow, you've got an old version. ;-) Those two options don't exist in the latest code.

'rebuild repeats' would take all the reads detected by bogart as being repetitive, and do a second unitigging using just those reads. The idea was that maybe we could collapse/separate repeats better if all the unique reads were removed. I never saw any huge gains from doing this.

'mate extension' was a similar idea. Find all the reads that are in repeats. Then, for each unitig, reconstruct it using the reads in the unitig PLUS any mated reads in the repeats. The end result was that the unitig should be extended into repeats, but only using mated reads. Similar result - kind of worked, but nothing spectacular.

They were both decent ideas (and fun to remember), but I don't think they'll help here. We all (should) know that repeats bigger than a read can't be resolved (in general). A corollary of this is that if repeats bigger than the smaller reads are resolved, then the smaller reads cannot be uniquely resolved. It just took enormously different sizes (4k pacbio and 0.1k illumina) to make this a problem.

I've been pleased with ECtools from the Schatz Lab (http://schatzlab.cshl.edu/data/ectools/). Assemble the Illumina to unitigs, use that to correct the pacbio, then assemble the pacbio. I wasn't so pleased by the effort it took to run it (this was 1/2 a year ago) and it might not scale past 1/2 Gbp. But the assemblies were quite good.

b

On Fri, Jul 25, 2014 at 12:00 PM, kuhl <ku...@mo...> wrote:
> Dear Brian,
>
> just a comment, would
>
> batRebuildRepeats = 1
> batMateExtension = 1
>
> help with this issue? I am also running long reads (~4000 bp) with short
> reads and found this to be helping with some issues I had with cgw.
> Anyway, I never could use the full memory with bogart with these
> parameters, because it crashed in step 10. I had to limit bogart to 100Gb
> RAM (on 2-3 Gbp vertebrate genomes). And then it worked.
> The result was lower N50 unitigs, but this was solved by cgw. Regarding
> misassemblies in scaffolds, I also find a lot, which are actually limiting
> the final N50 and are forcing me to do a lot of manual final polishing of
> the assemblies (splitting / rescaffolding / gap closing again etc). If I set
> "doUnitigSplitting = 1" it helps, but is there any way to speed this up,
> like doing the unitig splitting on partitions in parallel? Seems there is
> still no perfect solution for hybrid data assemblies....
>
> Heiner
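
For anyone skimming the thread, the read-length argument in the reply can be put as a rough check. This is only a sketch, not anything from bogart or cgw: the function names, the `min_anchor` parameter and the 1500 bp repeat are made up for illustration, and only the ~4000 bp and ~100 bp read sizes come from the messages above.

```python
def read_spans_repeat(read_len: int, repeat_len: int, min_anchor: int = 1) -> bool:
    """True if one read can cover the whole repeat plus at least
    min_anchor unique bases on each side (hypothetical criterion)."""
    return read_len >= repeat_len + 2 * min_anchor


def read_fits_inside_repeat(read_len: int, repeat_len: int) -> bool:
    """True if a read can lie entirely inside one repeat copy, in which
    case it matches every copy equally well and has no unique placement."""
    return read_len <= repeat_len


# Read sizes mentioned in the thread; the repeat length is an assumption.
PACBIO_LEN = 4000
ILLUMINA_LEN = 100
REPEAT_LEN = 1500

for name, length in (("pacbio", PACBIO_LEN), ("illumina", ILLUMINA_LEN)):
    print(
        name,
        "spans repeat:", read_spans_repeat(length, REPEAT_LEN, min_anchor=50),
        "can hide inside repeat:", read_fits_inside_repeat(length, REPEAT_LEN),
    )
```

With these numbers the long read spans the repeat while the short read can sit wholly inside it, which is the situation the corollary describes: the repeat gets resolved by the long reads, and the short reads inside it can no longer be placed uniquely.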