From: Ole K. T. <o.k...@bi...> - 2012-07-12 18:07:50
Hi,

Sorry if this e-mail is a bit long, but it's a strange and annoying problem. I have both 454 and Illumina datasets for my species, and I've finished two pure 454 assemblies with CA and one with mixed data. The genome should be around 830 Mb, and I think that if computeCoverageStat is very wrong in its estimate, my assembly gets screwed up.

On the first 454 assembly I ran (there were some Sanger reads in it too, but not many), it estimated the genome size at 816,110,291.72 bp, quite close to what we think it is. I ran it with mostly default settings and with bog.

Then I ran a mixed assembly (about 24x 454 reads and 20x Illumina reads) with bogart, and the estimated genome size was 1,213,867,868.06 bp; I got a lot of degenerates and a messed-up assembly. The settings were mostly the same as in the first assembly, at least with regard to error rates at the different stages. The big difference in the 454 reads between this assembly and the first was that I removed all 454 shotgun reads shorter than 300 bp, which might have done some harm too.

We have speculated a lot about what might have caused the misestimate. Jason suggested it might be a pile-up of Illumina reads at the ends of 454 reads. The genome is quite plagued with short tandem repeats (STRs, e.g. ACACACACACA), and the 454 platform can't sequence through them, so a lot of the reads end in an STR. Illumina can sequence through them, and the guess was that a lot of Illumina reads consist only of STRs and pile up on the ends of 454 reads/unitigs, thereby causing the misestimate. I've tried to look into it, but I can't find that this holds true. The highest coverage, as far as I can see, is in the middle of the unitigs/degenerates.

I then ran Overlap Based Trimming (OBT) on the 454 reads and saved the trimmed reads for use in later assemblies (as suggested by Brian and the preprocessing page on the wiki). I ran this assembly with bogart, because I wanted to have a baseline for comparison with later (mixed) assemblies.
All 454 reads that survived OBT were included here. computeCoverageStat estimated the genome size at 1,118,909,921.55 bp, quite close to the estimate for the mixed assembly. I then copied the assembly, removed the tigStore, 4-unitigger, and later folders, and reran with bog. This time the estimated genome size was 954,398,150.47 bp. This assembly is scaffolding as I write this, so I'm not yet sure how it will turn out. Hopefully it will be pretty good. I don't remember the differences between bog and bogart right now, but is it plausible that bogart does a poor job on a (mostly) 454 assembly?

I've just started an assembly with about 26x 454 reads and 52x Illumina reads, where all reads have been merTrimmed using k-mers from about 20x coverage of overlapping Illumina reads combined with FLASH as evidence. If there's something in the 454 reads that confused bogart, then the merTrimmed 454 reads and the predominance of Illumina reads will hopefully overcome it.

I have most of the data and logs available, so any hints as to what I could do to fix this, or where I should look to figure it out, are welcome.

Thank you.

Ole
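
For a quick sense of scale, the four estimates quoted in this e-mail can be compared with the expected genome size. This is only a rough sketch: the 830 Mb figure is the approximate expectation mentioned above, and the run labels and the percent_deviation helper are names made up here for illustration, not anything produced by CA.

```python
# Rough comparison of the computeCoverageStat genome size estimates quoted
# in this e-mail against the expected genome size.
# Assumption: 830 Mb is only the approximate expected size.
EXPECTED_BP = 830_000_000

# Labels are informal descriptions of the runs, not CA output.
estimates_bp = {
    "454 (+ some Sanger), bog": 816_110_291.72,
    "mixed 454 + Illumina, bogart": 1_213_867_868.06,
    "OBT 454, bogart": 1_118_909_921.55,
    "OBT 454, bog": 954_398_150.47,
}


def percent_deviation(estimate_bp, expected_bp=EXPECTED_BP):
    """Signed deviation of an estimate from the expected size, in percent."""
    return 100.0 * (estimate_bp - expected_bp) / expected_bp


deviations = {label: percent_deviation(bp) for label, bp in estimates_bp.items()}

for label, dev in deviations.items():
    print(f"{label:30s} {estimates_bp[label]:>18,.2f} bp  {dev:+6.1f} %")
```

Under that assumption, the two bogart runs come out roughly 35 % and 46 % above the expected size, the bog rerun about 15 % above, and the first bog run about 2 % below.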