[wgs-assembler-users] Differences in computed genome size depending on whether bog or bogart was used

Hi,
sorry if this e-mail is a bit long, but it's a strange and
annoying problem.

I have both 454 and Illumina datasets for my species, and I've finished
two pure 454 assemblies with CA and one with mixed data. The genome
should be around 830 Mb, and I think that if computeCoverageStat is
very wrong in its estimate, my assembly gets screwed up. On the first
454 assembly I ran (there were some Sanger reads in it too, but not
many), it estimated the genome size at 816,110,291.72 bp, quite close
to what we think it is. I ran it with mostly default settings and with
bog. Then I ran a mixed assembly (about 24x 454 reads and 20x Illumina
reads) with bogart, and the estimated genome size was
1,213,867,868.06 bp; I got a lot of degenerates and a messed-up
assembly. The settings were mostly the same as in the first assembly,
at least with regard to error rates at the different stages. The big
difference in the 454 reads between this assembly and the first one
was that I removed all the 454 shotgun reads shorter than 300 bp,
which might have done some harm too.

We have speculated a lot about what might have caused the
misestimate. Jason suggested it might be a pile-up of Illumina reads
at the ends of 454 reads. The genome is quite plagued with short
tandem repeats (STRs, e.g. ACACACACACA), and the 454 platform can't
sequence through these, so a lot of the reads end with an STR.
Illumina can sequence through them, and the guess was that a lot of
Illumina reads were just STRs that piled up on the ends of 454
reads/unitigs, thereby causing the misestimate. I've tried to look
into it, but I can't find that this holds true. As far as I can see,
the highest coverage is in the middle of the unitigs/degenerates.
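
One way to make the "ends versus middle" check concrete is sketched
below. It is only an illustration: it assumes read placements on a
unitig are already available as (start, end) pairs, which is a
made-up input format and not something the tigStore produces
directly, and the example placements are invented. The function names
and the window size are likewise hypothetical.

```python
def depth_profile(unitig_len, placements):
    """Per-base read depth from (start, end) placements, end exclusive."""
    diff = [0] * (unitig_len + 1)
    for start, end in placements:
        # Clip placements to the unitig so overhanging reads do not crash.
        start = max(0, start)
        end = min(unitig_len, end)
        if start < end:
            diff[start] += 1
            diff[end] -= 1
    depth, running = [], 0
    for delta in diff[:unitig_len]:
        running += delta
        depth.append(running)
    return depth


def end_vs_middle(depth, window):
    """Mean depth in the two end windows versus the rest of the unitig."""
    if len(depth) <= 2 * window:
        raise ValueError("unitig too short for this window")
    ends = depth[:window] + depth[-window:]
    middle = depth[window:-window]
    return sum(ends) / len(ends), sum(middle) / len(middle)


# Invented placements on a 1000 bp unitig: an even tiling of 200 bp reads
# plus 20 short reads stacked on the last 100 bp (the suspected pile-up).
placements = [(i, i + 200) for i in range(0, 801, 100)]
placements += [(900, 1000)] * 20

depth = depth_profile(1000, placements)
ends_mean, middle_mean = end_vs_middle(depth, window=100)
print(f"ends: {ends_mean:.1f}x  middle: {middle_mean:.1f}x")
```

If the pile-up idea were right, the end windows would show clearly
higher mean depth than the middle on many unitigs, as in this
invented example; on my data it seems to be the other way around.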

Then I created some 454 reads that I had run Overlap Based Trimming
(OBT) on and saved them for use in later assemblies (as suggested by
Brian and the preprocessing page on the wiki). I ran this assembly
with bogart, because I wanted a baseline to compare later (mixed)
assemblies against. All 454 reads that survived OBT were included
here. computeCoverageStat estimated the genome size at
1,118,909,921.55 bp, quite close to the mixed assembly. I then copied
the assembly, removed the tigStore, 4-unitigger and later folders,
and reran with bog. This time the estimated genome size was
954,398,150.47 bp. This assembly is scaffolding as I write this, so
I'm not yet sure how it will turn out. Hopefully it will be pretty
good.
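
To put the four estimates side by side, here is a small Python sketch
that computes how far each one is from the expected size. The 830 Mb
figure and the estimates are the ones quoted in this e-mail; the
labels and the helper name percent_off are only shorthand made up for
this example.

```python
# Expected genome size and the computeCoverageStat estimates quoted in
# this e-mail, all in bp.
EXPECTED_BP = 830_000_000

estimates = {
    "first 454 assembly, bog": 816_110_291.72,
    "mixed 454 + Illumina, bogart": 1_213_867_868.06,
    "OBT 454 reads, bogart": 1_118_909_921.55,
    "OBT 454 reads, bog": 954_398_150.47,
}


def percent_off(estimate_bp, expected_bp=EXPECTED_BP):
    """Signed deviation of an estimate from the expected size, in percent."""
    return 100.0 * (estimate_bp - expected_bp) / expected_bp


deviations = {label: percent_off(bp) for label, bp in estimates.items()}

for label, dev in deviations.items():
    print(f"{label:30s} {estimates[label] / 1e6:8.1f} Mb  {dev:+6.1f}%")
```

The first bog run comes out a little under 2% below 830 Mb, while the
two bogart runs are roughly 46% and 35% above it and the bog rerun on
the OBT reads is about 15% above.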

I don't remember the differences between bog and bogart right now, but
is it plausible that bogart does a bad job on a (mostly) 454
assembly? I've just started an assembly with about 26x 454 reads and
52x Illumina reads, where all reads have been merTrimmed using k-mers
from about 20x coverage of overlapping Illumina reads merged with
FLASH as evidence. If there's something about the 454 reads that
confuses bogart, then the merTrimmed 454 reads and the predominance
of Illumina reads will hopefully overcome it.

I have most of the data and logs available, so any hints as to what I
could do to fix it, or where I should look to figure it out, are
welcome.

Thank you.

Ole
