Jason Miller - 2012-07-18

We have committed a new version of computeCoverageStat. This probably does not solve the problem, but it should be a good first step. First, it will use the runCA parameter utgGenomeSize, if set; the code previously in CVS seemed to ignore that parameter. Second, it will compute the unitig N50 and base its genome size calculation on just those unitigs that are N50 or longer. This set of large unitigs is less likely to contain high-coverage repeats than the full set of unitigs.
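
To make the new rule concrete, here is a minimal sketch (in Python, not the committed C code) of how a genome size can be estimated from the unitigs at or above the N50 length. The function names and the particular estimator, total fragments divided by the fragment arrival rate measured on the large unitigs, are assumptions for illustration, not a transcript of computeCoverageStat.

    def unitig_n50(lengths):
        # N50: the length L such that unitigs of length >= L account for
        # at least half of the total unitig span.
        total = sum(lengths)
        running = 0
        for length in sorted(lengths, reverse=True):
            running += length
            if 2 * running >= total:
                return length
        return 0

    def estimate_genome_size(unitigs):
        # unitigs: list of (length_in_bases, fragment_count) pairs.
        n50 = unitig_n50([length for length, _ in unitigs])

        # Restrict the arrival-rate measurement to unitigs at or above N50;
        # these are less likely to be collapsed high-coverage repeats.
        big = [(length, frags) for length, frags in unitigs if length >= n50]
        big_span = sum(length for length, _ in big)
        big_frags = sum(frags for _, frags in big)

        # Fragment arrival rate (fragments per base) on the large unitigs,
        # then genome size = total fragments spread at that rate.
        arrival_rate = big_frags / float(big_span)
        total_frags = sum(frags for _, frags in unitigs)
        return int(total_frags / arrival_rate)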

The existing code attempted to use only large unitigs: it explicitly required that at least half of the total unitig span be in unitigs longer than 10 Kbp. That threshold was probably appropriate for Sanger sequencing but not for NGS. When the threshold was not met, the code abandoned the large-unitig formula and fell back on the estimate based on all reads and all unitigs. The new N50 rule is an attempt to provide some middle ground.
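
For comparison, the old decision rule described above amounts to something like the following sketch (again illustrative, not the CVS code), with the 10 Kbp cutoff as the fixed threshold:

    def pick_unitigs_old_rule(unitigs, threshold=10000):
        # unitigs: list of (length_in_bases, fragment_count) pairs.
        total_span = sum(length for length, _ in unitigs)
        big = [(length, frags) for length, frags in unitigs if length > threshold]
        big_span = sum(length for length, _ in big)

        # Use the large unitigs only when they hold at least half of the
        # total span; otherwise fall back on all unitigs.
        if 2 * big_span >= total_span:
            return big
        return unitigs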

Clearly, more work is required. The N50 formula seems to underestimate genome size with long reads and overestimate it with short reads. Until this is improved, users should check the computed genome size in their 5-consensus-coverage-stat directory and consider setting it explicitly with the runCA utgGenomeSize parameter.
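
For example, the override can be given in the runCA spec file as a name=value setting; the value below is illustrative only, and should be replaced with your own genome size estimate in bases:

    # spec file entry; replace the value with your own estimate in bases
    utgGenomeSize = 4600000

The same name=value form can also be passed on the runCA command line.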