Jason Miller - 2012-07-18

We have committed a new version of computeCoverageStat. This probably does not solve the problem, but it should be a good first step. First, it will use the runCA parameter utgGenomeSize, if set; the code previously in CVS seemed to ignore that parameter. Second, it will compute the unitig N50 and base its genome size calculation on just those unitigs that are N50 or longer. This set of large unitigs is less likely to contain high-coverage repeats than the full set of unitigs.
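
To make the new rule concrete, here is a minimal sketch (in Python, not the committed C code) of how a genome size can be estimated from the unitigs at or above the N50 length. The function names and the particular estimator, total fragments divided by the fragment arrival rate measured on the large unitigs, are assumptions for illustration, not a transcript of computeCoverageStat.

    def unitig_n50(lengths):
        # N50: the length L such that unitigs of length >= L account for
        # at least half of the total unitig span.
        total = sum(lengths)
        running = 0
        for length in sorted(lengths, reverse=True):
            running += length
            if 2 * running >= total:
                return length
        return 0

    def estimate_genome_size(unitigs):
        # unitigs: list of (length_in_bases, fragment_count) pairs.
        n50 = unitig_n50([length for length, _ in unitigs])

        # Restrict the arrival-rate measurement to unitigs at or above N50;
        # these are less likely to be collapsed high-coverage repeats.
        big = [(length, frags) for length, frags in unitigs if length >= n50]
        big_span = sum(length for length, _ in big)
        big_frags = sum(frags for _, frags in big)

        # Fragment arrival rate (fragments per base) on the large unitigs,
        # then genome size = total fragments spread at that rate.
        arrival_rate = big_frags / float(big_span)
        total_frags = sum(frags for _, frags in unitigs)
        return int(total_frags / arrival_rate)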

The existing code attempted to use only large unitigs: it explicitly required that at least half of the total unitig span be in unitigs longer than 10 Kbp. That threshold was probably appropriate for Sanger sequencing but not for NGS. When the threshold was not met, the code abandoned the large-unitig formula and fell back on the estimate based on all reads and all unitigs. The new N50 rule is an attempt to provide some middle ground.
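
For comparison, the old decision rule described above amounts to something like the following sketch (again illustrative, not the CVS code), with the 10 Kbp cutoff as the fixed threshold:

    def pick_unitigs_old_rule(unitigs, threshold=10000):
        # unitigs: list of (length_in_bases, fragment_count) pairs.
        total_span = sum(length for length, _ in unitigs)
        big = [(length, frags) for length, frags in unitigs if length > threshold]
        big_span = sum(length for length, _ in big)

        # Use the large unitigs only when they hold at least half of the
        # total span; otherwise fall back on all unitigs.
        if 2 * big_span >= total_span:
            return big
        return unitigs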

Clearly, more work is required. The N50 formula seems to underestimate genome size with long reads and overestimate it with short reads. Until this is improved, users should check the computed genome size in their 5-consensus-coverage-stat directory and consider setting it explicitly with the runCA utgGenomeSize parameter.
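
For example, the override can be given in the runCA spec file as a name=value setting; the value below is illustrative only, and should be replaced with your own genome size estimate in bases:

    # spec file entry; replace the value with your own estimate in bases
    utgGenomeSize = 4600000

The same name=value form can also be passed on the runCA command line.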