Re: [wgs-assembler-users] k-mer frequency threshold of OVL and OBT


Hi-

The threshold is plausible, but on the high side.

In 0-mercounts, you can generate a histogram of mer counts using: meryl -Dh -s FILE, where FILE is the prefix of FILE.mcdat and FILE.mcidx.  From this histogram, you want to pick a value that captures most of the hump at your expected coverage, but isn't so high that repeats will dominate.  The assembler tries to do this using the last two columns in the histogram output (via a method that I struggle to explain with a white board and hand waving, so won't attempt to do it here).  The method does break down if there is no hump at expected coverage.

Columns:

frequency
count
fraction distinct
fraction total
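
For exploring the histogram by hand, here is a minimal Python sketch. It is not the method the assembler uses (that method is not described here). It assumes each histogram row holds the four whitespace-separated columns listed above, and that the "fraction distinct" column is cumulative; both are assumptions to check against your own output. The function names and the 0.999 cutoff are hypothetical, chosen only for illustration.

```python
def parse_histogram(text):
    """Parse rows of: frequency, count, fraction distinct, fraction total.

    Lines that do not have exactly four numeric fields are skipped.
    """
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue  # blank line or something that is not a data row
        try:
            row = (int(fields[0]), int(fields[1]),
                   float(fields[2]), float(fields[3]))
        except ValueError:
            continue  # non-numeric row, e.g. a header
        rows.append(row)
    return rows


def pick_threshold(rows, min_fraction_distinct=0.999):
    """Return the smallest frequency whose fraction-distinct value reaches
    the cutoff (assumes that column is cumulative). The cutoff is an
    illustrative assumption, not a value taken from the assembler.
    """
    for freq, _count, frac_distinct, _frac_total in rows:
        if frac_distinct >= min_fraction_distinct:
            return freq
    # Cutoff never reached: fall back to the largest frequency seen.
    return rows[-1][0] if rows else None
```

To use it, save the meryl histogram output to a file, read it in as text, and pass it through parse_histogram and then pick_threshold. Compare the result against where the coverage hump sits before trusting it.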

An alternate way to pick the threshold is to argue that if you have 8x coverage, setting the threshold to 800 (8 x 100) will capture overlaps for a 100-copy repeat.
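
The same back-of-the-envelope argument, written as a short Python sketch for other coverage and repeat values. The function name is hypothetical, and the estimate ignores sequencing error and uneven coverage.

```python
def repeat_threshold(coverage, repeat_copies):
    """Rough mer threshold: a mer from a repeat with `repeat_copies` copies
    is expected about coverage * repeat_copies times in the reads.
    """
    return coverage * repeat_copies


# The example from this message: 8x coverage, 100-copy repeat.
print(repeat_threshold(8, 100))  # 800
```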

Hope this helped.

If you want to send the histogram output, I can take a look at it.

b

On 3/14/13 9:26 AM, "wcyer" <wcy...@16...> wrote:

Hi, everybody,
I am using CABOG to assemble a genome of about 2.5 Gb. Before running the real data, I did some simulated assemblies (about 8X data), and the OBT and OVL thresholds ranged from 170 to 460. But when I ran CABOG on the real data, the threshold was 5633:
Reset OBT mer threshold from auto to 5633.
Reset OVL mer threshold from auto to 5633.

I see the default setting of the threshold is 'auto'. How does CABOG determine this threshold? Is this number normal?

Thanks in advance.

Gorliver