Re: [wgs-assembler-users] k-mer frequency threshold of OVL and OBT

Hi Brian,
I assembled the mers with frequency > 1600 as you suggested. These sequences are TEs and rRNA, and most of them are highly repetitive.
The auto-determined threshold of 5633 would represent 72% of the total mers. Does that mean 72% of the genome is expected to be assembled?
Should I set a large threshold to capture most of the genome, or a smaller threshold to capture most of the non-repeat sequence of the genome?

Kind regards,
Gorliver.

At 2013-03-15 01:40:00,"Walenz, Brian" <bw...@jc...> wrote:
[seems the reply from Gorliver never made it to the list]

The histogram (plot attached) is showing no hump at expected coverage.  This could be from high error in the reads, polymorphism in the sample, or low coverage causing noise and real sequence to overlap in the histogram.

I’d go with a threshold of 1000, just because it was the historical default.

The histogram is showing signs of lots of repeats.  A threshold of 1519 is representing 99.9% of the distinct kmers, but only 59% of all kmers.  Said another way, 0.1% of the kmers are representing 40% of the bases.  You might want to run this through any of the greedy kmer assemblers to see what this is.
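To make the distinct-versus-total distinction concrete, here is a minimal sketch of how the two fractions can be computed from (frequency, count) pairs. The function name, the inclusive `<=` comparison, and the toy numbers are assumptions for illustration; they are not taken from the attached histogram or from the CA source.

```python
def cumulative_fractions(histogram, threshold):
    """Return (fraction_distinct, fraction_total) for kmers at or below threshold.

    histogram: list of (frequency, count) pairs, where count is the number of
    distinct kmers that occur exactly `frequency` times.
    Assumption: the threshold is treated as inclusive (frequency <= threshold).
    """
    distinct_all = sum(count for _, count in histogram)
    total_all = sum(freq * count for freq, count in histogram)
    distinct_kept = sum(count for freq, count in histogram if freq <= threshold)
    total_kept = sum(freq * count for freq, count in histogram if freq <= threshold)
    return distinct_kept / distinct_all, total_kept / total_all


# Made-up toy data: one kmer in a thousand is highly repeated.
toy = [(1, 900), (10, 99), (1000, 1)]
frac_distinct, frac_total = cumulative_fractions(toy, 10)
print(f"distinct: {frac_distinct:.3f}  total: {frac_total:.3f}")
```

In the toy data, a threshold of 10 keeps 99.9% of the distinct kmers but only about 65% of all kmer occurrences, because the single kmer seen 1000 times accounts for the rest.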

CA has one:

src/AS_MER/gkrpt.pl 200 < mers.fasta

where 200 is the minimum size to report.  mers.fasta can be any of the fasta files in 0-mercounts, or dumped: “meryl -Dt -n CUTOFF -s FILE” will report all kmers with count >= CUTOFF.  For this task, a cutoff of 1519 (from above) or whatever you use for overlaps is appropriate.  You’re just trying to see what sequence will not have overlaps because it is screened by the kmer masking.

b

On 3/14/13 9:45 AM, "Brian Walenz" <bw...@jc...> wrote:

Hi-

The threshold is plausible, but on the high side.

In 0-mercounts, you can generate a histogram of mer counts using: meryl -Dh -s FILE, where FILE is the prefix of FILE.mcdat and FILE.mcidx.  From this histogram, you want to pick a value that captures most of the hump at your expected coverage, but isn’t so high that repeats will dominate.  The assembler tries to do this using the last two columns in the histogram output (via a method that I struggle to explain with a white board and hand waving, so won’t attempt to do it here).  The method does break down if there is no hump at expected coverage.

Columns:

frequency
count
fraction distinct
fraction total
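As a rough illustration of reading those four columns, the sketch below parses histogram text and reports the smallest frequency at which "fraction distinct" reaches a chosen target. It assumes whitespace-separated columns in the order listed and that the last two columns are cumulative; the function names and sample rows are hypothetical, and this is a simple lookup, not the method the assembler uses for 'auto'.

```python
def parse_histogram(text):
    """Parse lines of: frequency count fraction_distinct fraction_total.

    Assumption: whitespace-separated columns; lines without exactly four
    fields are skipped.
    """
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue
        rows.append((int(fields[0]), int(fields[1]),
                     float(fields[2]), float(fields[3])))
    return rows


def smallest_threshold(rows, min_fraction_distinct):
    """Return (frequency, fraction_total) for the first row whose cumulative
    fraction of distinct kmers reaches the target, or None if none does."""
    for freq, _count, frac_distinct, frac_total in rows:
        if frac_distinct >= min_fraction_distinct:
            return freq, frac_total
    return None


# Made-up sample rows, not real meryl output.
sample = """\
1 900 0.9000 0.3114
10 99 0.9990 0.6540
1000 1 1.0000 1.0000
"""
rows = parse_histogram(sample)
result = smallest_threshold(rows, 0.999)
print(result)
```

Comparing the returned frequency with the "fraction total" next to it shows how much of the data a candidate threshold would leave unmasked.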

An alternate way to pick the threshold is to argue that if you have 8x coverage, setting the threshold to 800 will capture overlaps for a 100 copy repeat.
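That argument is just a multiplication: a kmer in a repeat with N copies is expected to be seen about N times the read coverage. A small sketch, with hypothetical function names and integer coverage assumed:

```python
def threshold_for_repeat(coverage, copy_number):
    """Mer threshold needed to keep kmers from a repeat with `copy_number`
    copies, assuming each copy is sampled at `coverage`."""
    return coverage * copy_number


def max_copy_number(coverage, threshold):
    """Largest repeat copy number whose kmers still fall at or under
    `threshold` at the given coverage (rounded down)."""
    return threshold // coverage


print(threshold_for_repeat(8, 100))  # 8x coverage, 100-copy repeat
print(max_copy_number(8, 1000))      # what a threshold of 1000 allows at 8x
```

This ignores sequencing error and uneven coverage, so it is only a back-of-the-envelope guide.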

Hope this helped.

If you want to send the histogram output, I can take a look at it.

b

On 3/14/13 9:26 AM, "wcyer" <wcy...@16...> wrote:

Hi everybody,
I am using CABOG to assemble a genome of about 2.5 Gb. Before running the real data, I did some simulated assemblies (about 8X data), and the OBT and OVL thresholds ranged from 170 to 460. But when I run CABOG on the real data, the threshold is 5633:
Reset OBT mer threshold from auto to 5633.
Reset OVL mer threshold from auto to 5633.

I see the default setting of the threshold is 'auto'. How does CABOG determine this threshold? Is this number normal?

Thanks in advance.

Gorliver