From: mathog <ma...@ca...> - 2015-04-29 20:58:04
On 29-Apr-2015 13:28, Brian Walenz wrote:
> I think this is caused by having more than 2 GBytes of combined
> sequence ident lines. The fasta/fastq reader was written to allow
> random access to any sequence, and to allow it by the name of the
> sequence. When it was written, "big" was human dbEST which contained
> ~4 GB sequence IIRC. I don't miss the computers from back then, but I
> do miss the data sizes...

The header lines are not very large; they all look like this:

  @HISEQ:348:H2YWCBCXX:1:1101:1057:2031 1:Y:0:

The file is 217610498050 bytes, the reads are 150 bp, and that line is
44 characters, so the number of reads is about:

  217610498050 / (44+1+150+1+1+1+150+1) = 623525782

giving a total header length of roughly:

  623525782 * 45 = ~28 GB

which is just a wee bit bigger than 2 GB!

Isn't this bug going to cause meryl to blow up with this data in some
run modes of the assembler? In the assembler one could not (completely)
remove the sequence names, or the assembler could never find the pairs.
About half of that string is common to all ident strings, but removing
it would only reduce the total to ~14 GB of ident strings, and the issue
would remain.

In any case, for now I worked around this by using jellyfish instead of
meryl, like:

  jellyfish count -m 17 -C -s 800000000 -t 44 15659_all.fastq
  jellyfish histo -t 44 mer_counts.jf > mer_counts.histo

The first took a bit under 14 minutes and the second just under 5
minutes. Jellyfish didn't give any warnings or errors when it ran.

Thanks,

David Mathog
ma...@ca...
Manager, Sequence Analysis Facility, Biology Division, Caltech
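
P.S. A minimal Python sketch of the read-count arithmetic above, assuming
plain 4-line FASTQ records with single-character newlines; the file size,
ident-line length, and read length are the figures quoted in the message,
and the variable names are only illustrative:

  # Back-of-the-envelope check of the numbers above, assuming 4-line FASTQ
  # records (@ident, sequence, '+', quality), each line ending in one newline.
  FILE_BYTES = 217610498050   # size of 15659_all.fastq
  IDENT_LEN  = 44             # "@HISEQ:348:..." line, without its newline
  READ_LEN   = 150            # read length in bases

  # one record = ident\n + sequence\n + "+"\n + quality\n
  record_bytes = (IDENT_LEN + 1) + (READ_LEN + 1) + 2 + (READ_LEN + 1)   # 349

  reads       = FILE_BYTES // record_bytes   # ~623,525,782 reads
  ident_bytes = reads * (IDENT_LEN + 1)      # ~28e9 bytes of ident lines

  print("reads:", reads, "ident bytes:", ident_bytes)
  print("over a signed 32-bit limit:", ident_bytes > 2**31 - 1)   # True

Even with the shared half of each ident string stripped, the total would
only drop to roughly 14 GB, still far past a 2 GB (signed 32-bit) limit.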