From: mathog <ma...@ca...> - 2015-04-29 20:58:04
On 29-Apr-2015 13:28, Brian Walenz wrote:
> I think this is caused by having more than 2 GBytes of combined
> sequence ident lines. The fasta/fastq reader was written to allow
> random access to any sequence, and to allow it by the name of the
> sequence. When it was written, "big" was human dbEST which contained
> ~4 GB sequence IIRC. I don't miss the computers from back then, but I
> do miss the data sizes...

The header lines are not very large; they all look like this:

  @HISEQ:348:H2YWCBCXX:1:1101:1057:2031 1:Y:0:

The file is 217610498050 bytes, the reads are 150 bp, and that line is
44 characters, so the number of reads is about:

  217610498050 / (44+1+150+1+1+1+150+1) = 623525782

giving a total header length of roughly:

  623525782 * 45 = ~28 GB

which is just a wee bit bigger than 2 GB!

Isn't this bug going to cause meryl to blow up with this data in some
run modes of the assembler? In the assembler one could not (completely)
remove the sequence names, or the assembler could never find the pairs.
About half of that string is common to all ident strings, but removing
it would only reduce the total to ~14 GB of ident strings, and the issue
would remain.

In any case, for now I worked around this by using jellyfish instead of
meryl, like:

  jellyfish count -m 17 -C -s 800000000 -t 44 15659_all.fastq
  jellyfish histo -t 44 mer_counts.jf > mer_counts.histo

The first took a bit under 14 minutes and the second just under 5
minutes. Jellyfish didn't give any warnings or errors when it ran.

Thanks,

David Mathog
ma...@ca...
Manager, Sequence Analysis Facility, Biology Division, Caltech
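
P.S. A minimal Python sketch of the read-count arithmetic above, assuming
plain 4-line FASTQ records with single-character newlines; the file size,
ident-line length, and read length are the figures quoted in the message,
and the variable names are only illustrative:

  # Back-of-the-envelope check of the numbers above, assuming 4-line FASTQ
  # records (@ident, sequence, '+', quality), each line ending in one newline.
  FILE_BYTES = 217610498050   # size of 15659_all.fastq
  IDENT_LEN  = 44             # "@HISEQ:348:..." line, without its newline
  READ_LEN   = 150            # read length in bases

  # one record = ident\n + sequence\n + "+"\n + quality\n
  record_bytes = (IDENT_LEN + 1) + (READ_LEN + 1) + 2 + (READ_LEN + 1)   # 349

  reads       = FILE_BYTES // record_bytes   # ~623,525,782 reads
  ident_bytes = reads * (IDENT_LEN + 1)      # ~28e9 bytes of ident lines

  print("reads:", reads, "ident bytes:", ident_bytes)
  print("over a signed 32-bit limit:", ident_bytes > 2**31 - 1)   # True

Even with the shared half of each ident string stripped, the total would
only drop to roughly 14 GB, still far past a 2 GB (signed 32-bit) limit.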