From: Brian W. <th...@gm...> - 2015-04-29 21:20:03
The version of meryl used in the assembler accesses sequence in the gkpStore directly, so it skips the index building. Plus, the assembler does completely ignore the name, getting pairing from ordering in the fastq.

If you've got the memory for it, jellyfish is great. If it doesn't have enough memory, it seems to be slower than meryl. They both end up writing intermediate files to disk, and merging at the end.

I wonder if Illumina (the company) owns stock in Seagate and/or Western Digital.

On Wed, Apr 29, 2015 at 4:57 PM, mathog <ma...@ca...> wrote:

> On 29-Apr-2015 13:28, Brian Walenz wrote:
>
>> I think this is caused by having more than 2 GBytes of combined sequence
>> ident lines. The fasta/fastq reader was written to allow random access to
>> any sequence, and to allow it by the name of the sequence. When it was
>> written, "big" was human dbEST which contained ~4 GB sequence IIRC. I
>> don't miss the computers from back then, but I do miss the data sizes...
>
> The header lines are not very large, they all look like this:
>
> @HISEQ:348:H2YWCBCXX:1:1101:1057:2031 1:Y:0:
>
> The file is 217610498050 bytes, the reads are 150bp and that line is 44
> characters, so the number of reads is about:
>
> 217610498050/(44+1+150+1+1+1+150+1)
> 623525782
>
> giving a total header length of about:
>
> 623525782*45 = ~28 GB.
>
> which is just a wee bit bigger than 2 GB!
>
> Isn't this bug going to cause meryl to blow up with this data in some run
> modes of the assembler? In the assembler one could not (completely) remove
> the sequence names or the assembler could never find the pairs. About half
> of that string is common to all ident strings, but removing it would only
> reduce the total to 14 GB of ident strings, and the issue would remain.
>
> In any case, for now I worked around this by using jellyfish instead of
> meryl, like:
>
> jellyfish count -m 17 -C -s 800000000 -t 44 15659_all.fastq
> jellyfish histo -t 44 mer_counts.jf >mer_counts.histo
>
> The first took a bit under 14 minutes and the second just under 5
> minutes. Jellyfish didn't give any warnings or errors when it ran.
>
>
> Thanks,
>
> David Mathog
> ma...@ca...
> Manager, Sequence Analysis Facility, Biology Division, Caltech
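
For anyone wanting to repeat the header-size estimate from the quoted message on their own data, here is a minimal Python sketch of the same arithmetic. The file size, header length, and read length are the figures given in the message; the function name and the comparison against a signed 32-bit limit are assumptions added for illustration (the thread only says the problem appears above roughly 2 GBytes of ident lines, not what integer type the reader uses).

```python
# Figures from the quoted message.
FILE_BYTES = 217_610_498_050   # size of the fastq file in bytes
HEADER_LEN = 44                # characters in one "@HISEQ:..." ident line
READ_LEN = 150                 # bases per read

# Assumption for illustration only: a signed 32-bit offset tops out here.
INT32_MAX = 2**31 - 1


def estimate_header_bytes(file_bytes, header_len, read_len):
    """Estimate read count and total ident-line bytes for a 4-line fastq.

    Each record is header, sequence, "+", quality, each followed by one
    newline, which matches 44+1+150+1+1+1+150+1 in the message.
    """
    record_bytes = (header_len + 1) + (read_len + 1) + (1 + 1) + (read_len + 1)
    n_reads = file_bytes // record_bytes
    header_bytes = n_reads * (header_len + 1)
    return n_reads, header_bytes


n_reads, header_bytes = estimate_header_bytes(FILE_BYTES, HEADER_LEN, READ_LEN)
print("reads:", n_reads)
print("ident bytes:", header_bytes)
print("exceeds 2**31 - 1:", header_bytes > INT32_MAX)
```

This assumes every record has the same header and read length, as in the message, so the result is an approximation for files with variable-length reads or ident lines.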