From: Fabien C. <fa...@ca...> - 2012-06-30 18:07:56
On Fri, Jun 29, 2012 at 9:55 AM, Heng Li <lh...@sa...> wrote:

>
> On Jun 29, 2012, at 8:47 AM, Fabien Campagne wrote:
>
> Hello James,
>
> Thanks for your comments. I'll try to answer the various points you noted:
>
> 1. Compression/decompression speed. You note that Goby is in the ballpark,
> but I would like to note that what you measure includes
> a. conversion to BAM model
> b. compression
> and similarly, when you write back BAM from Goby:
> c. decompression,
> d. conversion from Goby encoding to BAM.
>
> If you wrote an alignment directly from an aligner with the Goby
> representation, you would not incur a. When you work directly with Goby
> alignments, you do not incur d. The cost of a. or d. turns out to dominate
> the cost of b. and c. in the conversions from/to BAM because Goby represents
> alignments slightly differently from BAM (we think the method we use is
> simpler and more extensible).
>
> If you were to measure compression only (you can do this with the
> concatenate alignment that will happily recompress an existing alignment
> with a different codec), or measure decompression only (e.g., timing the
> compact-file-stats mode that decompresses every entry of an alignment),
> you would probably find that Goby compression/decompression is much closer
> to samtools.
>
>
> We are not interested in decompression alone. What we care more about in
> practice is decoding, i.e. decompressing data and then representing the
> alignment in a data structure ready for use by other APIs. I am assuming
> that once goby decodes an alignment, it will take a similar amount of time,
> in comparison to Picard, to write the alignment in the SAM format.
>

Heng,

Your assumption is incorrect. We made different data representation
choices, and there is a cost for the conversions Goby <> SAM, which does
not exist with Picard since its representation is aligned with SAM.

To be more specific, we store spliced alignments as two aligned entries,
not one as is done in SAM/BAM.
This makes it possible to represent fusions natively without tricks
(e.g., see what the TopHat group had to do to store fusion info in SAM
format). We also store sequence variations differently: we don't store
CIGAR strings but instead have a list of sequence variation data
structures, which stores all the info. This list of differences is not
exhaustive. Goby diverged from BAM when we believed there was an
opportunity to improve program readability, improve program performance,
or simplify common tasks. Note that while the representations are
different, they provide similar functionality.

We designed Goby to store sequencing data as effectively as we can, so
that we can compute with it. It comes with its own APIs (which we think
are simpler to learn and use than Picard). The APIs encode/decode data
between disk and data structures in memory. The compression/decompression
steps I refer to obviously include encoding/decoding to data structures
(the Goby ones). Since the Goby data structures are different from the
ones used in BAM/SAM, programmers used to SAM may find that their
intuition about SAM is not very useful to predict performance with Goby.

Decoding performance can be measured for a Goby alignment with the
command:

    goby 3g compact-file-stats alignment.entries

[this will traverse the entire file to collect simple statistics; data
are completely decompressed and decoded to memory with the Goby API]

You will find that performance of this process is in the ballpark of
decoding an equivalent BAM alignment with samtools (when using the
hybrid-1 codec for best compression), or is faster than samtools (when
using the GZIP codec for best speed). The codec is an option that lets
users/developers control the tradeoffs for a particular application.

As with any new tool, it may be wise to read the documentation and, when
in doubt, ask questions. We'll be happy to offer additional
clarifications at the Goby forum user group.
You can look it up here
<https://groups.google.com/forum/?fromgroups#!forum/goby-framework>
or email directly: gob...@go...

We look forward to additional discussions with members of the SAM/BAM
developer community.

Best,
Fabien

> On Fri, Jun 29, 2012 at 7:40 AM, James Bonfield <jk...@sa...> wrote:
>
>> On Fri, Jun 29, 2012 at 09:56:45AM +0100, James Bonfield wrote:
>> > However it is indeed very slow compared to other alternatives. I've
>>
>> I take that back now - it was partly my input data. On a more sensible
>> set it operates reasonably.
>>
>> A quick test on the very shallow small test set from SeqSqueeze; about
>> 300,000 reads aligned against the human genome:
>>
>> Prog             Size          C.Time
>> --------------------------------------
>> samtools         28535830       6.2s
>> fqzcomp (low)    15682012       1.6s
>> fqzcomp (high)   15282395       2.8s
>> samcomp1         16222671       5.6s
>> samcomp1 -r       9743923       6.8s
>> goby             12742632[1]   22.8s
>> goby -g          12742632[1]   18.8s
>> CRAM             11152360[2]   41.2s
>>
>> [1] Lost 4.2% of the data, unmapped reads?
>> [2] No read names
>>
>> So there are a few oddities. Samtools is artificially high here as it
>> includes the auxiliary fields which other programs are not storing
>> either because they can't (fqzcomp, samcomp) or have been told not
>> to.
>>
>> fqzcomp is just a fastq compressor, so it stores even less. It shows
>> though the raw name, seq, qual size we can get.
>>
>> samcomp1 with and without a reference shows a substantial variation in
>> size, as expected. CRAM is somewhere between the two in ratio (and
>> excludes names, which took up about 810k in samcomp1). Goby is doing
>> great without a reference and bizarrely making no difference with one.
>>
>> I must be doing something wrong. It's the same fasta file I supplied
>> CRAM and samcomp1 with though so I'm sure it's correct. However it
>> just seems to have no impact on the result.
>>
>> Speed-wise Goby is faster than CRAM here.
>> Maybe the extreme low
>> coverage is being unfair as it perhaps is testing the time to load the
>> reference more than to load the data.
>>
>> Anyway, it's in the right ballpark.
>>
>> James
>>
>> --
>> James Bonfield (jk...@sa...)    | Hora aderat briligi. Nunc et Slythia Tova
>>                                 | Plurima gyrabant gymbolitare vabo;
>> A Staden Package developer:     | Et Borogovorum mimzebant undique formae,
>> https://sf.net/projects/staden/ | Momiferique omnes exgrabure Rathi.
>>
>>
>> --
>> The Wellcome Trust Sanger Institute is operated by Genome Research
>> Limited, a charity registered in England with number 1021457 and a
>> company registered in England with number 2742969, whose registered
>> office is 215 Euston Road, London, NW1 2BE.
>>
>> _______________________________________________
>> Samtools-devel mailing list
>> Sam...@li...
>> https://lists.sourceforge.net/lists/listinfo/samtools-devel
>>
>
> --
> Fabien Campagne, PhD -- http://campagnelab.org
>
> Assistant Professor, Dept. of Physiology and Biophysics
> Institute for Computational Biomedicine
> Associate Director, Biomedical Informatics Core,
> Clinical Translational Science Center
>
> Weill Medical College of Cornell University
> phone: (646)-962-5613   1305 York Avenue
> fax:   (646)-962-0383   Box 140
>                         New York, NY 10021
>
> Do you speak next-gen?
> See how GobyWeb can help simplify your NGS projects at
> http://gobyweb.campagnelab.org
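
To illustrate the representation difference discussed in the reply above
(a spliced alignment stored as two linked entries rather than one SAM
record whose CIGAR contains an N operation), here is a minimal Python
sketch of the SAM-to-entries direction of the conversion. It is only an
illustration under stated assumptions: the `Entry` class, its field names
and the `split_spliced` helper are hypothetical and are not Goby's actual
schema or API. Only the CIGAR operation letters come from the SAM format.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# SAM CIGAR operations: a run length followed by one operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


@dataclass
class Entry:
    """Hypothetical per-segment alignment entry (NOT Goby's real schema)."""
    query_index: int
    position: int            # reference start of this segment
    aligned_length: int      # reference bases covered by this segment
    fragment_index: int      # order of the segment within the read
    spliced_forward: Optional[int] = None  # fragment_index of next segment


def split_spliced(query_index: int, position: int, cigar: str) -> List[Entry]:
    """Split one SAM-style alignment into one entry per spliced segment."""
    entries: List[Entry] = []
    seg_start = position  # reference start of the current segment
    ref_pos = position    # running reference coordinate
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op == "N":
            # A skipped region closes the current segment and opens a new one.
            entries.append(
                Entry(query_index, seg_start, ref_pos - seg_start, len(entries))
            )
            ref_pos += n
            seg_start = ref_pos
        elif op in "MD=X":
            # These operations consume reference bases.
            ref_pos += n
        # I, S, H and P consume no reference bases, so ref_pos is unchanged.
    entries.append(
        Entry(query_index, seg_start, ref_pos - seg_start, len(entries))
    )
    # Link each segment to the one that follows it in the read.
    for current, following in zip(entries, entries[1:]):
        current.spliced_forward = following.fragment_index
    return entries


if __name__ == "__main__":
    for entry in split_spliced(7, 100, "50M1000N50M"):
        print(entry)
```

Going the other way (entries back to a single SAM record) means finding
the linked segments and rebuilding the CIGAR string, which is the kind of
conversion work that the reply says is not needed when a tool reads or
writes the entries directly.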