From: Matthew C. <mat...@va...> - 2007-10-09 18:45:40
We are a bit off topic but this is interesting. :) To really assess the
performance issues here you have to dig deeper than just heap
fragmentation though. Assuming a list to store the SpectrumHeaders and
vectors to store m/z values and intensities, and without preallocation based on
counts, because of the tree-like nature of mzML, you'd end up with a
memory footprint like:
Spectrum1Header Spectrum1Mz1...P Spectrum1Inten1...P Spectrum2Header
Spectrum2Mz1...P Spectrum2Inten1...P ... SpectrumNHeader
SpectrumNMz1...P SpectrumNInten1...P
If you preallocated the SpectrumHeaders in the list based on the count
attribute, you'd instead get a footprint like:
Spectrum1Header Spectrum2Header ... SpectrumNHeader Spectrum1Mz1...P
Spectrum1Inten1...P ... SpectrumNMz1...P SpectrumNInten1...P
So you're going to have a tradeoff of fragmentation either way. The
fragmentation in the first case would be worse for quick sequential
access to each SpectrumHeader, but better for accessing the peaks of a
particular spectrum. The fragmentation in the second case would be
better for quick sequential access to each SpectrumHeader, but worse for
accessing the peaks of a particular spectrum. Access to the peaks could
be further improved by storing the Mz and Inten values together (i.e. in
a struct { float mz, inten; } ). This is all incredibly superfluous
though and I doubt this fragmentation has an appreciable performance
impact on data with any kind of density to it. So if you needed
extremely responsive performance on very sparse spectra, you might think
about this stuff, but most of us are far more limited by the sheer
number of peaks. And if extreme responsiveness is your goal, no
conceivable XML format is going to help you!
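
To make the second layout and the struct suggestion concrete, here is a minimal C++ sketch. It is only an illustration under stated assumptions: the names Peak, SpectrumHeader, Spectrum, makeSpectra and totalIntensity are hypothetical and not taken from any mzML reader, and the spectrum and peak counts are assumed to be known up front (for example from the count attributes) so that storage can be reserved once. The peak values are placeholders.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical types for illustration only.
struct Peak {
    float mz;     // m/z value
    float inten;  // intensity, stored next to its m/z
};

struct SpectrumHeader {
    int id;
    std::size_t peakCount;
};

struct Spectrum {
    SpectrumHeader header;
    std::vector<Peak> peaks;  // one allocation per spectrum, mz/inten interleaved
};

// Build a spectrum list, reserving storage from counts assumed known in advance.
std::vector<Spectrum> makeSpectra(std::size_t spectrumCount,
                                  std::size_t peaksPerSpectrum) {
    std::vector<Spectrum> spectra;
    spectra.reserve(spectrumCount);  // headers end up in one contiguous block
    for (std::size_t i = 0; i < spectrumCount; ++i) {
        Spectrum s;
        s.header = SpectrumHeader{static_cast<int>(i + 1), peaksPerSpectrum};
        s.peaks.reserve(peaksPerSpectrum);  // avoids regrowth while filling
        for (std::size_t p = 0; p < peaksPerSpectrum; ++p) {
            // Placeholder data: m/z = 100 + p, intensity = 1.
            s.peaks.push_back(Peak{100.0f + static_cast<float>(p), 1.0f});
        }
        spectra.push_back(std::move(s));
    }
    return spectra;
}

// Sum the intensities of one spectrum; this walks a single contiguous block.
float totalIntensity(const Spectrum& s) {
    assert(s.peaks.size() == s.header.peakCount);
    float sum = 0.0f;
    for (const Peak& p : s.peaks) {
        sum += p.inten;
    }
    return sum;
}
```

With this arrangement, iterating over spectra[i].header stays within one block, while each spectrum's peaks live in a separate allocation, which roughly matches the second footprint above. Whether any of this is measurable would depend on the data and the allocator.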
-Matt
Brian Pratt wrote:
> Heap fragmentation has a performance cost that persists past the initial
> allocation(s), since it affects further allocations as well. If it can be
> avoided with a relatively simple mechanism like this, that's a good thing.
>
> I started coding in 1977, FWIW. Long enough to learn to prefer the simple
> solution over the one that requires a gestalt...
>
> To be fair, having done this stuff for a long time isn't really a predictor
> of me being any good at it, but I get by OK.
>
> - Brian
>
>
>
> -----Original Message-----
> From: psi...@li...
> [mailto:psi...@li...] On Behalf Of Mike
> Coleman
> Sent: Tuesday, October 09, 2007 9:21 AM
> To: Mass spectrometry standard development
> Subject: Re: [Psidev-ms-dev] mzML 0.99 remarks
>
> I can see why having a 'count' might make it easier for novice
> programmers to *write* a processing program, but I cannot see why
> having a 'count' would make more than a negligible difference in
> performance, if even that. As a worst case, one could read the mzML
> file into memory, scan it once to calculate the count, and then
> proceed as before. The additional time required to do a sweep through
> RAM would be trivial.
>
> Mike
>
>
>