Re: [Psidev-ms-dev] mzML 0.99 remarks


We are a bit off topic but this is interesting. :)  To really assess the 
performance issues here you have to dig deeper than just heap 
fragmentation though.  Assuming a list to store the SpectrumHeaders and 
vectors to store m/z values and intensities, and without preallocation 
based on counts, because of the tree-like nature of mzML, you'd end up 
with a memory footprint like:
Spectrum1Header Spectrum1Mz1...P Spectrum1Inten1...P Spectrum2Header 
Spectrum2Mz1...P Spectrum2Inten1...P ... SpectrumNHeader 
SpectrumNMz1...P SpectrumNInten1...P

If you preallocated the SpectrumHeaders in the list based on the count 
attribute, you'd instead get a footprint like:
Spectrum1Header Spectrum2Header ... SpectrumNHeader Spectrum1Mz1...P 
Spectrum1Inten1...P ... SpectrumNMz1...P SpectrumNInten1...P
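
As a rough illustration of the preallocated layout, here is a minimal C++ sketch. The names Run, SpectrumHeader, makeRun and addSpectrum are hypothetical and not taken from any mzML library, and the count arguments stand in for whatever the file's count attributes report. Reserving the header storage up front keeps the headers in one contiguous block, while each spectrum's peak arrays are still allocated separately as they are read.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical header type; the fields are illustrative only.
struct SpectrumHeader {
    int scanNumber;
    int msLevel;
};

// Headers live in one vector; peak arrays live in per-spectrum vectors.
struct Run {
    std::vector<SpectrumHeader> headers;
    std::vector<std::vector<float>> mz;
    std::vector<std::vector<float>> inten;
};

// Reserve room for every spectrum before parsing any of them, so the
// header block is allocated once and never has to move.
Run makeRun(std::size_t spectrumCount) {
    Run run;
    run.headers.reserve(spectrumCount);
    run.mz.reserve(spectrumCount);
    run.inten.reserve(spectrumCount);
    return run;
}

// Append one spectrum, reserving its peak arrays from a known peak count.
void addSpectrum(Run& run, const SpectrumHeader& header, std::size_t peakCount) {
    run.headers.push_back(header);
    run.mz.emplace_back();
    run.mz.back().reserve(peakCount);
    run.inten.emplace_back();
    run.inten.back().reserve(peakCount);
}
```

Without the reserve calls, the same code still works, but the header storage may be reallocated and copied as it grows, and those allocations end up interleaved with the peak arrays.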

So you're going to have a tradeoff of fragmentation either way.  The 
fragmentation in the first case would be worse for quick sequential 
access to each SpectrumHeader, but better for accessing the peaks of a 
particular spectrum.  The fragmentation in the second case would be 
better for quick sequential access to each SpectrumHeader, but worse for 
accessing the peaks of a particular spectrum.  Access to the peaks could 
be further improved by storing the Mz and Inten values together (i.e. in 
a struct { float mz, inten; } ).  This is all fairly academic though, 
and I doubt this fragmentation has an appreciable performance impact 
on data with any kind of density to it.  So if you needed 
extremely responsive performance on very sparse spectra, you might think 
about this stuff, but most of us are far more limited by the sheer 
number of peaks.  And if extreme responsiveness is your goal, no 
conceivable XML format is going to help you!
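
For what it's worth, the idea of storing Mz and Inten together could look something like the sketch below. Peak, interleave and basePeakIndex are hypothetical names chosen for illustration, and this assumes single-precision values are acceptable. Each m/z value sits next to its intensity, so a scan over one spectrum's peaks reads one contiguous block instead of two separate arrays.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// One peak: the m/z value and its intensity stored side by side.
struct Peak {
    float mz;
    float inten;
};

// Combine separate m/z and intensity arrays into one array of peaks.
// If the arrays differ in length, the extra values are ignored.
std::vector<Peak> interleave(const std::vector<float>& mz,
                             const std::vector<float>& inten) {
    const std::size_t n = std::min(mz.size(), inten.size());
    std::vector<Peak> peaks;
    peaks.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        peaks.push_back(Peak{mz[i], inten[i]});
    }
    return peaks;
}

// Index of the most intense peak; returns 0 for an empty spectrum.
std::size_t basePeakIndex(const std::vector<Peak>& peaks) {
    std::size_t best = 0;
    for (std::size_t i = 1; i < peaks.size(); ++i) {
        if (peaks[i].inten > peaks[best].inten) {
            best = i;
        }
    }
    return best;
}
```

Whether this helps depends on the access pattern: code that only ever touches intensities would read half as much memory from the separate-array layout.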

-Matt

Brian Pratt wrote:
> Heap fragmentation has a performance cost that persists past the initial
> allocation(s), since it affects further allocations as well.  If it can be
> avoided with a relatively simple mechanism like this, that's a good thing.
>
> I started coding in 1977, FWIW.  Long enough to learn to prefer the simple
> solution over the one that requires a gestalt...
>
> To be fair, having done this stuff for a long time isn't really a predictor
> of me being any good at it, but I get by OK.
>
> - Brian
>
>
>
> -----Original Message-----
> From: psi...@li...
> [mailto:psi...@li...] On Behalf Of Mike
> Coleman
> Sent: Tuesday, October 09, 2007 9:21 AM
> To: Mass spectrometry standard development
> Subject: Re: [Psidev-ms-dev] mzML 0.99 remarks
>
> I can see why having a 'count' might make it easier for novice
> programmers to *write* a processing program, but I cannot see why
> having a 'count' would make more than a negligible difference in
> performance, if even that.  As a worst case, one could read the mzML
> file into memory, scan it once to calculate the count, and then
> proceed as before.  The additional time required to do a sweep through
> RAM would be trivial.
>
> Mike
>
>   
>