From: Matthew C. <mat...@va...> - 2007-10-09 18:45:40
We are a bit off topic but this is interesting. :) To really assess the
performance issues here you have to dig deeper than just heap fragmentation
though.

Assume a list to store the SpectrumHeaders and vectors to store the m/z values
and intensities, with no preallocation based on counts. Because of the
tree-like nature of mzML, you'd end up with a memory footprint like:

  Spectrum1Header Spectrum1Mz1...P Spectrum1Inten1...P
  Spectrum2Header Spectrum2Mz1...P Spectrum2Inten1...P
  ...
  SpectrumNHeader SpectrumNMz1...P SpectrumNInten1...P

If you preallocated the SpectrumHeaders in the list based on the count
attribute, you'd instead get a footprint like:

  Spectrum1Header Spectrum2Header ... SpectrumNHeader
  Spectrum1Mz1...P Spectrum1Inten1...P
  ...
  SpectrumNMz1...P SpectrumNInten1...P

So you're going to have a tradeoff of fragmentation either way. The
fragmentation in the first case would be worse for quick sequential access to
each SpectrumHeader, but better for accessing the peaks of a particular
spectrum. The fragmentation in the second case would be better for quick
sequential access to each SpectrumHeader, but worse for accessing the peaks of
a particular spectrum. Access to the peaks could be further improved by
storing the Mz and Inten values together (i.e. in a
struct { float mz, inten; } ).

This is all incredibly superfluous though, and I doubt this fragmentation has
an appreciable performance impact on data with any kind of density to it. So
if you needed extremely responsive performance on very sparse spectra, you
might think about this stuff, but most of us are far more limited by the sheer
number of peaks. And if extreme responsiveness is your goal, no conceivable
XML format is going to help you!

-Matt

Brian Pratt wrote:
> Heap fragmentation has a performance cost that persists past the initial
> allocation(s), since it affects further allocations as well. If it can be
> avoided with a relatively simple mechanism like this, that's a good thing.
>
> I started coding in 1977, FWIW. Long enough to learn to prefer the simple
> solution over the one that requires a gestalt...
>
> To be fair, having done this stuff for a long time isn't really a predictor
> of me being any good at it, but I get by OK.
>
> - Brian
>
> -----Original Message-----
> From: psi...@li...
> [mailto:psi...@li...] On Behalf Of Mike
> Coleman
> Sent: Tuesday, October 09, 2007 9:21 AM
> To: Mass spectrometry standard development
> Subject: Re: [Psidev-ms-dev] mzML 0.99 remarks
>
> I can see why having a 'count' might make it easier for novice
> programmers to *write* a processing program, but I cannot see why
> having a 'count' would make more than a negligible difference in
> performance, if even that. As a worst case, one could read the mzML
> file into memory, scan it once to calculate the count, and then
> proceed as before. The additional time required to do a sweep through
> RAM would be trivial.
>
> Mike
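
For anyone who wants to experiment with the peak layouts discussed in this thread, here is a minimal C++ sketch. It is only an illustration under stated assumptions: the type and function names (Peak, SpectrumSplit, SpectrumInterleaved, interleave, sumIntensity) are hypothetical and do not come from any mzML library. It contrasts separate m/z and intensity vectors with the interleaved struct { float mz, inten; } layout, and uses reserve() the way a reader could use a 'count' attribute to allocate once up front.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical types for illustration; not taken from any mzML library.
struct Peak {
    float mz;
    float inten;
};

// Layout A: separate m/z and intensity arrays (two allocations per spectrum).
struct SpectrumSplit {
    std::vector<float> mz;
    std::vector<float> inten;
};

// Layout B: interleaved peaks (one allocation per spectrum).
struct SpectrumInterleaved {
    std::vector<Peak> peaks;
};

// Convert the split layout to the interleaved one.
// The reserve() call stands in for a known 'count': the vector is sized
// once instead of regrowing as peaks are appended.
SpectrumInterleaved interleave(const SpectrumSplit& s) {
    assert(s.mz.size() == s.inten.size());
    SpectrumInterleaved out;
    out.peaks.reserve(s.mz.size());
    for (std::size_t i = 0; i < s.mz.size(); ++i) {
        out.peaks.push_back(Peak{s.mz[i], s.inten[i]});
    }
    return out;
}

// Sum the intensities of peaks whose m/z lies in [lo, hi].
// Each peak's m/z and intensity sit next to each other in memory.
float sumIntensity(const SpectrumInterleaved& s, float lo, float hi) {
    float total = 0.0f;
    for (const Peak& p : s.peaks) {
        if (p.mz >= lo && p.mz <= hi) {
            total += p.inten;
        }
    }
    return total;
}
```

Whether either layout makes a measurable difference depends on the data; as noted above, for dense spectra the sheer number of peaks is likely to dominate, so this is something to profile rather than assume.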