From: Brian P. <bri...@in...> - 2006-09-19 23:56:30
Oh, and I forgot one extremely important thing: performance. It's expensive
converting those base 10 representations back to base 2 for number
crunching, visualization, etc. It's much cheaper to read them directly as
binary, even with the overhead of base64 decoding.

Brian Pratt
www.insilicos.com/IPP

> -----Original Message-----
> From: psi...@li...
> [mailto:psi...@li...] On Behalf Of Brian Pratt
> Sent: Tuesday, September 19, 2006 4:31 PM
> To: psi...@li...
> Subject: Re: [Psidev-ms-dev] Why base64?
>
> When we developed the mzXML format we went through the same questions.
> This is how I understood things:
>
> Readability: We as developers are an unusual use case. The more likely
> use case for these formats is in visualization or automated processing,
> neither of which require direct eyeballing of the peak lists under
> normal circumstances. Or at least that's how we saw it. If you do
> really need to eyeball the peak lists there are lots of tools available
> that will do the translation for you.
>
> Accuracy: Mass spec data in its raw form is generally stored in binary
> formats, since mass specs are front ended by binary computers.
> Conversion to and from base 10 human readable representations
> introduces error. It's best to hold the data at its original precision
> and translate out to human readable format at whatever precision is
> deemed useful for eyeballing.
>
> File size: Sure, you can make files smaller by throwing away precision,
> but as you begin to desire higher precision base64 quickly becomes much
> more efficient. An excellent way to reduce file size is to compress the
> peaklists before base64'ing them, as is done in mzXML 3.0, and you do
> not sacrifice precision.
>
> Potential loss of precision information: That information wasn't ever
> there, really. Again, mass specs are front ended by binary computers,
> so that base 10 precision information (does '12345.099923123' mean
> '12345.1' or '12345.10' or '12345.100'?)
> wasn't ever in the datastream in the first place. The mass spec just
> wrote a bunch of 32 or 64 bit binary numbers to the best of its (base 2)
> ability. Looking at the bit patterns would be more revealing of the
> precision, and base64 preserves them. As a developer, you should be
> pleased that you don't have to wonder how many digits of that value are
> for real and not just an artifact of the base 2 to base 10 formatting
> conversion - with base64 binary values you're working with the original
> raw data, so those artifacts aren't an issue.
>
> Hope this helps,
>
> Brian Pratt
> www.insilicos.com/IPP
>
> > -----Original Message-----
> > From: psi...@li...
> > [mailto:psi...@li...] On Behalf Of Coleman, Michael
> > Sent: Tuesday, September 19, 2006 3:58 PM
> > To: Angel Pizarro; psi...@li...
> > Subject: Re: [Psidev-ms-dev] Why base64?
> >
> > > From: Angel Pizarro
> >
> > > > 1. Loss of readability. ...
> >
> > > There actually is a space for "human readable spectra" in the
> > > mzData format,
> >
> > I'm glad to hear that. I looked for this, but I did not see it in the
> > spec here
> >
> > http://psidev.sourceforge.net/ms/xml/mzdata/mzdata.html#element_mzData
> >
> > I was looking for something like 'mzArray' and 'intenArray' tags,
> > which would be the textual alternatives to 'mzArrayBinary' and
> > 'intenArrayBinary'. Can you point me to an example?
> >
> > > but really who reads individual mz and intensity values?
> >
> > Well--I do. As a programmer I don't think it's an exaggeration to say
> > that I'm looking at the peak lists in our ms2 files every day. I find
> > being able to see at a glance that the peaks are basically sane, and
> > their gross attributes (precision, count, etc.) very useful.
> >
> > Of course, as a programmer I can easily whip up a script to decode this
> > file format. I suspect most users would be stymied, though, and I think
> > that that would be unfortunate.
> > Since these files are part of a chain
> > of scientific argument, I think that as much as possible they ought to
> > be transparent and as open as possible to verification by eyeball (mine
> > and those of our scientists) and alternative pieces of software.
> >
> > I'm not saying that this transparency is an absolute good. Perhaps it
> > is worth impairing so that we can have X, Y, and Z, which are considered
> > more valuable. I'm not seeing what X, Y, and Z are, though.
> >
> >
> > > > 2. Increased file size. ...
> >
> > > Not a fair comparison. Most of the space in an mzData file is
> > > actually taken up by the human-readable parameters and parameter
> > > values of the spectra.
> >
> > Sorry, I should have been clearer. The numbers I gave were just for the
> > peak lists (base64 vs text) and nothing else--no tags, no other
> > metadata. The rest of the mzData fields would add more overhead, but I
> > have no objection about that part.
> >
> > If we implemented mzData here today, our files would be bigger if we
> > used the base64 encoding than if we used the textual numbers (as they
> > are in our ms2 files).
> >
> >
> > > > 3. Potential loss of precision information. ...
> >
> > > Actually the situation may be reversed. Thermofinnigan, for
> > > example, stores measured values coming off of the instrument
> > > as double precision floats, later formatting the numbers as
> > > needed with respect to the specific instrument's limit of detection.
> > > 12345.1 may have originally been 12345.099923123 in the vendor's
> > > proprietary format.
> >
> > Okay, but isn't '12345.1' what I really want to see in this case
> > (assuming that the vendor is correct about the instrument's accuracy)?
> > For this particular instance, the string '12345.1' tells me what I need
> > to know, and a double-precision floating point value (e.g.,
> > 12345.10000000000036379) would sort of let me guess it (since
> > double-precision has significantly more significant figures). But a
> > single-precision value would leave me in a sort of gray area. That is,
> > does '12345.099923123' mean '12345.1' or '12345.10' or '12345.100', for
> > example?
> >
> > > I wrote an email a few days ago showing how to translate in ruby
> > > the base64 arrays
> >
> > I saw it and it was quite useful to me. Part of the reason I'm asking
> > these questions is that I noticed in your examples that the
> > base64-encoded values actually took more space than the original data.
> >
> > Just to reiterate my main question, it looks like using base64 will make
> > mzData less usable and more complex, as compared to straight text. What
> > benefits come with it that offset these drawbacks?
> >
> > Mike
> >
> > -------------------------------------------------------------------------
> > Take Surveys. Earn Cash. Influence the Future of IT
> > Join SourceForge.net's Techsay panel and you'll get the chance to share your
> > opinions on IT & business topics through brief surveys -- and earn cash
> > http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
> > _______________________________________________
> > Psidev-ms-dev mailing list
> > Psi...@li...
> > https://lists.sourceforge.net/lists/listinfo/psidev-ms-dev
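
For anyone who wants to try the accuracy and file size arguments from this thread on their own data, here is a minimal Python sketch. It is only an illustration, not the mzXML or mzData reference code: the big-endian 32-bit float layout, the helper names (encode_peaks, decode_peaks, as_float32) and the sample m/z values are all assumptions made for the example. It packs a short peak list as binary floats, base64 encodes it (optionally zlib-compressing first), decodes it again, and compares that with a pass through one-decimal text.

```python
import base64
import struct
import zlib


def encode_peaks(values, compress=False):
    """Pack values as big-endian 32-bit floats (assumed layout), then base64."""
    raw = struct.pack(f">{len(values)}f", *values)
    if compress:
        raw = zlib.compress(raw)  # compress the binary data before base64
    return base64.b64encode(raw).decode("ascii")


def decode_peaks(text, compressed=False):
    """Reverse encode_peaks: base64 decode, optionally inflate, unpack floats."""
    raw = base64.b64decode(text)
    if compressed:
        raw = zlib.decompress(raw)
    return list(struct.unpack(f">{len(raw) // 4}f", raw))


def as_float32(x):
    """Round a Python float to the nearest 32-bit float value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]


# Hypothetical m/z values, rounded to what a 32-bit float can actually hold.
peaks = [as_float32(v) for v in (12345.099923123, 445.12003, 1021.5567, 88.03931)]

# Binary route: the decoded values are bit-for-bit the stored values.
encoded = encode_peaks(peaks)
binary_roundtrip = decode_peaks(encoded)

# Text route at one decimal place: precision is thrown away.
text_1dp = " ".join(f"{v:.1f}" for v in peaks)
text_roundtrip = [as_float32(float(t)) for t in text_1dp.split()]

# Compressed route, in the spirit of compressing before base64'ing.
compressed_roundtrip = decode_peaks(encode_peaks(peaks, compress=True),
                                    compressed=True)

print("base64 exact:", binary_roundtrip == peaks)
print("1-decimal text exact:", text_roundtrip == peaks)
print("base64 chars:", len(encoded), "text chars:", len(text_1dp))
```

With only four peaks, the one-decimal text is no larger than the base64 string and zlib has almost nothing to work with, so the size side of the argument is best judged by running this on a real peak list at the precision you actually need.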