From: Brian P. <bri...@in...> - 2006-09-19 23:31:56
When we developed the mzXML format we went through the same questions. This is how I understood things:

Readability: We as developers are an unusual use case. The more likely use case for these formats is visualization or automated processing, neither of which requires direct eyeballing of the peak lists under normal circumstances. Or at least that's how we saw it. If you really do need to eyeball the peak lists, there are lots of tools available that will do the translation for you.

Accuracy: Mass spec data in its raw form is generally stored in binary formats, since mass specs are front-ended by binary computers. Conversion to and from base 10 human-readable representations can introduce rounding error. It's best to hold the data at its original precision and translate out to a human-readable format at whatever precision is deemed useful for eyeballing.

File size: Sure, you can make files smaller by throwing away precision, but as you begin to desire higher precision, base64 quickly becomes much more efficient. An excellent way to reduce file size is to compress the peak lists before base64'ing them, as is done in mzXML 3.0, and you do not sacrifice precision.

Potential loss of precision information: That information wasn't ever there, really. Again, mass specs are front-ended by binary computers, so that base 10 precision information (does '12345.099923123' mean '12345.1' or '12345.10' or '12345.100'?) wasn't ever in the datastream in the first place. The mass spec just wrote a bunch of 32 or 64 bit binary numbers to the best of its (base 2) ability. Looking at the bit patterns would be more revealing of the precision, and base64 preserves them. As a developer, you should be pleased that you don't have to wonder how many digits of that value are for real and not just an artifact of the base 2 to base 10 formatting conversion. With base64 binary values you're working with the original raw data, so those artifacts aren't an issue.
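To make the accuracy and file size points above concrete, here is a minimal Python sketch. It is not taken from any mzXML or mzData tool: the helper names `encode_peaks` and `decode_peaks` are hypothetical, the peak values are made up (apart from the example value discussed above), and the choice of big-endian byte order and zlib for the compression step are assumptions made for the illustration.

```python
import base64
import struct
import zlib


def encode_peaks(values, fmt="d", compress=False):
    """Pack floats as big-endian binary ("f" = 32-bit, "d" = 64-bit), then base64."""
    raw = struct.pack(f"!{len(values)}{fmt}", *values)
    if compress:
        raw = zlib.compress(raw)  # assumption: zlib stands in for the compression step
    return base64.b64encode(raw).decode("ascii")


def decode_peaks(text, fmt="d", compressed=False):
    """Reverse encode_peaks: base64 -> (optional inflate) -> list of floats."""
    raw = base64.b64decode(text)
    if compressed:
        raw = zlib.decompress(raw)
    count = len(raw) // struct.calcsize(f"!{fmt}")
    return list(struct.unpack(f"!{count}{fmt}", raw))


# Made-up peak values; the first is the example value from the discussion.
peaks = [12345.099923123, 445.120025634, 1021.5]

# base64 round trip: every value comes back bit-for-bit identical.
roundtrip = decode_peaks(encode_peaks(peaks))

# Text at one decimal place: smaller, but the original values are gone.
as_text = " ".join(f"{v:.1f}" for v in peaks)
reparsed = [float(t) for t in as_text.split()]

# Text with 17 significant digits round-trips a double, at a cost in size.
full_text = " ".join(f"{v:.17g}" for v in peaks)

print(roundtrip == peaks, reparsed == peaks)
print(len(encode_peaks(peaks)), len(as_text), len(full_text))
```

With only three values the compressed form is likely to be larger than the uncompressed one because of zlib's fixed overhead; any size benefit should only be expected on peak lists of realistic length, and it depends on the data.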
Hope this helps,
Brian Pratt
www.insilicos.com/IPP

> -----Original Message-----
> From: psi...@li... [mailto:psi...@li...] On Behalf Of Coleman, Michael
> Sent: Tuesday, September 19, 2006 3:58 PM
> To: Angel Pizarro; psi...@li...
> Subject: Re: [Psidev-ms-dev] Why base64?
>
> > From: Angel Pizarro
> >
> > 1. Loss of readability. ...
>
> > There actually is a space for "human readable spectra" in the
> > mzData format,
>
> I'm glad to hear that. I looked for this, but I did not see it in the
> spec here
>
> http://psidev.sourceforge.net/ms/xml/mzdata/mzdata.html#element_mzData
>
> I was looking for something like 'mzArray' and 'intenArray' tags,
> which would be the textual alternatives to 'mzArrayBinary' and
> 'intenArrayBinary'. Can you point me to an example?
>
> > but really who reads individual mz and intensity values?
>
> Well--I do. As a programmer I don't think it's an exaggeration to say
> that I'm looking at the peak lists in our ms2 files every day. I find
> being able to see at a glance that the peaks are basically sane, and
> their gross attributes (precision, count, etc.), very useful.
>
> Of course, as a programmer I can easily whip up a script to decode this
> file format. I suspect most users would be stymied, though, and I think
> that that would be unfortunate. Since these files are part of a chain
> of scientific argument, I think that as much as possible they ought to
> be transparent and as open as possible to verification by eyeball (mine
> and those of our scientists) and alternative pieces of software.
>
> I'm not saying that this transparency is an absolute good. Perhaps it
> is worth impairing so that we can have X, Y, and Z, which are considered
> more valuable. I'm not seeing what X, Y, and Z are, though.
>
> > 2. Increased file size. ...
>
> > Not a fair comparison. Most of the space in an mzData file is
> > actually taken up by the human-readable parameters and parameter
> > values of the spectra.
>
> Sorry, I should have been clearer. The numbers I gave were just for the
> peak lists (base64 vs text) and nothing else--no tags, no other
> metadata. The rest of the mzData fields would add more overhead, but I
> have no objection to that part.
>
> If we implemented mzData here today, our files would be bigger if we
> used the base64 encoding than if we used the textual numbers (as they
> are in our ms2 files).
>
> > 3. Potential loss of precision information. ...
>
> > Actually the situation may be reversed. Thermofinnigan, for
> > example, stores measured values coming off of the instrument
> > as double precision floats, later formatting the numbers as
> > needed with respect to the specific instrument's limit of detection.
> > 12345.1 may have originally been 12345.099923123 in the vendor's
> > proprietary format.
>
> Okay, but isn't '12345.1' what I really want to see in this case
> (assuming that the vendor is correct about the instrument's accuracy)?
> For this particular instance, the string '12345.1' tells me what I need
> to know, and a double-precision floating point value (e.g.,
> 12345.10000000000036379) would sort of let me guess it (since
> double-precision has significantly more significant figures). But a
> single-precision value would leave me in a sort of gray area. That is,
> does '12345.099923123' mean '12345.1' or '12345.10' or '12345.100', for
> example?
>
> > I wrote an email a few days ago showing how to translate in ruby
> > the base64 arrays
>
> I saw it and it was quite useful to me. Part of the reason I'm asking
> these questions is that I noticed in your examples that the
> base64-encoded values actually took more space than the original data.
>
> Just to reiterate my main question, it looks like using base64 will make
> mzData less usable and more complex, as compared to straight text. What
> benefits come with it that offset these drawbacks?
>
> Mike