From: Angel P. <an...@ma...> - 2006-09-20 13:27:12
On Wednesday 20 September 2006 07:53, Steve Stein wrote:
> All,
>
> I also have the concerns expressed by Michael - transparency is important
> to us, but precision even more so. We have long stored our data in ASCII to
> avoid the problem, even though some judgement is sometimes necessary. As we
> know 1.0000 and 0.9999 are very different things, usually the former is
> really meant to be an integer. Also, abundances, since derived from ion
> counts, are 'naturally' integral, as m/z values are real - of course data
> systems need not conform to nature. I have dealt with MS formats where
> everything is, in effect, integral.
>
> In our library, for example, we want the users to see the values that we
> put there, so we use ASCII. It would be very desirable for us if the same
> were offered in the XML's - otherwise we will have to go non-standard.
>
> Perhaps the ultimate answer is some way of associating uncertainty with
> values, but I suppose this is a long way off.
>

Hmmm...... well, the XML schema base64Binary type can encode integer arrays.
In mzData 1.05 we defined the arrays as floats in the specification but not
in the schema, so this is not actually enforced. One could encode the
intenArrayBinary data as ints, but that would still be a non-standard usage.
It would be better to supply the integer intensities in the
supDataArrayBinary and describe the array in the supDataDesc tag.

So what I am getting at is that your use case is handled by mzData, but the
consumer of the data would have to know to use the supplementary data arrays
as the intensity values. Note that you would still have to specify the
intensity values in the intenArrayBinary as floats, since this is a
requirement of the schema.

angel

> -Steve Stein
>
> p.s. (this is NOT NIST speaking, just one of its employees).
>
> At 9/19/2006 07:56 PM Tuesday, Brian Pratt wrote:
> >Oh, and I forgot one extremely important thing: performance.
> >It's expensive converting those base 10 representations back to base 2
> >for number crunching, visualization etc. It's much cheaper to read them
> >directly as binary, even with the overhead of base64 decoding.
> >
> >Brian Pratt
> >www.insilicos.com/IPP
> >
> > > -----Original Message-----
> > > From: psi...@li...
> > > [mailto:psi...@li...] On
> > > Behalf Of Brian Pratt
> > > Sent: Tuesday, September 19, 2006 4:31 PM
> > > To: psi...@li...
> > > Subject: Re: [Psidev-ms-dev] Why base64?
> > >
> > > When we developed the mzXML format we went through the same
> > > questions. This is how I understood things:
> > >
> > > Readability: We as developers are an unusual use case. The more
> > > likely use case for these formats is in visualization or automated
> > > processing, neither of which require direct eyeballing of the peak
> > > lists under normal circumstances. Or at least that's how we saw it.
> > > If you do really need to eyeball the peak lists there are lots of
> > > tools available that will do the translation for you.
> > >
> > > Accuracy: Mass spec data in its raw form is generally stored in
> > > binary formats, since mass specs are front ended by binary
> > > computers. Conversion to and from base 10 human readable
> > > representations introduces error. It's best to hold the data at its
> > > original precision and translate out to human readable format at
> > > whatever precision is deemed useful for eyeballing.
> > >
> > > File size: Sure, you can make files smaller by throwing away
> > > precision, but as you begin to desire higher precision base64
> > > quickly becomes much more efficient. An excellent way to reduce
> > > file size is to compress the peaklists before base64'ing them, as
> > > is done in mzXML 3.0, and you do not sacrifice precision.
> > >
> > > Potential loss of precision information: That information
> > > wasn't ever there, really.
> > > Again, mass specs are front ended by binary computers, so that
> > > base 10 precision information (does '12345.099923123' mean
> > > '12345.1' or '12345.10' or '12345.100'?) wasn't ever in the
> > > datastream in the first place. The mass spec just wrote a bunch of
> > > 32 or 64 bit binary numbers to the best of its (base 2) ability.
> > > Looking at the bit patterns would be more revealing of the
> > > precision, and base64 preserves them. As a developer, you should be
> > > pleased that you don't have to wonder how many digits of that value
> > > are for real and not just an artifact of the base 2 to base 10
> > > formatting conversion - with base64 binary values you're working
> > > with the original raw data, so those artifacts aren't an issue.
> > >
> > > Hope this helps,
> > >
> > > Brian Pratt
> > > www.insilicos.com/IPP
> > >
> > > > -----Original Message-----
> > > > From: psi...@li...
> > > > [mailto:psi...@li...] On
> > > > Behalf Of Coleman, Michael
> > > > Sent: Tuesday, September 19, 2006 3:58 PM
> > > > To: Angel Pizarro; psi...@li...
> > > > Subject: Re: [Psidev-ms-dev] Why base64?
> > > >
> > > > > From: Angel Pizarro
> > > > >
> > > > > > 1. Loss of readability. ...
> > > > >
> > > > > There actually is a space for "human readable spectra" in the
> > > > > mzData format,
> > > >
> > > > I'm glad to hear that. I looked for this, but I did not see it in
> > > > the spec here
> > > >
> > > > http://psidev.sourceforge.net/ms/xml/mzdata/mzdata.html#element_mzData
> > > >
> > > > I was looking for something like 'mzArray' and 'intenArray' tags,
> > > > which would be the textual alternatives to 'mzArrayBinary' and
> > > > 'intenArrayBinary'. Can you point me to an example?
> > > >
> > > > > but really who reads individual mz and intensity values?
> > > >
> > > > Well--I do.
> > > > As a programmer I don't think it's an exaggeration to say that
> > > > I'm looking at the peak lists in our ms2 files every day. I find
> > > > being able to see at a glance that the peaks are basically sane,
> > > > and their gross attributes (precision, count, etc.) very useful.
> > > >
> > > > Of course, as a programmer I can easily whip up a script to decode
> > > > this file format. I suspect most users would be stymied, though,
> > > > and I think that that would be unfortunate. Since these files are
> > > > part of a chain of scientific argument, I think that as much as
> > > > possible they ought to be transparent and as open as possible to
> > > > verification by eyeball (mine and those of our scientists) and
> > > > alternative pieces of software.
> > > >
> > > > I'm not saying that this transparency is an absolute good.
> > > > Perhaps it is worth impairing so that we can have X, Y, and Z,
> > > > which are considered more valuable. I'm not seeing what X, Y, and
> > > > Z are, though.
> > > >
> > > > > > 2. Increased file size. ...
> > > > >
> > > > > Not a fair comparison. Most of the space in an mzData file is
> > > > > actually taken up by the human-readable parameters and parameter
> > > > > values of the spectra.
> > > >
> > > > Sorry, I should have been clearer. The numbers I gave were just
> > > > for the peak lists (base64 vs text) and nothing else--no tags, no
> > > > other metadata. The rest of the mzData fields would add more
> > > > overhead, but I have no objection about that part.
> > > >
> > > > If we implemented mzData here today, our files would be bigger if
> > > > we used the base64 encoding than if we used the textual numbers
> > > > (as they are in our ms2 files).
> > > >
> > > > > > 3. Potential loss of precision information. ...
> > > > >
> > > > > Actually the situation may be reversed. Thermofinnigan, for
> > > > > example, stores measured values coming off of the instrument
> > > > > as double precision floats, later formatting the numbers as
> > > > > needed with respect to the specific instrument's limit of
> > > > > detection. 12345.1 may have originally been 12345.099923123 in
> > > > > the vendor's proprietary format.
> > > >
> > > > Okay, but isn't '12345.1' what I really want to see in this case
> > > > (assuming that the vendor is correct about the instrument's
> > > > accuracy)? For this particular instance, the string '12345.1'
> > > > tells me what I need to know, and a double-precision floating
> > > > point value (e.g., 12345.10000000000036379) would sort of let me
> > > > guess it (since double-precision has significantly more
> > > > significant figures). But a single-precision value would leave me
> > > > in a sort of gray area. That is, does '12345.099923123' mean
> > > > '12345.1' or '12345.10' or '12345.100', for example?
> > > >
> > > > > I wrote an email a few days ago showing how to translate in Ruby
> > > > > the base64 arrays
> > > >
> > > > I saw it and it was quite useful to me. Part of the reason I'm
> > > > asking these questions is that I noticed in your examples that the
> > > > base64-encoded values actually took more space than the original
> > > > data.
> > > >
> > > > Just to reiterate my main question, it looks like using base64
> > > > will make mzData less usable and more complex, as compared to
> > > > straight text. What benefits come with it that offset these
> > > > drawbacks?
> > > >
> > > > Mike
> > > >
> > > > -------------------------------------------------------------------------
> > > > Take Surveys. Earn Cash.
> > > > Influence the Future of IT
> > > > Join SourceForge.net's Techsay panel and you'll get the chance to
> > > > share your opinions on IT & business topics through brief surveys
> > > > -- and earn cash
> > > > http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
> > > > _______________________________________________
> > > > Psidev-ms-dev mailing list
> > > > Psi...@li...
> > > > https://lists.sourceforge.net/lists/listinfo/psidev-ms-dev

--
Angel Pizarro
Director, Bioinformatics Facility
Institute for Translational Medicine and Therapeutics
University of Pennsylvania
806 BRB II/III
421 Curie Blvd.
Philadelphia, PA 19104-6160

P: 215-573-3736
F: 215-573-9004
E: an...@ma...
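
For anyone who wants to eyeball a base64 peak array as discussed in this thread, here is a minimal Python sketch of the round trip. It is an illustration under stated assumptions, not an mzData reader: it assumes the payload is a flat run of IEEE 754 floats (32 or 64 bit) in a known byte order, and the function names and the `precision` and `little_endian` parameters are hypothetical, chosen only for this example.

```python
import base64
import struct


def _struct_format(count, precision, little_endian):
    """Build a struct format string for `count` floats (hypothetical helper)."""
    if precision not in (32, 64):
        raise ValueError("precision must be 32 or 64")
    byte_order = "<" if little_endian else ">"
    type_code = "f" if precision == 32 else "d"
    return byte_order + type_code * count


def encode_floats(values, precision=32, little_endian=True):
    """Pack floats as raw IEEE 754 bytes and return them as base64 text."""
    fmt = _struct_format(len(values), precision, little_endian)
    return base64.b64encode(struct.pack(fmt, *values)).decode("ascii")


def decode_floats(text, precision=32, little_endian=True):
    """Decode base64 text back into a list of Python floats."""
    raw = base64.b64decode(text)
    width = precision // 8  # bytes per value
    if len(raw) % width:
        raise ValueError("payload length is not a multiple of the value width")
    fmt = _struct_format(len(raw) // width, precision, little_endian)
    return list(struct.unpack(fmt, raw))


if __name__ == "__main__":
    # The value used as an example earlier in the thread.
    for bits in (32, 64):
        encoded = encode_floats([12345.1], precision=bits)
        decoded = decode_floats(encoded, precision=bits)[0]
        print(bits, "bit:", encoded, "->", repr(decoded))
```

Run as a script, this shows the point about single precision: the 64-bit round trip returns the same double that the literal 12345.1 produces, while the 32-bit value comes back as roughly 12345.0996, so a reader cannot tell from the bits alone how many decimal digits were intended. On size, base64 emits 4 characters for every 3 bytes, so three 32-bit values (12 bytes) occupy 16 characters regardless of how many decimal digits a text rendering would need.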