From: Coleman, M. <MK...@St...> - 2006-09-19 22:58:23
|
> From: Angel Pizarro
> > 1. Loss of readability. ...
> There actually is a space for "human readable spectra" in the
> mzData format,

I'm glad to hear that.  I looked for this, but I did not see it in the
spec here

    http://psidev.sourceforge.net/ms/xml/mzdata/mzdata.html#element_mzData

I was looking for something like 'mzArray' and 'intenArray' tags, which
would be the textual alternatives to 'mzArrayBinary' and
'intenArrayBinary'.  Can you point me to an example?

> but really who reads individual mz and intensity values?

Well--I do.  As a programmer I don't think it's an exaggeration to say
that I'm looking at the peak lists in our ms2 files every day.  I find
it very useful to be able to see at a glance that the peaks are
basically sane, and to check their gross attributes (precision, count,
etc.).

Of course, as a programmer I can easily whip up a script to decode this
file format.  I suspect most users would be stymied, though, and I think
that that would be unfortunate.  Since these files are part of a chain
of scientific argument, I think that as much as possible they ought to
be transparent and as open as possible to verification by eyeball (mine
and those of our scientists) and by alternative pieces of software.

I'm not saying that this transparency is an absolute good.  Perhaps it
is worth impairing so that we can have X, Y, and Z, which are considered
more valuable.  I'm not seeing what X, Y, and Z are, though.

> > 2. Increased file size. ...
> Not a fair comparison.  Most of the space in an mzData file is
> actually taken up by the human-readable parameters and parameter
> values of the spectra.

Sorry, I should have been clearer.  The numbers I gave were just for the
peak lists (base64 vs text) and nothing else--no tags, no other
metadata.  The rest of the mzData fields would add more overhead, but I
have no objection about that part.
If we implemented mzData here today, our files would be bigger if we
used the base64 encoding than if we used the textual numbers (as they
are in our ms2 files).

> > 3. Potential loss of precision information. ...
> Actually the situation may be reversed.  Thermofinnigan, for
> example, stores measured values coming off of the instrument
> as double precision floats, later formatting the numbers as
> needed with respect to the specific instrument's limit of detection.
> 12345.1 may have originally been 12345.099923123 in the vendor's
> proprietary format.

Okay, but isn't '12345.1' what I really want to see in this case
(assuming that the vendor is correct about the instrument's accuracy)?
For this particular instance, the string '12345.1' tells me what I need
to know, and a double-precision floating point value (e.g.,
12345.10000000000036379) would at least let me guess it (since
double-precision carries more than enough significant figures).  But a
single-precision value would leave me in a gray area.  That is, does
'12345.099923123' mean '12345.1' or '12345.10' or '12345.100', for
example?

> I wrote an email a few days ago showing how to translate in ruby
> the base64 arrays

I saw it, and it was quite useful to me.  Part of the reason I'm asking
these questions is that I noticed in your examples that the
base64-encoded values actually took more space than the original data.

Just to reiterate my main question: it looks like using base64 will make
mzData less usable and more complex, as compared to straight text.  What
benefits come with it that offset these drawbacks?

Mike
|
From: Akhilesh P. <pa...@jh...> - 2006-09-19 23:02:12
|
I agree with Mike about the human readable part and the size issues--I
insist in our lab that all files to be manipulated be 'scanned' before
'crunching.'  If there are no compelling reasons, I do not see why this
should not be reconsidered.

Akhilesh Pandey
|
From: Coleman, M. <MK...@St...> - 2006-09-20 18:17:27
|
> Angel Pizarro:
> I am cringing as I write this, since I really think you should not go
> this route, but look at the supplementary data tags.

I am cringing with you. :-)  Abusing the supplementary tags for this
purpose is definitely out--this is an even more unpleasant option than
going with a home-grown mzData extension.

> ah, yes, but most probably you have to either zcat the file or unzip
> it in order to read the floats, then zip the whole file back again
> once finished, a situation not unlike decoding byte arrays and base64
> strings....

Yes, having zipped files does imply having gzip/etc around, and you are
correct that this is in some ways similar.  A notable difference is that
zip tools are already ubiquitous, standard, reliable, and well
understood by users.  The scripts I'll have to write to decode mzData
won't be.

(Note, too, that it is not necessary to unzip and rezip in order to just
read a compressed file.  The 'zcat' program and its variations (there's
surely a ruby module, for instance) can read the file without disturbing
it.)

> I'll add to those arguments that we should look at the computational
> costs of un/zipping whole files as opposed to stream en/decoding
> individual mzData spectra.

I agree that zip'ing will have a greater cost than generating base64.  I
don't think the cost is great, and in any case, zip'ing isn't necessary
unless you're hurting for disk space.  Disk is cheap.  If I zip'ed these
files, it would be as much to get the checksumming as to save the disk
space.

> 1) it can handle encoding of integers, single and double precision
> float arrays without loss of information

As far as I know, a textual representation can also do this perfectly.

> 2) comparable compression with zipped plain text of the same precision

I agree that they're similar, within the bounds that I care about
(2-3x).

> 3) better performance with respect to accessing individual spectra vs.
> compressed plain text

If you mean that you can easily seek to a particular spectrum in a file
(presuming that some index is already present), I agree that this is
simpler and much faster.  As far as I know, seeking in a zip file isn't
really efficient.  If I thought I was going to need to do this, I'd want
to store the files uncompressed.  (As a practical matter, I can't think
of a reason we'd need to do this here.)

Mike
|
From: Coleman, M. <MK...@St...> - 2006-09-20 18:17:46
|
> Brian Pratt:
> Accuracy: Mass spec data in its raw form is generally stored in binary
> formats, since mass specs are front ended by binary computers.
> Conversion to and from base 10 human readable representations
> introduces error.  It's best to hold the data at its original
> precision and translate out to human readable format at whatever
> precision is deemed useful for eyeballing.

This is a complicated topic and I don't claim to be an expert by any
means.  Here's my understanding.

Error is present, and we want to avoid amplifying it.  If, for example,
the instrument has an internal IEEE FP value 1234.56789012345 and we
know that its precision is only +/- 0.1, then there's no particular
benefit (nor harm) to reporting this as anything beyond 1234.6 or
1234.57.  The 0.00089012345 is more or less noise.

As a practical matter, it might be more efficient to move the IEEE bits
directly from the instrument to the mzData file.  A cost of doing this,
though, is that this format is not human-readable.

An alternative would be to fully represent the IEEE bits as a number.
If I understand correctly, with properly implemented numeric I/O
routines (in libc), you can have a 1-to-1 mapping between the internal
and ASCII representations, so that it is possible to round trip without
introducing error.  This *would* make the textual representation larger,
and it's not clear that it really makes sense to do this, because of the
noise issue (above).

One additional note: we seem to be assuming that mass specs all already
do IEEE FP.  Is this actually true?

> File size: Sure, you can make files smaller by throwing away
> precision, but as you begin to desire higher precision base64 quickly
> becomes much more efficient.

Just to confirm, I agree that discarding *real* precision is
unacceptable.  (By "real", I mean what's being physically measured, not
bits that are an artifact of the IEEE representation.)

Mike
|
From: Brian P. <bri...@in...> - 2006-09-20 19:21:52
|
> If I understand correctly, with properly implemented numeric I/O
> routines (in libc), you can have a 1-1 mapping between the internal
> and ASCII representation, so that it is possible to round trip without
> introducing error.

Well, no, but this is something a lot of folks don't realize.  For (a
previously cited by Randy) example consider "0.1" - see
http://www.yoda.arachsys.com/csharp/floatingpoint.html for an
explanation.

> One additional note: We seem to be assuming that mass specs all
> already do IEEE FP.  Is this actually true?

AFAIK, yes.  That's a wheel that nobody has cared to reinvent for some
time now.

- Brian
|
From: Coleman, M. <MK...@St...> - 2006-09-20 18:17:56
|
Jimmy makes an excellent point: some textual representations would be
more useful than others.  As far as I know, whitespace is whitespace in
XML, so I would hope that producers of mzData files would choose
whitespace to enhance the readability of the file.

I wasn't really thinking it through, but the "ideal" representation I've
been assuming in my head would look something like this

    <peaklist>
      123.4    123
      125.3 123343
      127.4  23423
    </peaklist>

Obviously, as Brian points out, the form that splits the mz and
intensity lists isn't as friendly

    <mzArray> 123.4 125.3 127.4 </mzArray>
    <intenArray> 123 123343 23423 </intenArray>

I think someone mentioned something like this

    <peaklist>
      <peak><mz>123.4</mz><inten>123</inten></peak>
      <peak><mz>125.3</mz><inten>123343</inten></peak>
      <peak><mz>127.4</mz><inten>23423</inten></peak>
    </peaklist>

which is better than the second but not as good as the first above.
This seems more XML-ish than the above two formats (or the current
binary arrays), at the expense of being very verbose.  With judicious
addition of whitespace, this could be made more readable (at a further
cost in size)

    <peaklist>
      <peak><mz> 123.4 </mz><inten>    123 </inten></peak>
      <peak><mz> 125.3 </mz><inten> 123343 </inten></peak>
      <peak><mz> 127.4 </mz><inten>  23423 </inten></peak>
    </peaklist>

Personally I'd be quite happy with the first form (at the top of this
post).  It may be slightly lacking in XML purity, but it's very
readable, and it's clear what its semantics should be.  The only real
disadvantage I see is that it's a little different from the current
mzData scheme, which breaks mz and intensity into separate lists.

Mike
|
From: Coleman, M. <MK...@St...> - 2006-09-20 18:18:16
|
> Randy Julian:
> The XML-schema data types were tested by most of the vendors, who did
> not see the file size compression benefits you mention because they
> did not feel they had the ability to round either of the vectors in
> the way you suggest.

I'm not unsympathetic to this practical concern.  The most important
thing would be to allow the textual representation as an equal variant
(i.e., not buried in the supplemental data section).

I'm not sure I see why generating the textual representation would be
difficult, though.  My guess is that the vendors will continue to use
their own proprietary formats to do the initial recording of data, only
translating to mzData as a final step.  If the final step is carried out
on a platform with a real libc, this looks like it would be
straightforward.

> Just as a note for your comment #3, this is not so straightforward.
> If the instrument collects data using an Intel chip, floating-point
> raw data will most likely have an IEEE-754 representation.  So any
> time you have a number in a file like 0.1, the internal representation
> was originally different (0.1 cannot be exactly represented in
> IEEE-754).  When you read from the file into an IEEE standard format,
> it will not be 0.1 in any of the math you do.

I agree that this is complicated.  As far as the mzData standard goes,
probably the biggest thing that would help here would be a way for the
data producer to indicate, in the mzData file, their idea of the
accuracy of the measurements.  If I understand correctly, currently this
is implied, or communicated outside of the mzData file.

Please let me add that I think the mzData format is a great improvement
over the array of formats that it's meant to replace.  I'd like this
representation issue to be resolved in the best way possible, but it's
certainly minor in the overall scheme of things.

Mike
|
From: Coleman, M. <MK...@St...> - 2006-09-20 21:44:53
|
> Brian Pratt:
> > ...with properly implemented numeric I/O routines (in libc), you can
> > have a 1-1 mapping between the internal and ASCII representation, so
> > that it is possible to round trip without introducing error.
>
> Well, no, but this is something a lot of folks don't realize.
> For (a previously cited by Randy) example consider "0.1" - see
> http://www.yoda.arachsys.com/csharp/floatingpoint.html for an
> explanation.

I think we're talking about two different things.  As you say, 0.1 does
not have an exact IEEE 754 representation.  I'm talking about conversion
between decimal and IEEE 754.

Intuitively, for each IEEE 754 double, there is a set of decimal numbers
closer to it than to any other double.  Of that set, one will have the
shortest decimal representation, after all trailing zeros have been
truncated.  (There may be two, which is handled by round-to-even.)  This
representation can in turn be uniquely mapped back to the double.  I
think that something like this is specified by IEEE 754, but I can't
find an exact reference on the web.  Java specifies this:

    http://java.sun.com/j2se/1.4.2/docs/api/java/lang/Double.html#toString(double)

and here's a discussion that seems to reference it

    http://mail.python.org/pipermail/python-dev/2004-March/043742.html

I probably don't have the details exactly right, but I believe the basic
idea is correct.  The effect of this is that it is possible to use a
decimal representation without introducing any error.  My preference,
though, would still be to round away the noise digits.

Mike
|
From: Brian P. <bri...@in...> - 2006-09-20 22:00:24
|
Hi Michael,

Not sure I follow you... isn't "0.1" decimal?

BTW that
http://mail.python.org/pipermail/python-dev/2004-March/043742.html
discussion ends on this note:

  "> Remember that every binary floating-point number has an exact
  > decimal representation (though the reverse, of course, is not true).

  Yup."

So no, you can't always make the roundtrip without introducing error.
More importantly, you can't always read an ASCII decimal value and
compute with it without introducing error.  And, as I mentioned before,
that decimal->binary conversion isn't cheap.

- Brian
|
From: Mike C. <tu...@gm...> - 2006-09-24 02:25:52
|
[My two previous attempts to send this appear to have failed.  My
apologies if anyone is seeing multiple copies.  I also omitted the C
program mentioned below, in case that might be tripping a spam filter.
Drop me an email if you want a copy.  --Mike]

This is dry stuff, but I think it's important to see that IEEE 754
values can be transmitted in decimal form (via mzData) without any loss
of precision whatsoever.

Let me give a more concrete scenario.  In our case, assume we have an
IEEE 754 single-precision value in our instrument computer.  We want to
use mzData to transmit that value to another computer, so that
ultimately the latter computer will contain the identical
single-precision value.

One way to do this is the current method of capturing the 32-bit
representation and sending it across using the base64 encoding.

Another way to send the value is to send an ASCII representation of a
decimal number that will, upon being converted using strtof(3), result
in the identical single-precision value.  (That decimal number is *not*
typically mathematically equal to the single-precision value; it's just
closer to it than to any other single-precision value.)  This really
*is* a completely lossless representation.

There are different ways to generate these decimal numbers.  It is
sufficient (if not necessarily optimal) to simply use printf(3) with
sufficient precision (e.g., "%.8e").  This will work with
implementations that do correct rounding.  Linux (meaning GNU libc) has
done this correctly since at least nine years ago--I would assume the
vendors are doing it right, though this should be confirmed.

I'm including a small C program that demonstrates what I'm talking
about.  It does an exhaustive check for the single-precision case.  It
takes a couple of hours to complete, but if you're going to see an
error, it will probably occur pretty quickly.  (If you see any errors,
I'd like to know.)

This doesn't change the fact that 0.1 doesn't have an exact IEEE 754
representation.  That is a separate issue (and one that a base64
encoding does not address either).

As far as the cost of conversion, I agree that it is likely larger than
the cost of the base64 encoding.  I don't have the libraries at hand to
try it out, but I'm sure it would be detectable for large sets of
spectra.  That notwithstanding, we and everyone else who uses a format
like ms2 or dta are already paying this cost, and it doesn't seem
particularly onerous.  CPU cycles are pretty cheap--human cycles (that
transparency might save) are very dear.

Mike
|
From: Angel P. <an...@ma...> - 2006-09-24 15:11:05
|
I guess I am using "lossy" too loosely (say that 10 times fast).  I
meant that the conversion of a double- or single-precision value to the
significant figures, with respect to the limit of instrument detection,
that we all know and love to see in plain text formats is a lossy
translation.  I was not implying that going from byte strings to the
equivalent ASCII translation was a lossy operation.  Sorry for the
confusion.  But send me the C code and I will post it on the docstore.

-angel
|
From: Mike C. <tu...@gm...> - 2006-10-04 19:12:38
|
[This message seems to have been bounced by Sourceforge, so I'm
resending it.  I'm sorry to see that apparently they are having serious
email problems these days.  See today's Slashdot article at
http://it.slashdot.org/article.pl?sid=06/10/04/1324214.  (Apparently the
problem isn't limited to email coming from gmail accounts.)]

On 9/28/06, Mike Coleman <tu...@gm...> wrote:
> Makes sense.  To put it in other words, there are two questions here:
>
> 1. Are the values represented as base64-encoded bitstrings or as
> ASCII text?
>
> 2. Should the values be rounded to the precision of the instrument
> (probably plus a digit, etc.), or should an arbitrary number of
> figures be used?  Again, this isn't about losing information, as we're
> only discussing rounding away noise.
>
> These two questions are entirely orthogonal, as far as I can see, and
> it would be possible to allow both options for both questions, if this
> were seen as being worthwhile.  The one interaction is that if you use
> the ASCII text encoding, rounding the figures will make the mzData
> file smaller.
>
> Regarding ambiguity, the ASCII text representation would allow
> differing whitespace (which produces no semantic difference).  I guess
> the base64 encoding also allows differing surrounding whitespace.
>
> With respect to the base64 encoding, one corner case comes to mind.
> Are special IEEE values like NaN, the infinities, negative zero, etc.,
> allowed?  If so, what should the interpretation be?
>
> Mike
>
>
> The example code I mentioned:
>
> /* gcc -g -O2 -ffloat-store -o ieee-test ieee-test.c */
>
> /* strtof is GNU/C99 */
> #define _GNU_SOURCE
>
> #include <assert.h>
> #include <errno.h>
> #include <limits.h>
> #include <stdio.h>
> #include <stdlib.h>
>
>
> union bits {
>   unsigned int u;
>   float f;
> };
>
>
> int
> main() {
>   unsigned int i;
>   union bits x, x2;
>   int zeros_seen = 0;
>
>   assert(sizeof x.u == sizeof x.f);
>   assert((void *) &x.u == (void *) &x.f);
>
>   for (i=0; ; i++) {
>     char buf[128];
>
>     if (i == 0)
>       if (++zeros_seen > 1)
>         break;
>
> #if 0
>     if (!(i % 100000))
>       putc('.', stderr);
> #endif
>
>     x.u = i;
>     if (x.f != x.f)
>       continue;                     /* skip error values */
>
>     sprintf(buf, "%.8e", x.f);
>
>     errno = 0;
>     x2.f = strtof(buf, 0);
>     if (errno == ERANGE) {
>       printf("strtof error for %s\n", buf);
>       continue;
>     }
>
>     if (x2.u != x.u)
>       printf("bit difference for %s (%u != %u)\n", buf, x2.u, x.u);
>   }
>   return 0;
> }
|
From: Brian P. <bri...@in...> - 2006-09-19 23:31:56
|
When we developed the mzXML format we went through the same questions.
This is how I understood things:

Readability: We as developers are an unusual use case.  The more likely
use case for these formats is in visualization or automated processing,
neither of which requires direct eyeballing of the peak lists under
normal circumstances.  Or at least that's how we saw it.  If you really
do need to eyeball the peak lists, there are lots of tools available
that will do the translation for you.

Accuracy: Mass spec data in its raw form is generally stored in binary
formats, since mass specs are front ended by binary computers.
Conversion to and from base 10 human readable representations introduces
error.  It's best to hold the data at its original precision and
translate out to human readable format at whatever precision is deemed
useful for eyeballing.

File size: Sure, you can make files smaller by throwing away precision,
but as you begin to desire higher precision base64 quickly becomes much
more efficient.  An excellent way to reduce file size is to compress the
peak lists before base64'ing them, as is done in mzXML 3.0, and you do
not sacrifice precision.

Potential loss of precision information: That information wasn't ever
there, really.  Again, mass specs are front ended by binary computers,
so that base 10 precision information (does '12345.099923123' mean
'12345.1' or '12345.10' or '12345.100'?) wasn't ever in the datastream
in the first place.  The mass spec just wrote a bunch of 32- or 64-bit
binary numbers to the best of its (base 2) ability.  Looking at the bit
patterns would be more revealing of the precision, and base64 preserves
them.  As a developer, you should be pleased that you don't have to
wonder how many digits of that value are real and not just an artifact
of the base 2 to base 10 formatting conversion--with base64 binary
values you're working with the original raw data, so those artifacts
aren't an issue.
Hope this helps, Brian Pratt www.insilicos.com/IPP > -----Original Message----- > From: psi...@li... > [mailto:psi...@li...] On > Behalf Of Coleman, Michael > Sent: Tuesday, September 19, 2006 3:58 PM > To: Angel Pizarro; psi...@li... > Subject: Re: [Psidev-ms-dev] Why base64? > _______________________________________________ > Psidev-ms-dev mailing list > Psi...@li... > https://lists.sourceforge.net/lists/listinfo/psidev-ms-dev > |
From: Brian P. <bri...@in...> - 2006-09-19 23:56:30
|
Oh, and I forgot one extremely important thing: performance. It's expensive converting those base 10 representations back to base 2 for number crunching, visualization, etc. It's much cheaper to read them directly as binary, even with the overhead of base64 decoding. Brian Pratt www.insilicos.com/IPP |
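A rough Ruby sketch of the cost difference Brian is pointing at: parsing decimal text forces a base 10 to base 2 conversion per value, while base64 decode plus unpack is essentially a byte copy. (The data here is synthetic and timings will vary by machine.)

```ruby
require "base64"
require "benchmark"

n      = 100_000
floats = Array.new(n) { rand * 2000 }

as_text = floats.map(&:to_s).join(" ")               # decimal text form
as_b64  = Base64.strict_encode64(floats.pack("G*"))  # base64 binary form

Benchmark.bm(12) do |bm|
  # Each value needs a full decimal-to-binary conversion.
  bm.report("text parse") { as_text.split.map { |s| Float(s) } }
  # Decode + unpack just moves bit patterns around.
  bm.report("b64 unpack") { Base64.strict_decode64(as_b64).unpack("G*") }
end
```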
From: Steve S. <ste...@ni...> - 2006-09-20 11:54:17
|
All, I also have the concerns expressed by Michael - transparency is important to us, but precision even more so. We have long stored our data in ASCII to avoid the problem, even though some judgement is sometimes necessary. As we know, 1.0000 and 0.9999 are very different things; usually the former is really meant to be an integer. Also, abundances, being derived from ion counts, are 'naturally' integral, whereas m/z values are real - of course data systems need not conform to nature. I have dealt with MS formats where everything is, in effect, integral.

In our library, for example, we want the users to see the values that we put there, so we use ASCII. It would be very desirable for us if the same were offered in the XMLs - otherwise we will have to go non-standard.

Perhaps the ultimate answer is some way of associating uncertainty with values, but I suppose this is a long way off.

-Steve Stein

p.s. (this is NOT NIST speaking, just one of its employees). |
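The precision question running through this exchange is easy to see concretely: a decimal string like '12345.1' has no exact binary form, and how much of it survives depends on whether it is held as a 32-bit or 64-bit float. A small Ruby illustration:

```ruby
# '12345.1' has no exact binary representation; what survives a
# round trip depends on the float width used to store it.
d = 12345.1
as_single = [d].pack("g").unpack1("g")  # through a 32-bit big-endian float
as_double = [d].pack("G").unpack1("G")  # through a 64-bit big-endian float

format("%.10f", as_single)  # => "12345.0996093750"
format("%.10f", as_double)  # => "12345.1000000000"
```

This is why a single-precision value printed back to decimal can grow digits that were never measured, while the bit pattern itself is unambiguous.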
From: Angel P. <an...@ma...> - 2006-09-20 13:27:12
|
On Wednesday 20 September 2006 07:53, Steve Stein wrote: > All, > > I also have the concerns expressed by Michael - transparency is important > to us, but precision even more so. We have long stored our data in ASCII to > avoid the problem, even though some judgement is sometimes necessary. As we > know 1.0000 and 0.9999 are very different things, usually the former is > really meant to be an integer. Also, abundances, since derived from ion > counts, are 'naturally' integral, as m/z values are real - of course data > systems need not conform to nature. I have dealt with MS formats where > everything is, in effect, integral. > > In our library, for example, we want the users to see the values that we > put there, so we use ASCII. It would be very desirable for us if the same > were offered in the XML's - otherwise we will have to go non-standard. > > Perhaps the ultimate answer is some way of associating uncertainty with > values, but I suppose this is a long way off. > Hmmm...... well the XML schema base64binary type can encode integer arrays, but in mzData 1.05 we have defined the arrays as floats in the specification, but not the schema, hence this is not actually enforced. One could encode of the intenBinaryArray data as ints, but it would still be a non-standard usage. It would be better to supply the integer intensity in the supDataBInarrayArray and describe the array in the supDataDesc tag. So what I am getting at is that your use case is handled by mzData, but it the consumer of the data would have to know that to use the supplementary data arrays as the intensity values. Note that you would still have to specify the intensity values in the intenArrayBinary as floats, since this is a requirement of the schema. angel > -Steve Stein > > p.s. (this is NOT NIST speaking, just one of its employees). > > At 9/19/2006 07:56 PM Tuesday, Brian Pratt wrote: > >Oh, and I forgot one extremely important thing: performance. 
It's > >expensive converting those base 10 representations back to base 2 > >for number crunching, visualization etc. It's much cheaper to read them > >directly as binary, even with the overhead of base64 > >decoding. > > > >Brian Pratt > >www.insilicos.com/IPP > > > > > -----Original Message----- > > > From: psi...@li... > > > [mailto:psi...@li...] On > > > Behalf Of Brian Pratt > > > Sent: Tuesday, September 19, 2006 4:31 PM > > > To: psi...@li... > > > Subject: Re: [Psidev-ms-dev] Why base64? > > > > > > > > > When we developed the mzXML format we went through the same > > > questions. This is how I understood things: > > > > > > Readability: We as developers are an unusual use case. The > > > more likely use case for these formats is in visualization or > > > automated > > > processing, neither of which require direct eyeballing of the > > > peak lists under normal circumstances. Or at least that's how we saw > > > it. If you do really need to eyeball the peak lists there > > > are lots of tools available that will do the translation for you. > > > > > > Accuracy: Mass spec data in its raw form is generally stored > > > in binary formats, since mass specs are front ended by binary > > > computers. Conversion to and from base 10 human readable > > > representations introduces error. It's best to hold the data at its > > > original precision and translate out to human readable format > > > at whatever precision is deemed useful for eyeballing. > > > > > > File size: Sure, you can make files smaller by throwing away > > > precision, but as you begin to desire higher precision base64 quickly > > > becomes much more efficient. An excellent way to reduce file > > > size is to compress the peaklists before base64'ing them, as is done > > > in mzXML 3.0, and you do not sacrifice precision. > > > > > > Potential loss of precision information: That information > > > wasn't ever there, really. 
Again, mass specs are front ended > > > by binary > > > computers, so that base 10 precision information (does > > > '12345.099923123' mean '12345.1' or '12345.10' > > > or'12345.100'?) wasn't ever in > > > the datastream in the first place. The mass spec just wrote > > > a bunch of 32 or 64 bit binary numbers to the best of its (base 2) > > > ability. Looking at the bit patterns would be more revealing > > > of the precision, and base64 preserves them. As a developer, you > > > should be pleased that you don't have to wonder how many > > > digits of that value are for real and not just an artifact of > > > the base 2 to > > > base 10 formatting conversion - with base64 binary values > > > you're working with the original raw data, so those artifacts > > > aren't an > > > issue. > > > > > > Hope this helps, > > > > > > Brian Pratt > > > www.insilicos.com/IPP > > > > > > > -----Original Message----- > > > > From: psi...@li... > > > > [mailto:psi...@li...] On > > > > Behalf Of Coleman, Michael > > > > Sent: Tuesday, September 19, 2006 3:58 PM > > > > To: Angel Pizarro; psi...@li... > > > > Subject: Re: [Psidev-ms-dev] Why base64? > > > > > > > > > From: Angel Pizarro > > > > > > > > > > > 1. Loss of readability. ... > > > > > > > > > > There actually is a space for "human readable spectra" in the > > > > > mzData format, > > > > > > > > I'm glad to hear that. I looked for this, but I did not > > > > > > see it in the > > > > > > > spec here > > > > > > http://psidev.sourceforge.net/ms/xml/mzdata/mzdata.html#element_mzData > > > > > > > I was looking for something like a 'mzArray' and 'intenArray' tags, > > > > which would be the textual alternatives to 'mzArrayBinary' and > > > > 'intenArrayBinary'. Can you point me to an example? > > > > > > > > > but really who reads individual mz and intensity values? > > > > > > > > Well--I do. 
As a programmer I don't think it's an > > > > > > exaggeration to say > > > > > > > that I'm looking at the peak lists in our ms2 files every > > > > > > day. I find > > > > > > > being able to see at a glance that the peaks are basically sane, and > > > > their gross attributes (precision, count, etc.) very useful. > > > > > > > > Of course, as a programmer I can easy whip up a script to > > > > > > decode this > > > > > > > file format. I suspect most users would be stymied, though, > > > > and I think > > > > that that would be unfortunate. Since these files are part > > > > > > of a chain > > > > > > > of scientific argument, I think that as much as possible > > > > > > they ought to > > > > > > > be transparent and as open as possible to verification by > > > > eyeball (mine > > > > and those of our scientists) and alternative pieces of software. > > > > > > > > I'm not saying that this transparency is an absolute good. > > > > > > Perhaps it > > > > > > > is worth impairing so that we can have X, Y, and Z, which are > > > > considered > > > > more valuable. I'm not seeing what X, Y, and Z are, though. > > > > > > > > > > 2. Increased file size. ... > > > > > > > > > > Not a fair comparison. Most of the space in an mzData file is > > > > > actually taken up by the human-readable parameters and parameter > > > > > values of the spectra. > > > > > > > > Sorry, I should have been clearer. The numbers I gave were > > > > just for the > > > > peak lists (base64 vs text) and nothing else--no tags, no other > > > > metadata. The rest of the mzData fields would add more > > > > overhead, but I > > > > have no objection about that part. > > > > > > > > If we implemented mzData here today, our files would be bigger if we > > > > used the base64 encoding than if we used the textual > > > > > > numbers (as they > > > > > > > are in our ms2 files). > > > > > > > > > > 3. Potential loss of precision information. ... 
-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys -- and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
_______________________________________________
Psidev-ms-dev mailing list
Psi...@li...
https://lists.sourceforge.net/lists/listinfo/psidev-ms-dev

--
Angel Pizarro
Director, Bioinformatics Facility
Institute for Translational Medicine and Therapeutics
University of Pennsylvania
806 BRB II/III
421 Curie Blvd.
Philadelphia, PA 19104-6160
P: 215-573-3736
F: 215-573-9004
E: an...@ma...
|
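[Editor's aside: the single- vs double-precision round-trip question Mike raises above is easy to check directly. A minimal Ruby sketch, using illustrative values only; `pack`'s `'e'` directive packs a little-endian single-precision float and `'E'` a little-endian double:]

```ruby
# Round-trip 12345.1 through single- and double-precision binary packing.
# Precision, not byte order, is what matters for this comparison.
double_rt = [12345.1].pack('E').unpack('E').first
single_rt = [12345.1].pack('e').unpack('e').first

puts double_rt  # survives the round trip exactly: 12345.1
puts single_rt  # comes back as the nearest 32-bit float, ~12345.0996
```

So a double can carry '12345.1' losslessly, while a single leaves the reader guessing which decimal string was intended, which is exactly the gray area described above.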
From: Angel P. <an...@ma...> - 2006-09-20 13:18:38
|
On Tuesday 19 September 2006 18:58, Coleman, Michael wrote:
> I'm glad to hear that. I looked for this, but I did not see it in the
> spec here
>
> http://psidev.sourceforge.net/ms/xml/mzdata/mzdata.html#element_mzData

I am cringing as I write this, since I really think you should not go this route, but look at the supplementary data tags.

> Well--I do. As a programmer I don't think it's an exaggeration to say
> that I'm looking at the peak lists in our ms2 files every day. I find
> being able to see at a glance that the peaks are basically sane, and
> their gross attributes (precision, count, etc.) very useful.

Ah, yes, but most probably you have to either zcat the file or unzip it in order to read the floats, then zip the whole file back again once finished -- a situation not unlike decoding byte arrays and base64 strings.

> Of course, as a programmer I can easily whip up a script to decode this
> file format. I suspect most users would be stymied, though, and I think
> that that would be unfortunate. Since these files are part of a chain
> of scientific argument, I think that as much as possible they ought to
> be transparent and as open as possible to verification by eyeball (mine
> and those of our scientists) and alternative pieces of software.

This is really where mzData has failed the end user, namely in the set of tools that support it. Even basic marshal/unmarshal scripts are lacking. The "specify it and they will come" model hasn't panned out for us, sadly, so I am starting a development cycle here at UPenn to address these needs: specifically, a reasonably fast Ruby framework for dealing with mzData (akin to some aspects of the TPP), starting from some code written by John Prince at UTexas, called mspire.

> Sorry, I should have been clearer. The numbers I gave were just for the
> peak lists (base64 vs text) and nothing else--no tags, no other
> metadata.
> The rest of the mzData fields would add more overhead, but I have no
> objection about that part.
>
> If we implemented mzData here today, our files would be bigger if we
> used the base64 encoding than if we used the textual numbers (as they
> are in our ms2 files).

Point taken. See Brian Pratt's responses as to why base64 is the way both mzData and mzXML are going (irrespective of the planned merge of the formats). I'll add to those arguments that we should weigh the computational cost of un/zipping whole files against that of stream en/decoding individual mzData spectra.

> > 3. Potential loss of precision information. ...

Brian Pratt addressed these issues much more eloquently than I did in his reply.

> Just to reiterate my main question, it looks like using base64 will make
> mzData less usable and more complex, as compared to straight text. What
> benefits come with it that offset these drawbacks?

1) It can encode integer, single-precision, and double-precision float arrays without loss of information.
2) It gives compression comparable to zipped plain text of the same precision.
3) It allows faster access to individual spectra than compressed plain text does.

-angel
|
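[Editor's aside: the en/decoding discussed in this thread is only a few lines of Ruby. A minimal sketch; the peak values are made up, and big-endian 64-bit doubles are assumed, matching mzData's network byte order:]

```ruby
require 'base64'

# Hypothetical peak m/z values, stored as 64-bit floats.
mz = [445.120025, 567.301181, 1024.455398]

# Encode: pack to big-endian ("network" order) doubles with 'G*',
# then base64 the resulting byte string.
encoded = Base64.encode64(mz.pack('G*'))

# Decode: reverse the two steps; the round trip is bit-exact,
# which is benefit 1) above -- no loss of information.
decoded = Base64.decode64(encoded).unpack('G*')

puts decoded == mz  # => true
```

Note that base64 inflates the binary by 4/3, which is why a short peak list can come out larger than its textual form; the compression and random-access arguments apply to realistic spectrum sizes.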