From: Marc S. <st...@in...> - 2009-06-12 15:04:32
|
Hi all, I think the way we store the data is widely accepted and we should not change it. If you want your format to be human-readable, you can store the m/z values as comments or use text files. Another possibility with mzML is to annotate each peak with a string containing the ASCII representation of the m/z value. It's not human-readable because it is Base64 encoded, perhaps even zipped, but you can store the information like that if you want to. Best, Marc > Matt, > > Resolution depends on instrument, tuning and settings - I don't know the current state of reporting such information (or its reliability) in current instruments. > > We have long held all of our data in ASCII form (not just MS) - if you want flexibility and accuracy, this is the only path without inventing a new data structure. Error limits and annotation can be added as we like (peak labeling, for example). > > We will consider using comments - but I suspect no one will know they are there but us. > > Note that our focus is quite different from others - we are dealing with data that we have processed, perhaps heavily. I still ask for an optional ASCII data representation for reference data. > > -Steve > > -----Original Message----- > From: Matt Chambers [mailto:mat...@va...] > Sent: Friday, June 12, 2009 9:22 AM > To: Mass spectrometry standard development > Subject: Re: [Psidev-ms-dev] PSI-MSS WG Tuesday call reminder > > Now this I can agree with, especially with ppm representation when > appropriate. But doesn't the instrument's mass resolution and related CV > terms convey this information? And if someone doesn't write those at all > or can't write them in a machine-readable numeric representation, it > seems unlikely they will have done a proper job of rounding m/z values. > This is kind of the reason I was opposed to using strings to represent > mass resolution, but I was overruled. Perhaps we should revisit that? 
It > makes sense to me because it's a less redundant placement of this > precision information. > > Steve, do you agree with using XML comments to actually show > human-readable peak lists in the mzML? That seems like an orthogonal > issue to the precision one. > > -Matt > > > Stein, Stephen E. Dr. wrote: > >> that would be a nice addition - also allow ppm representation - more complex precision representations can be delayed for future versions. >> >> -----Original Message----- >> From: Fredrik Levander [mailto:Fre...@im...] >> Sent: Friday, June 12, 2009 8:28 AM >> To: Mass spectrometry standard development >> Subject: Re: [Psidev-ms-dev] PSI-MSS WG Tuesday call reminder >> >> Wouldn't it make sense to add an optional CV term for the number of >> significant digits in a binary array? This way it would be easy to get >> back to the ASCII representation if a peak list with x number of >> decimals was converted to mzML. It might not be so useful for conversion >> of raw data, but if a peak list have been rounded to a certain number of >> decimals, that's information which shouldn't been thrown away when >> converting to mzML. The info could also be used for a viewer to show the >> right number of decimals. >> >> Fredrik >> >> Pierre-Alain Binz wrote: >> >> >>> One question to Steve and others. >>> reading mzML, as well as any othe files, has to be done with an >>> editor, being a simple text editor or a more elaborated viewer. >>> >>> Would a more elaborated XML viewer/editor that knows how to read >>> binary data and round it if needed not be an ideal "straight" reader >>> of mzML instead of using a more plain text viewer? >>> I know and myself also like to "call back" values with a defined >>> number of digits, as they were entered. And it's up to the software >>> design to "not interpret" what I have entered. 
But today, it's >>> relatively easy to get a XML reader that could "translate" the binary >>> arrays in a "mz Intensity" two column format with appropriate rounding >>> if necessary, so that it looks exactly as if it was an ascii table >>> (don't forget that in mzML the mz and intensity arrays are separate >>> and anyway have to be interpreted to look like a 2 column ascii table. >>> If the answer is OK, then we could stay with binary format, taking >>> care of the "precision issue" via the graphical view, and be therefore >>> compatible with the ascii precision. >>> >>> This sounds like a way to bring the technical question to a more >>> phylosophical, "ergonomic" one, but probably worth at that stage. >>> >>> Pierre-Alain >>> >>> Matthew Chambers wrote: >>> >>> >>>> No measurements I'm aware of in proteomic mass spec use more than 15 >>>> base 10 digits, which is the number of digits that double precision >>>> floats can represent without precision loss. That means that even if a >>>> value goes in as 1.5 (which can't be represented exactly), then as long >>>> as we round to the 15th digit we don't lose precision. As others have >>>> said, we can thus "round-trip" 15 digits. We get this high degree of >>>> fidelity to the source data without all the assumptions involved with >>>> the ASCII representation: I use doubles consistently then I'm always >>>> providing 15 significant digits. And if we did need more than 15, then >>>> ASCII is still a very inefficient encoding. You'd want to use arbitrary >>>> precision fixed or floating point binary types, which can't be computed >>>> on very easily or efficiently, but they are the Right Way to achieve >>>> arbitrary precision (i.e. no unspecified assumptions, well defined byte >>>> width, fast parsing). >>>> >>>> So in fact, you can preserve this "poor person's" significant digits >>>> encoding: if the software is doing its job, then it will go out the same >>>> way it came in! 
The real nastiness with floating point is when the >>>> precision loss accumulates every time an arithmetic operation happens on >>>> a cumulative sum or product. >>>> >>>> -Matt >>>> >>>> >>>> Stein, Stephen E. Dr. wrote: >>>> >>>> >>>> >>>>> Yes, that is what I had in mind - you get drilled in that when you take a lab course in Chemistry or Physics (maybe it has been dropped in recent years). It is a poor person's way of providing error limits (the lowest significant figure contains the precision of measurement). >>>>> >>>>> It is true that if only affects 10% of values, but that's enough for me to be concerned. I suppose we could put ASCII in a comment field, but physical quantities do have precisions, and stuffing measured values in those floating formats loses some of it. >>>>> >>>>> Sorry to say, this problem generally affects binary representations of measured values - one reason why I have liked the ASCII nature of XML - and hate to lose it. >>>>> >>>>> -Steve >>>>> >>>>> -----Original Message----- >>>>> From: Mike Coleman [mailto:tu...@gm...] >>>>> Sent: Thursday, June 11, 2009 4:41 PM >>>>> To: Mass spectrometry standard development >>>>> Subject: Re: [Psidev-ms-dev] PSI-MSS WG Tuesday call reminder >>>>> >>>>> I took it to mean that with "1", "1.5", "1.50", one gets an implied >>>>> level of precision. That is, "1.5" is generally understood to mean >>>>> 1.5 +/- 0.05. If I give you the IEEE float 1.5, much less is implied >>>>> about the precision of this value, unless it's explicitly stated >>>>> elsewhere. (If you have a whole set of these, then you probably can >>>>> work out the equivalent precision, but this is a bit of a stretch.) >>>>> >>>>> Mike >>>>> >>>>> >>>>> On Thu, Jun 11, 2009 at 3:23 PM, Angel Pizarro<an...@ma...> wrote: >>>>> >>>>> >>>>> >>>>> >>>>>> Is your question whether we can successfully round-trip the numbers? Eg. go >>>>>> from an ascii format to mzML back to originating ascii format and get the >>>>>> same exact numbers? 
I believe that when we pack the numbers and unpack them >>>>>> (at least in my non-validating ruby implementations) the numbers and >>>>>> significance are completely the same. E.g. 1.005 === 1.005 and not >>>>>> 1.005000000000001 >>>>>> -angel >>>>>> >>>>>> >>>>>> > > > ------------------------------------------------------------------------------ > Crystal Reports - New Free Runtime and 30 Day Trial > Check out the new simplified licensing option that enables unlimited > royalty-free distribution of the report engine for externally facing > server and web deployment. > http://p.sf.net/sfu/businessobjects > _______________________________________________ > Psidev-ms-dev mailing list > Psi...@li... > https://lists.sourceforge.net/lists/listinfo/psidev-ms-dev > > ------------------------------------------------------------------------------ > Crystal Reports - New Free Runtime and 30 Day Trial > Check out the new simplified licensing option that enables unlimited > royalty-free distribution of the report engine for externally facing > server and web deployment. > http://p.sf.net/sfu/businessobjects > _______________________________________________ > Psidev-ms-dev mailing list > Psi...@li... > https://lists.sourceforge.net/lists/listinfo/psidev-ms-dev > |
From: Matthew C. <mat...@va...> - 2009-06-12 15:07:30
|
Stein, Stephen E. Dr. wrote: > Matt, > > Resolution depends on instrument, tuning and settings - I don't know the current state of reporting such information (or its reliability) in current instruments. > Right, that's my understanding. But without knowing this information, rounding m/z values in ASCII or binary is dangerously lossy. > We have long held all of our data in ASCII form (not just MS) - if you want flexibility and accuracy, this is the only path without inventing a new data structure. Error limits and annotation can be added as we like (peak labeling, for example). > This is what I was refuting below. Assuming 15 or fewer base10 digits are needed, a double precision float is a better representation than ASCII in every way except human readability. Do you have examples of reference data that uses more than 15 digits in ASCII? Peak annotations can be added to both the XML comments and in a non-standard data array for each spectrum: null terminated strings are a new binary data type we agreed to support after this week's conference call. > We will consider using comments - but I suspect no one will know they are there but us. > And how is that different than no one knowing whether an optional ASCII data representation is in the file? I guarantee you that the XML comment will be more human readable than the ASCII representations that have been proposed so far. And unless you can demonstrate that you need more than 15 digits of precision in your data, human readability is the only reason for ASCII representation. -Matt > Note that our focus is quite different from others - we are dealing with data that we have processed, perhaps heavily. I still ask for an optional ASCII data representation for reference data. > > -Steve > > -----Original Message----- > From: Matt Chambers [mailto:mat...@va...] 
> Sent: Friday, June 12, 2009 9:22 AM > To: Mass spectrometry standard development > Subject: Re: [Psidev-ms-dev] PSI-MSS WG Tuesday call reminder > > Now this I can agree with, especially with ppm representation when > appropriate. But doesn't the instrument's mass resolution and related CV > terms convey this information? And if someone doesn't write those at all > or can't write them in a machine-readable numeric representation, it > seems unlikely they will have done a proper job of rounding m/z values. > This is kind of the reason I was opposed to using strings to represent > mass resolution, but I was overruled. Perhaps we should revisit that? It > makes sense to me because it's a less redundant placement of this > precision information. > > Steve, do you agree with using XML comments to actually show > human-readable peak lists in the mzML? That seems like an orthogonal > issue to the precision one. > > -Matt > > > Stein, Stephen E. Dr. wrote: > >> that would be a nice addition - also allow ppm representation - more complex precision representations can be delayed for future versions. >> >> -----Original Message----- >> From: Fredrik Levander [mailto:Fre...@im...] >> Sent: Friday, June 12, 2009 8:28 AM >> To: Mass spectrometry standard development >> Subject: Re: [Psidev-ms-dev] PSI-MSS WG Tuesday call reminder >> >> Wouldn't it make sense to add an optional CV term for the number of >> significant digits in a binary array? This way it would be easy to get >> back to the ASCII representation if a peak list with x number of >> decimals was converted to mzML. It might not be so useful for conversion >> of raw data, but if a peak list have been rounded to a certain number of >> decimals, that's information which shouldn't been thrown away when >> converting to mzML. The info could also be used for a viewer to show the >> right number of decimals. >> >> Fredrik >> >> Pierre-Alain Binz wrote: >> >> >>> One question to Steve and others. 
>>> reading mzML, as well as any othe files, has to be done with an >>> editor, being a simple text editor or a more elaborated viewer. >>> >>> Would a more elaborated XML viewer/editor that knows how to read >>> binary data and round it if needed not be an ideal "straight" reader >>> of mzML instead of using a more plain text viewer? >>> I know and myself also like to "call back" values with a defined >>> number of digits, as they were entered. And it's up to the software >>> design to "not interpret" what I have entered. But today, it's >>> relatively easy to get a XML reader that could "translate" the binary >>> arrays in a "mz Intensity" two column format with appropriate rounding >>> if necessary, so that it looks exactly as if it was an ascii table >>> (don't forget that in mzML the mz and intensity arrays are separate >>> and anyway have to be interpreted to look like a 2 column ascii table. >>> If the answer is OK, then we could stay with binary format, taking >>> care of the "precision issue" via the graphical view, and be therefore >>> compatible with the ascii precision. >>> >>> This sounds like a way to bring the technical question to a more >>> phylosophical, "ergonomic" one, but probably worth at that stage. >>> >>> Pierre-Alain >>> >>> Matthew Chambers wrote: >>> >>> >>>> No measurements I'm aware of in proteomic mass spec use more than 15 >>>> base 10 digits, which is the number of digits that double precision >>>> floats can represent without precision loss. That means that even if a >>>> value goes in as 1.5 (which can't be represented exactly), then as long >>>> as we round to the 15th digit we don't lose precision. As others have >>>> said, we can thus "round-trip" 15 digits. We get this high degree of >>>> fidelity to the source data without all the assumptions involved with >>>> the ASCII representation: I use doubles consistently then I'm always >>>> providing 15 significant digits. 
And if we did need more than 15, then >>>> ASCII is still a very inefficient encoding. You'd want to use arbitrary >>>> precision fixed or floating point binary types, which can't be computed >>>> on very easily or efficiently, but they are the Right Way to achieve >>>> arbitrary precision (i.e. no unspecified assumptions, well defined byte >>>> width, fast parsing). >>>> >>>> So in fact, you can preserve this "poor person's" significant digits >>>> encoding: if the software is doing its job, then it will go out the same >>>> way it came in! The real nastiness with floating point is when the >>>> precision loss accumulates every time an arithmetic operation happens on >>>> a cumulative sum or product. >>>> >>>> -Matt >>>> >>>> >>>> Stein, Stephen E. Dr. wrote: >>>> >>>> >>>> >>>>> Yes, that is what I had in mind - you get drilled in that when you take a lab course in Chemistry or Physics (maybe it has been dropped in recent years). It is a poor person's way of providing error limits (the lowest significant figure contains the precision of measurement). >>>>> >>>>> It is true that if only affects 10% of values, but that's enough for me to be concerned. I suppose we could put ASCII in a comment field, but physical quantities do have precisions, and stuffing measured values in those floating formats loses some of it. >>>>> >>>>> Sorry to say, this problem generally affects binary representations of measured values - one reason why I have liked the ASCII nature of XML - and hate to lose it. >>>>> >>>>> -Steve >>>>> >>>>> -----Original Message----- >>>>> From: Mike Coleman [mailto:tu...@gm...] >>>>> Sent: Thursday, June 11, 2009 4:41 PM >>>>> To: Mass spectrometry standard development >>>>> Subject: Re: [Psidev-ms-dev] PSI-MSS WG Tuesday call reminder >>>>> >>>>> I took it to mean that with "1", "1.5", "1.50", one gets an implied >>>>> level of precision. That is, "1.5" is generally understood to mean >>>>> 1.5 +/- 0.05. 
>>>>> If I give you the IEEE float 1.5, much less is implied >>>>> about the precision of this value, unless it's explicitly stated >>>>> elsewhere. (If you have a whole set of these, then you probably can >>>>> work out the equivalent precision, but this is a bit of a stretch.) >>>>> >>>>> Mike >>>>> >>>>> >>>>> On Thu, Jun 11, 2009 at 3:23 PM, Angel Pizarro<an...@ma...> wrote: >>>>> >>>>> >>>>> >>>>> >>>>>> Is your question whether we can successfully round-trip the numbers? Eg. go >>>>>> from an ascii format to mzML back to originating ascii format and get the >>>>>> same exact numbers? I believe that when we pack the numbers and unpack them >>>>>> (at least in my non-validating ruby implementations) the numbers and >>>>>> significance are completely the same. E.g. 1.005 === 1.005 and not >>>>>> 1.005000000000001 >>>>>> -angel |
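For anyone who wants to check the round-trip claim by hand, here is a minimal Python sketch. It assumes the values are packed as little-endian IEEE 754 floats and formatted back with 15 significant digits; the function names and sample values are made up for illustration. Note that 1.5 happens to be exactly representable in binary, whereas 1.005 is not, and that the trailing zero of "1.50" is not recoverable from the double alone.

```python
import struct


def roundtrip_double(text: str) -> str:
    """Parse a decimal string, store it as a 64-bit float, and print it back."""
    packed = struct.pack("<d", float(text))   # 8 bytes, little-endian double
    (value,) = struct.unpack("<d", packed)
    return f"{value:.15g}"                    # 15 significant digits, zeros stripped


def roundtrip_single(text: str) -> str:
    """Same as above, but through a 32-bit float."""
    packed = struct.pack("<f", float(text))   # 4 bytes, little-endian single
    (value,) = struct.unpack("<f", packed)
    return f"{value:.15g}"


if __name__ == "__main__":
    # Hypothetical sample values, not taken from any real spectrum.
    for text in ["1.005", "445.120025", "1.5", "1.50"]:
        print(text, "->", roundtrip_double(text), "|", roundtrip_single(text))
```

The double column reproduces "1.005" unchanged, while the single precision column does not. The implied precision of "1.50" would have to be carried separately, for example by a CV term as Fredrik suggested.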
From: Coleman, M. <MK...@st...> - 2009-06-29 17:18:41
|
I've been on vacation, so this is a bit late. Comments below. On 6/12/09 10:05 AM, "Matthew Chambers" <mat...@va...> wrote: > This is what I was refuting below. Assuming 15 or fewer base10 digits > are needed, a double precision float is a better representation than > ASCII in every way except human readability. Do you have examples of > reference data that uses more than 15 digits in ASCII? For what it's worth, in greylag, the mass used for O (Oxygen) has 12 decimal digits to the right of the decimal point. (This value comes from NIST, and is meant to be as precise as possible.) Since peptides/proteins have masses of at least 1000 Da, this means that at least 16-17 significant digits would be needed to fully represent these calculations. One might dispute whether or not this level of precision is useful, but since you asked, there's an example. > And unless you can demonstrate that you need more > than 15 digits of precision in your data, human readability is the only > reason for ASCII representation. I would argue that the possibility of writing trivial programs that read peak data is also a reason, perhaps a more important one. Having the peaks encoded does make it a bit harder to jump in and start doing something with them. Mike |
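To make the 16-digit case concrete, the following Python sketch counts how many distinct doubles a run of ten neighbouring 12-decimal values maps to. The masses are invented for illustration and the helper names are hypothetical. The point is only that the spacing between adjacent doubles grows with magnitude, so whether 12 decimals survive depends on how large the mass is.

```python
def neighbours(integer_part: int) -> list[str]:
    """Ten decimal strings that differ only in the 12th decimal place."""
    return [f"{integer_part}.{d:012d}" for d in range(10)]


def distinct_doubles(texts: list[str]) -> int:
    """Number of different 64-bit floats the strings parse to."""
    return len({float(t) for t in texts})


if __name__ == "__main__":
    # Hypothetical masses in Da; 4 integer digits + 12 decimals = 16 digits.
    for mass in (1000, 9000):
        texts = neighbours(mass)
        print(mass, "Da:", distinct_doubles(texts), "distinct doubles for",
              len(texts), "distinct strings")
```

Around 1000 Da all ten strings stay distinct, but around 9000 Da several of them collapse onto the same double, which is where a 16-digit value stops being recoverable.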
From: Matthew C. <mat...@va...> - 2009-06-30 23:12:37
|
Hi Mike, Are you using long doubles in greylag? The reasonable fix if more than 15 digits are truly needed is to use a bigger data type, although a standard and portable long double does not exist AFAIK. If one wanted to write trivial code to read XML, it would probably be a simple token parsing approach, in which case reading the XML comments I proposed earlier is even easier than reading some cooked up ASCII notation. And remember that the ASCII notation, whether in standard form or in XML comments, would necessarily be an optional representation. I shudder with glee at the thought of how much fun the optional "standard form" would be to deal with! ;) A DOM approach is conceivable, but unlikely to be scalable and in any case if you've got the facilities to read XML with a DOM then you almost certainly have access to base64 decoding or can get it easily. -Matt Coleman, Michael wrote: > I've been on vacation, so this is a bit late. Comments below. > > On 6/12/09 10:05 AM, "Matthew Chambers" <mat...@va...> > wrote: > >> This is what I was refuting below. Assuming 15 or fewer base10 digits >> are needed, a double precision float is a better representation than >> ASCII in every way except human readability. Do you have examples of >> reference data that uses more than 15 digits in ASCII? >> > > For what it's worth, in greylag, the mass used for O (Oxygen) has 12 decimal > digits to the right of the decimal point. (This value comes from NIST, and > is meant to be as precise as possible.) Since peptides/proteins have masses > of at least 1000 Da, this means that at least 16-17 significant digits would > be needed to fully represent these calculations. > > One might dispute whether or not this level of precision is useful, but > since you asked, there's an example. > > >> And unless you can demonstrate that you need more >> than 15 digits of precision in your data, human readability is the only >> reason for ASCII representation. 
>> > > I would argue that the possibility of writing trivial programs that read > peak data is also a reason, perhaps a more important one. Having the peaks > encoded does make it a bit harder to jump in and start doing something with > them. > > Mike |
From: Mike C. <tu...@gm...> - 2009-07-01 02:18:00
|
On Tue, Jun 30, 2009 at 6:11 PM, Matthew Chambers<mat...@va...> wrote: > Hi Mike, > > Are you using long doubles in greylag? The reasonable fix if more than > 15 digits are truly needed is to use a bigger data type, although a > standard and portable long double does not exist AFAIK. No, long doubles seem like overkill, at least at present. I gave the example only for informational purposes. It does appear to me that single precision floats are too small for some of the calculations required for recent instruments. Maybe they're enough for the peak representation itself--I'm not sure. > If one wanted to write trivial code to read XML... For the languages I use, access to XML parsing, base64 coding, and libz aren't really serious issues, but it is a little more involved than what I do with our current format (old ms2, which is basically one peak per line, represented as two floats in ASCII format), which makes it simple to do several basic transformations using standard Unix command-line tools. It's tempting to concentrate on the size of spectrum files as a metric, but the amount of programmer time it takes to do things probably matters more at my shop. |
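On the question of whether single precision is enough for the peak representation itself, a rough Python sketch of the storage error in ppm may help. It only measures the rounding error of storing one m/z value as a 32-bit float, not the error that accumulates in later calculations; the m/z values and the function name are made up for illustration.

```python
import struct


def float32_error_ppm(mz: float) -> float:
    """Relative error, in ppm, introduced by storing mz as a 32-bit float."""
    (stored,) = struct.unpack("<f", struct.pack("<f", mz))
    return abs(stored - mz) / mz * 1e6


if __name__ == "__main__":
    # Hypothetical m/z values.
    for mz in (445.120025, 1221.990637, 1999.999999):
        print(f"{mz:>12.6f}  {float32_error_ppm(mz):.4f} ppm")
```

With a 24-bit significand the relative rounding error is bounded by 2^-24, or about 0.06 ppm, so storage alone stays well below a 1 ppm mass accuracy. Sums and differences computed in single precision are a separate matter.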
From: Steffen N. <sne...@ip...> - 2009-07-01 07:42:20
|
On Mon, 2009-06-29 at 12:18 -0500, Coleman, Michael wrote: ... > I would argue that the possibility of writing trivial programs that read > peak data is also a reason, perhaps a more important one. Having the peaks > encoded does make it a bit harder to jump in and start doing something with > them. I'd say that was true before the web and Open Source stuff was all over the place. Googling for "decrypt base64 <yourfavoritelanguage>" will almost always yield a 1-3 liner you can get inspiration from, including zipping the data. Funny, one of the top-ranked results for PHP gives you an excerpt from a malware script: http://justin.madirish.net/node/321 Yours, Steffen -- IPB Halle AG Massenspektrometrie & Bioinformatik Dr. Steffen Neumann http://www.IPB-Halle.DE Weinberg 3 http://msbi.bic-gh.de 06120 Halle Tel. +49 (0) 345 5582 - 1470 +49 (0) 345 5582 - 0 sneumann(at)IPB-Halle.DE Fax. +49 (0) 345 5582 - 1409 |
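As an illustration of how short such a decoder can be, here is a Python sketch using only the standard library. It assumes the array is a sequence of little-endian 64-bit floats, optionally zlib-compressed and then Base64-encoded; the function names are hypothetical, and a real reader would take the precision and compression from the CV terms of the binary data array rather than from arguments.

```python
import base64
import struct
import zlib


def encode_array(values: list[float], compress: bool = False) -> str:
    """Pack doubles little-endian, optionally zlib-compress, then Base64-encode."""
    raw = struct.pack(f"<{len(values)}d", *values)
    if compress:
        raw = zlib.compress(raw)
    return base64.b64encode(raw).decode("ascii")


def decode_array(text: str, compressed: bool = False) -> list[float]:
    """Reverse of encode_array: Base64-decode, optionally inflate, unpack doubles."""
    raw = base64.b64decode(text)
    if compressed:
        raw = zlib.decompress(raw)
    count = len(raw) // 8                      # 8 bytes per 64-bit float
    return list(struct.unpack(f"<{count}d", raw))


if __name__ == "__main__":
    mz = [445.120025, 446.123380, 1221.990637]   # made-up values
    text = encode_array(mz, compress=True)
    print(text)
    print(decode_array(text, compressed=True))
```

The decoded list compares equal to the input, since packing and unpacking a double does not change its bits.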
From: Steffen N. <sne...@ip...> - 2009-06-12 14:03:16
|
On Fri, 2009-06-12 at 14:27 +0200, Fredrik Levander wrote: > Wouldn't it make sense to add an optional CV term for the number of > significant digits in a binary array? Couldn't one express significant digits via ID: PSI:1000014, Name: Accuracy? This is currently geared towards m/z in ppm. Should this be modified to be applied to time or intensities in other binary arrays as well? Yours, Steffen -- IPB Halle AG Massenspektrometrie & Bioinformatik Dr. Steffen Neumann http://www.IPB-Halle.DE Weinberg 3 http://msbi.bic-gh.de 06120 Halle Tel. +49 (0) 345 5582 - 1470 +49 (0) 345 5582 - 0 sneumann(at)IPB-Halle.DE Fax. +49 (0) 345 5582 - 1409 |
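To relate a number of decimals to an accuracy in ppm, one possible reading is the convention Mike described earlier in the thread, where "1.5" stands for 1.5 +/- 0.05, i.e. half a unit in the last written place. The Python sketch below applies that convention; it is an assumption for illustration, not a definition taken from the CV, and the function names are made up.

```python
def implied_halfwidth(decimals: int) -> float:
    """Half a unit in the last written decimal place, e.g. 1 decimal -> 0.05."""
    return 0.5 * 10.0 ** (-decimals)


def implied_ppm(mz: float, decimals: int) -> float:
    """The same half-width expressed relative to the m/z value, in ppm."""
    return implied_halfwidth(decimals) / mz * 1e6


if __name__ == "__main__":
    # Hypothetical m/z of 1000 written with 1 to 5 decimals.
    for decimals in range(1, 6):
        print(decimals, "decimals ->", f"{implied_ppm(1000.0, decimals):.3g}", "ppm")
```

Under this reading a fixed number of decimals corresponds to a different ppm value at every m/z, which is one argument for storing the two kinds of information separately.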
From: Eric D. <ede...@sy...> - 2009-06-23 08:20:14
|
Hi everyone, the next PSI Mass Spectrometry Standards Working Group call will be Tuesday 8am PDT:

http://www.timeanddate.com/worldclock/fixedtime.html?day=23&month=6&year=2009&hour=16&min=0&sec=0&p1=136

08:00 San Francisco
11:00 New York
16:00 London
17:00 Geneva

+ Germany: 08001012079
+ Switzerland: 0800000860
+ UK: 08081095644
+ USA: 1-866-314-3683
Generic international: +44 2083222500 (UK number)
access code: 297427

Agenda:

1) mzML 1.1.0
 - Numerical precision
 - Validation of example files
 - Controlled vocabulary
 - Other items?
----
2) Need to update the mzML implementations catalog
----
3) MIAPE-MS revision
 - Have revised document to discuss at ASMS
----
4) TraML development
 - Updates pending
 - Implementations?
 - Precision issue (1., 1.0, 1.00)
 - XML comment: ASCII 1.00 but what good is this?
 - just have a precision array type
 - Or significant digits?
 - 1001.5 has precision 1 and 5 significant digits?
 - 1.24e7 has precision -6 (?!) and 3 significant digits?
|
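The last two agenda items can be explored with Python's decimal module, which keeps the digits of a literal exactly as written. This is only a sketch of one possible convention, taking "precision" to mean the number of decimals (the negated exponent of the last written digit); the function name is hypothetical. Under that convention 1001.5 gives precision 1 and 5 significant digits as in the agenda, while 1.24e7 comes out as -5 rather than the -6 queried there.

```python
from decimal import Decimal


def describe(text: str) -> dict[str, int]:
    """Significant digits and number of decimals of a number as written."""
    _sign, digits, exponent = Decimal(text).as_tuple()
    return {"significant_digits": len(digits), "decimals": -exponent}


if __name__ == "__main__":
    for text in ["1.", "1.0", "1.00", "1001.5", "1.24e7"]:
        print(f"{text:>8}", describe(text))
```

Either quantity alone would be enough to restore the original ASCII form from a double, provided the convention is fixed by the CV term.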
From: Steffen N. <sne...@ip...> - 2009-06-23 10:18:29
|
On Tue, 2009-06-23 at 01:18 -0700, Eric Deutsch wrote: > Hi everyone, the next PSI Mass Spectrometry Standards Working Group > call will be Tuesday 8am PDT: I won't make it. My example file is one of those not yet validating :-( > 1) mzML 1.1.0 > - Validation of example files > - Controlled vocabulary I have the problem that there are no "absorbance units AU" yet, see my mail with some proposals: Subject: Re: [Psidev-ms-dev] Example files Date: Fri, 12 Jun 2009 15:38:55 +0200 Yours, Steffen -- IPB Halle AG Massenspektrometrie & Bioinformatik Dr. Steffen Neumann http://www.IPB-Halle.DE Weinberg 3 http://msbi.bic-gh.de 06120 Halle Tel. +49 (0) 345 5582 - 1470 +49 (0) 345 5582 - 0 sneumann(at)IPB-Halle.DE Fax. +49 (0) 345 5582 - 1409 |