From: Randy J. <rkj...@in...> - 2006-10-05 11:27:53
There was concern in the NBT review of the mzData manuscript that the format was not specifically designed for either quantitation or 'raw' data. Quite the opposite is true - it handles these better than it handles a 'peak list'.

Given the broad scope we are going for, I think mzData 2.0 needs to cover both of Mike's suggestions.

The representation should allow an ASCII list representation, _and_ a base64 list option. Within each of these, the _desired_ precision should be used. If you want to make some kind of 21CFR11 claim regarding GLP or GCP for clinical data (metabolites, proteins or biomarker analyses) then the ability to represent 'raw' data is critical and part of the current design.

It is the simple case of 'represent a single tandem MS spectrum of a single peptide at only the precision of the m/z calibration' that is harder than it needs to be with the current representation.

During the Washington PSI meeting a proposal was made to re-introduce the ASCII data representation that was dropped at the PSI meeting in Nice. What does everyone think of this idea?

Randy

-----Original Message-----
From: psi...@li... [mailto:psi...@li...] On Behalf Of Mike Coleman
Sent: Wednesday, October 04, 2006 3:13 PM
To: Angel Pizarro
Cc: Psi...@li...
Subject: Re: [Psidev-ms-dev] Why base64?

[This message seems to have been bounced by Sourceforge, so I'm resending it. I'm sorry to see that apparently they are having serious email problems these days. See today's Slashdot article at http://it.slashdot.org/article.pl?sid=06/10/04/1324214. (Apparently the problem isn't limited to email coming from gmail accounts.) ]

On 9/28/06, Mike Coleman <tu...@gm...> wrote:
> Makes sense. To put it in other words, there are two questions here:
>
> 1. Are the values represented as base64-encoded bitstrings or as ASCII text?
>
> 2. Should the values be rounded to the precision of the instrument
> (probably plus a digit, etc.), or should an arbitrary number of
> figures be used? Again, this isn't about losing information, as we're
> only discussing rounding away noise.
>
> These two questions are entirely orthogonal, as far as I can see, and
> it would be possible to allow both options for both questions, if this
> were seen as being worthwhile. The one interaction is that if you use
> the ASCII text encoding, rounding the figures will make the mzData
> file smaller.
>
> Regarding ambiguity, the ASCII text representation would allow
> differing whitespace (which produces no semantic difference). I guess
> the base64 encoding also allows differing surrounding whitespace.
>
> With respect to the base64 encoding, one corner case comes to mind.
> Are special IEEE values like NaN, the infinities, negative zero, etc.,
> allowed? If so, what should the interpretation be?
>
> Mike
>
>
> The example code I mentioned:
>
> /* gcc -g -O2 -ffloat-store -o ieee-test ieee-test.c */
>
> /* strtof is GNU/C99 */
> #define _GNU_SOURCE
>
> #include <assert.h>
> #include <errno.h>
> #include <limits.h>
> #include <stdio.h>
> #include <stdlib.h>
>
>
> union bits {
>   unsigned int u;
>   float f;
> };
>
>
> int
> main() {
>   unsigned int i;
>   union bits x, x2;
>   int zeros_seen = 0;
>
>   assert(sizeof x.u == sizeof x.f);
>   assert((void *) &x.u == (void *) &x.f);
>
>   for (i=0; ; i++) {
>     char buf[128];
>
>     /* stop once i has wrapped around to 0 again */
>     if (i == 0)
>       if (++zeros_seen > 1)
>         break;
>
> #if 0
>     if (!(i % 100000))
>       putc('.', stderr);
> #endif
>
>     x.u = i;
>     if (x.f != x.f)
>       continue;                 /* skip error values */
>
>     sprintf(buf, "%.8e", x.f);
>
>     errno = 0;
>     x2.f = strtof(buf, 0);
>     if (errno == ERANGE) {
>       printf("strtof error for %s\n", buf);
>       continue;
>     }
>
>     if (x2.u != x.u)
>       printf("bit difference for %s (%u != %u)\n", buf, x2.u, x.u);
>   }
> }
>

_______________________________________________
Psidev-ms-dev mailing list
Psi...@li...
https://lists.sourceforge.net/lists/listinfo/psidev-ms-dev
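To make Mike's two questions concrete, here is a minimal C sketch (not code from the thread) that packs 32-bit floats into base64 and measures how many characters the same values need as rounded ASCII text. The function names (`base64_length`, `base64_encode`, `float_to_le_bytes`, `ascii_length`), the little-endian byte order, and the `%.*g` text format are all assumptions made for illustration; none of them is taken from the mzData schema, and the code assumes `float` is a 32-bit IEEE 754 type.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static const char B64_ALPHABET[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Characters needed to base64-encode n_bytes of data, including '=' padding. */
size_t base64_length(size_t n_bytes)
{
    return 4 * ((n_bytes + 2) / 3);
}

/* Encode n bytes from src into out as NUL-terminated base64 text.
   out must hold base64_length(n) + 1 chars. Returns the characters written. */
size_t base64_encode(const unsigned char *src, size_t n, char *out)
{
    size_t o = 0;
    for (size_t i = 0; i < n; i += 3) {
        /* Gather up to three bytes into one 24-bit group. */
        unsigned long v = (unsigned long)src[i] << 16;
        if (i + 1 < n) v |= (unsigned long)src[i + 1] << 8;
        if (i + 2 < n) v |= (unsigned long)src[i + 2];
        out[o++] = B64_ALPHABET[(v >> 18) & 63];
        out[o++] = B64_ALPHABET[(v >> 12) & 63];
        out[o++] = (i + 1 < n) ? B64_ALPHABET[(v >> 6) & 63] : '=';
        out[o++] = (i + 2 < n) ? B64_ALPHABET[v & 63] : '=';
    }
    out[o] = '\0';
    return o;
}

/* Store the bit pattern of f as four bytes, least significant first
   (assumed byte order; a real format has to state which one it uses). */
void float_to_le_bytes(float f, unsigned char out[4])
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    out[0] = (unsigned char)(u & 0xFF);
    out[1] = (unsigned char)((u >> 8) & 0xFF);
    out[2] = (unsigned char)((u >> 16) & 0xFF);
    out[3] = (unsigned char)((u >> 24) & 0xFF);
}

/* Characters needed to write n values as space-separated decimal text
   with `digits` significant digits. Returns 0 on a formatting error. */
size_t ascii_length(const float *values, size_t n, int digits)
{
    size_t total = 0;
    for (size_t i = 0; i < n; i++) {
        int len = snprintf(NULL, 0, "%.*g", digits, values[i]);
        if (len < 0)
            return 0;
        total += (size_t)len;
        if (i + 1 < n)
            total += 1; /* one separator between neighbouring values */
    }
    return total;
}
```

With these definitions the single value 1.0f packs to the bytes 00 00 80 3F and encodes as `AACAPw==`. The base64 cost is a fixed 4 characters per 3 bytes however the values were rounded, while `ascii_length` shrinks as fewer significant digits are requested, which is the one interaction between the two questions that Mike points out.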
From: Pierre-Alain B. <pie...@is...> - 2006-10-05 12:14:04
I am for the possibility of representing a spectrum/peaklist/even chromatogram in more than one manner ONLY if these representations are easy and straightforward to generate and to parse AND if there is a good (or better, blocking) reason to do so. We need to avoid optional things that make any implementation subject to interpretation and misunderstanding.

So yes, but only if the two formats are strictly and clearly described and discriminated (specification issue).

Pierre-Alain

--
Dr. Pierre-Alain Binz
Swiss Institute of Bioinformatics
Proteome Informatics Group
1, Rue Michel Servet
CH-1211 Geneve 4
Switzerland
- - - - - - - - - - - - - - - - -
Tel: +41-22-379 50 50
Fax: +41-22-379 58 58
Pie...@is...
http://www.expasy.org/people/Pierre-Alain.Binz.html
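One place where a strict description matters is Mike's corner case about special IEEE values. As a hedged illustration only, the C sketch below shows how a writer or validator could detect NaN, the infinities and negative zero from the raw 32-bit pattern, so that a specification could state explicitly whether each is allowed. The enum, the function names and the idea of rejecting at the first special value are hypothetical and are not part of any mzData specification.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Categories a specification might want to rule on individually. */
enum float_kind { FK_FINITE, FK_NEG_ZERO, FK_INFINITY, FK_NAN };

/* Classify a 32-bit IEEE 754 single-precision bit pattern. */
enum float_kind classify_bits(uint32_t bits)
{
    uint32_t exponent = (bits >> 23) & 0xFFu;  /* 8 exponent bits */
    uint32_t fraction = bits & 0x7FFFFFu;      /* 23 fraction bits */

    if (exponent == 0xFFu)                     /* all-ones exponent */
        return fraction ? FK_NAN : FK_INFINITY;
    if (bits == 0x80000000u)                   /* sign bit only */
        return FK_NEG_ZERO;
    return FK_FINITE;                          /* includes +0 and subnormals */
}

/* Return the index of the first value that is not an ordinary finite
   number, or -1 if every value is acceptable under a "finite only" rule. */
long first_special_value(const float *values, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t bits;
        memcpy(&bits, &values[i], sizeof bits); /* assumes 32-bit float */
        if (classify_bits(bits) != FK_FINITE)
            return (long)i;
    }
    return -1;
}
```

Working on the bit pattern rather than on comparisons avoids the trap that negative zero compares equal to zero and NaN compares unequal to everything, including itself. Whether negative zero should really be rejected, or silently treated as zero, is exactly the kind of question the specification would have to answer.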
From: Brian P. <bri...@in...> - 2006-10-05 15:59:23
I'm strongly opposed to the change. In addition to the previously discussed concerns about accuracy and the fundamental pointlessness due to the unsuitability of XML for eyeballing what is essentially columnar data, there's an additional and perhaps deeper practical concern:

A data exchange standard that provides many ways to express the same idea is headed for the rocks. Vendors will tend to implement only the parts of the standard that interest them and the ecosystem quickly breaks down (I speak from experience with interchange standards in the internet security and circuit board manufacturing software industries; it's a phenomenon not peculiar to any one field of endeavor). A standard that provides n>1 ways to state the same thing is n times as difficult to implement and maintain, which reduces vendor enthusiasm by a factor of n (squared?), which hinders widespread adoption.

As we sometimes say in the States, "If it ain't broke, don't fix it."

Brian Pratt
From: Angel P. <an...@ma...> - 2006-10-05 19:20:18
I have to second Brian on this one. From the operational and reporting requirements, having both ASCII and binary representations just adds confusion. Better to address the problem of perceived complexity and general usage through tool development efforts.

Also, this case:

> It is the simple case of 'represent a single tandem MS spectrum of a single
> peptide at only the precision of the m/z calibration' that is harder than it
> needs to be with the current representation.

is not used outside of post-analysis verification of the spectra (e.g. was the assignment of spectra valid, were the right peaks used for quant, etc.). It is very low-throughput and NOT viewed outside of the analysis context.

This is just my perception though, so if someone has an example please speak up.

angel

--
Angel Pizarro
Director, Bioinformatics Facility
Institute for Translational Medicine and Therapeutics
University of Pennsylvania
806 BRB II/III
421 Curie Blvd.
Philadelphia, PA 19104-6160
P: 215-573-3736
F: 215-573-9004
E: an...@ma...
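As a rough idea of how small the tooling suggested above could be, the following C sketch decodes a base64 string back into 32-bit floats and formats one peak as a tab-separated text line for eyeballing. It is an illustration only: the function names (`base64_decode`, `le_bytes_to_float`, `format_peak`), the little-endian byte order and the choice of four decimals for m/z are assumptions, not anything defined by mzData or by an existing tool.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Map one base64 character to its 6-bit value, or -1 if it is not valid. */
static int b64_value(int c)
{
    if (c >= 'A' && c <= 'Z') return c - 'A';
    if (c >= 'a' && c <= 'z') return c - 'a' + 26;
    if (c >= '0' && c <= '9') return c - '0' + 52;
    if (c == '+') return 62;
    if (c == '/') return 63;
    return -1;
}

/* Decode NUL-terminated base64 text into out, skipping whitespace.
   out must hold at least 3 bytes per 4 input characters.
   Returns the number of bytes written, or -1 on an invalid character. */
long base64_decode(const char *text, unsigned char *out)
{
    unsigned long acc = 0; /* bits collected so far */
    int bits = 0;          /* number of valid bits in acc */
    long n = 0;

    for (const char *p = text; *p != '\0'; p++) {
        unsigned char c = (unsigned char)*p;
        if (isspace(c))
            continue;      /* whitespace carries no meaning */
        if (c == '=')
            break;         /* padding marks the end of the data */
        int v = b64_value(c);
        if (v < 0)
            return -1;
        acc = (acc << 6) | (unsigned long)v;
        bits += 6;
        if (bits >= 8) {
            bits -= 8;
            out[n++] = (unsigned char)((acc >> bits) & 0xFF);
            acc &= (1UL << bits) - 1; /* keep only the leftover bits */
        }
    }
    return n;
}

/* Rebuild a float from four bytes stored least significant first
   (assumed byte order; assumes float is 32-bit IEEE 754). */
float le_bytes_to_float(const unsigned char b[4])
{
    uint32_t u = (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
                 ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* Format one peak as "m/z<TAB>intensity"; returns snprintf's result. */
int format_peak(char *buf, size_t size, float mz, float intensity)
{
    return snprintf(buf, size, "%.4f\t%.1f", mz, intensity);
}
```

A viewer built this way keeps a single binary representation in the file and still gives people a columnar list to read when verifying an individual spectrum, which is the low-throughput case described above.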
From: Geer, L. \(NIH/NLM/NCBI\) [E] <le...@nc...> - 2006-10-06 14:27:57
Hi,

I guess the general experience at NCBI is to make standards as flexible as possible while making them as explicit, easy to read, and validatable as possible. The pain of having multiple representations within the same standard is much less than the pain of having multiple standards, which can happen if a particular standard is too rigid.

The "easy to read" requirement means by both machine and human -- human readable probably being the most important because of all of the endless debugging required when reading and writing files. It seems much more fun writing new applications than dealing with import/export code!

Lewis
From: Talapady N B. <bh...@ni...> - 2006-10-06 14:35:43
|
Hi,

I fully agree. Rigid standards usually stay only on 'paper' and they
foster chaos. 'import/export' codes are the breeding grounds for
multiple standards.

Best regards,
T N Bhat

----- Original Message -----
From: "Geer, Lewis (NIH/NLM/NCBI) [E]" <le...@nc...>
To: <psi...@li...>
Sent: Friday, October 06, 2006 10:27 AM
Subject: Re: [Psidev-ms-dev] FW: Why base64?

> Hi,
>
> I guess the general experience at NCBI is to make standards as flexible
> as possible while making them as explicit, easy to read, and validatable
> as possible. The pain of having multiple representations within the
> same standard is much less than the pain of having multiple standards,
> which can happen if a particular standard is too rigid.
>
> The "easy to read" requirement means by both machine and human -- human
> readable probably being the most important because of all of the endless
> debugging required when reading and writing files. It seems much more
> fun writing new applications than dealing with import/export code!
>
> Lewis
>
> -----Original Message-----
> From: Angel Pizarro [mailto:an...@ma...]
> Sent: Thursday, October 05, 2006 3:17 PM
> To: psi...@li...
> Subject: Re: [Psidev-ms-dev] FW: Why base64?
>
> I have to second Brian on this one. From the operational and reporting
> requirements, having both ASCII and binary representations just adds
> confusion. Better to address the problem of perceived complexity and
> general usage through tool development efforts.
>
> Also, this case:
> > It is the simple case of 'represent a single tandem MS spectrum of a
> > single peptide at only the precision of the m/z calibration' that is
> > harder than it needs to be with the current representation.
>
> is not used outside of post-analysis verification of the spectra (e.g.
> was the assignment of spectra valid, were the right peaks used for
> quant, etc.). Very low-throughput and NOT viewed outside of the analysis
> context.
>
> This is just my perception though, so if someone has an example please
> speak up.
>
> angel
>
>
> Brian Pratt wrote:
> > I'm strongly opposed to the change. In addition to the previously
> > discussed concerns about accuracy and the fundamental pointlessness
> > due to the unsuitability of XML for eyeballing what is essentially
> > columnar data, there's an additional and perhaps deeper practical
> > concern:
> >
> > A data exchange standard that provides many ways to express the same
> > idea is headed for the rocks. Vendors will tend to implement only the
> > parts of the standard that interest them and the ecosystem quickly
> > breaks down (I speak from experience with interchange standards in the
> > internet security and circuit board manufacturing software industries,
> > it's a phenomenon not peculiar to any one field of endeavor). A
> > standard that provides n>1 ways to state the same thing is n times as
> > difficult to implement and maintain, which reduces vendor enthusiasm
> > by a factor of n (squared?), which hinders widespread adoption.
> >
> > As we sometimes say in the States, "If it ain't broke, don't fix it."
> >
> > Brian Pratt
> >
> > ------------------------------------------------------------------------
> > *From:* psi...@li...
> > [mailto:psi...@li...] *On Behalf Of
> > *Pierre-Alain Binz
> > *Sent:* Thursday, October 05, 2006 5:10 AM
> > *To:* Randy Julian
> > *Cc:* psi...@li...
> > *Subject:* Re: [Psidev-ms-dev] FW: Why base64?
> >
> > I am for the possibility to represent a spectrum/peaklist/even
> > chromatogram in more than one manner ONLY if these representations
> > are easy and straightforward to generate and to parse AND if there
> > is a good (or better blocking) reason to do so. We need to avoid
> > optional things that make any implementation subject to
> > interpretation and misunderstanding.
> > So yes only if the two formats are strictly and clearly described
> > and discriminated (specification issue)
> >
> > Pierre-Alain
|
From: Mike C. <tu...@gm...> - 2006-10-06 06:48:25
|
On 10/5/06, Brian Pratt <bri...@in...> wrote:
> ...the unsuitability of XML for eyeballing what is essentially
> columnar data, ...

I do think "eyeballability" is important, but I also feel uneasy
placing the key spectrum data beyond the reach of XML in an XML
spectrum format. In essence, in the current version the XML encodes
spectrum metadata--the peaks themselves become an afterthought, hidden
away in a relatively inaccessible appendix.

This would be easier to justify if this were image data, for which
there is no reasonable textual representation. But in this case there
is a trivial representation, and the code to read and write it is
probably simpler than for the base64-encoded case.

There's some discussion here

    http://c2.com/cgi/wiki?IsolateEachDatum

that touches on this issue. Also, an example on that page suggests
another possibility for the encoding of peaklists that I prefer to
those discussed so far:

    <peaklist>
      <peak mz="234.56" i="789" />
      <peak mz="3456.43" i="2" />
      <peak mz="3457.22" i="234" />
    </peaklist>

This would have the virtue of being highly accessible to eyeball and
quick-and-dirty scripts as well. It would also clearly compress well.
And it keeps the peak data within the realm of XML. It would be
conceivable, I think, to use XSLT to create a table of peak data or
even an SVG image of the spectrum, for example, since everything would
be living in XML-land.

> ...A standard that provides n>1 ways
> to state the same thing is n times as difficult to implement and maintain,
> which reduces vendor enthusiasm by a factor of n (squared?), which hinders
> widespread adoption. ...

I generally agree with this, and in particular, I suspect that if the
specification allowed both representations, possibly most vendors
would only produce base64 output. For this reason, if the textual
representation is preferred, maybe the base64 alternative should be
deprecated and marked for removal in a future version.
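
As a rough illustration of the quick-and-dirty-script point above, here
is a minimal C sketch that reads peaks in the attribute-per-peak layout.
It assumes every peak is written exactly as in the example (mz first,
then i, double-quoted values). The function names parse_peak and
total_intensity are made up for this sketch and are not part of mzData
or of any library; a real reader would use a proper XML parser rather
than sscanf.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of mzData): read one <peak .../> element
   starting at p. Assumes the attribute order and quoting used in the
   example: mz first, then i. Returns 1 on success, 0 otherwise. */
int parse_peak(const char *p, double *mz, double *intensity)
{
    assert(p != NULL && mz != NULL && intensity != NULL);
    return sscanf(p, " <peak mz=\"%lf\" i=\"%lf\"", mz, intensity) == 2;
}

/* Hypothetical helper: walk a whole <peaklist> held in memory, count the
   peaks that parse, and return the sum of their intensities. */
double total_intensity(const char *xml, int *npeaks)
{
    double total = 0.0, mz, inten;
    const char *p = xml;

    assert(xml != NULL && npeaks != NULL);
    *npeaks = 0;
    /* "<peak " (with the trailing space) does not match "<peaklist>" */
    while ((p = strstr(p, "<peak ")) != NULL) {
        if (parse_peak(p, &mz, &inten)) {
            total += inten;
            (*npeaks)++;
        }
        p += strlen("<peak ");  /* move past this tag before searching again */
    }
    return total;
}
```

Called on the three-peak example above, total_intensity counts 3 peaks
and returns 1025 (789 + 2 + 234).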

Even if instrument vendors only ever produce base64, I think that there
is still an advantage to having the textual alternative in the
specification. It would allow those of us who prefer the textual format
to convert to it in a standard way, one that coordinates with the
mzData standard.

Mike
|
From: Randy J. <rkj...@in...> - 2006-10-06 13:48:48
|
In the mass spectrometry community there is a long history of building
spectral databases which benefit from direct readability.

Historically these have been plain ASCII representations including things
like JCAMP-DX, etc. I think this list would agree that it would be better
to use a HUPO format for a peptide database. mzData could provide
desirable additional instrument parameter information and provide a
consistent mechanism for dealing with MS data across the proteomics
community. To choose a numeric representation which causes groups like the
NIST to use another format to receive and deliver data would be a loss.

Instrument vendors are now providing exports to mzData, and I think it is
critical that these exports be usable to submit data to mass spectral
databases like those used by the MS community for years.

If the cost is a little more code in the parser to deal with one more
'choice' element (of which we have many), then that seems small compared to
the consequence of the NIST not being able to use the standard to deliver
results to the community and thus requiring us to have a completely
different parser to read yet another MS format.

Randy

===

Steve wrote:

...

In our library, for example, we want the users to see the values that we
put there, so we use ASCII. It would be very desirable for us if the same
were offered in the XML's - otherwise we will have to go non-standard.
...

-Steve Stein

===

Later Mike wrote:

that touches on this issue. Also, an example on that page suggests
another possibility for the encoding of peaklists that I prefer to
those discussed so far:

    <peaklist>
      <peak mz="234.56" i="789" />
      <peak mz="3456.43" i="2" />
      <peak mz="3457.22" i="234" />
    </peaklist>

This would have the virtue of being highly accessible to eyeball and
quick-and-dirty scripts as well. It would also clearly compress well.
And it keeps the peak data within the realm of XML. It would be
conceivable, I think, to use XSLT to create a table of peak data or
even an SVG image of the spectrum, for example, since everything would
be living in XML-land.

> ...A standard that provides n>1 ways
> to state the same thing is n times as difficult to implement and maintain,
> which reduces vendor enthusiasm by a factor of n (squared?), which hinders
> widespread adoption. ...

I generally agree with this, and in particular, I suspect that if the
specification allowed both representations, possibly most vendors
would only produce base64 output. For this reason, if the textual
representation is preferred, maybe the base64 alternative should be
deprecated and marked for removal in a future version.

However, I think that there is still an advantage to having the
textual alternative in the specification, even if instrument vendors
never produce it. It would allow those of us who prefer the textual
format to convert to it in a standard way, in a way that
coordinates with the mzData standard.
|
From: Tom B. <tb...@um...> - 2006-10-06 16:16:04
|
The comment was made this morning (fourth paragraph below):

> If the cost is a little more code in the parser
> to deal with one more 'choice' element (of which
> we have many), then that seems small . . .

To the contrary: a more important cost will be many
applications which say they 'support' the mzData standard,
but handle only one or the other of the two alternate data
representations. This has the potential for confusion
among developers and users about what it means to support
the standard. In an ideal world, all applications would
support both . . . but in practice I fear that developers
will implement only the branch they need.

To me, the central question is one of community sociology:
will it be clearer to the community to describe mzData
as one standard containing alternatives -- or as two
separate and possibly interoperable standards with separate
names? I think this is more important than the technical
issues.

I am extremely apprehensive about allowing alternate
representations for the same information in a single
standard. The value of having a standard data exchange
format is to give each user the confidence that what is in
a file, or what an application can do, is what he or she
expects -- without checking in detail and without
special-casing the data files from different sources.
Simplicity and uniformity are key.

With the greatest respect for all of the contributors,
especially Randy Julian, I have to agree with Brian Pratt
and Angel Pizarro on this point: to have ambiguity in the
mzData standard at the level of allowing two alternate
representations for the same information is effectively
not to have a standard.

Tom Blackwell
University of Michigan Bioinformatics
Ann Arbor, Michigan

(I have appended Brian Pratt's and Lewis Geer's contributions
from this morning below Randy Julian's email.)

On Fri, 6 Oct 2006, Randy Julian wrote:

> In the mass spectrometry community there is a long history of building
> spectral databases which benefit from direct readability.
>
> Historically these have been plain ASCII representations including things
> like JCAMP-DX, etc. I think this list would agree that it would be better
> to use a HUPO format for a peptide database. mzData could provide
> desirable additional instrument parameter information and provide a
> consistent mechanism for dealing with MS data across the proteomics
> community. To choose a numeric representation which causes groups like the
> NIST to use another format to receive and deliver data would be a loss.
>
> Instrument vendors are now providing exports to mzData, and I think it is
> critical that these exports be usable to submit data to mass spectral
> databases like those used by the MS community for years.
>
> If the cost is a little more code in the parser to deal with one more
> 'choice' element (of which we have many), then that seems small compared to
> the consequence of the NIST not being able to use the standard to deliver
> results to the community and thus requiring us to have a completely
> different parser to read yet another MS format.
>
> Randy
>
> ===
>
> Steve wrote:
>
> ...
>
> In our library, for example, we want the users to see the values that we
> put there, so we use ASCII. It would be very desirable for us if the same
> were offered in the XML's - otherwise we will have to go non-standard.
> ...
>
> -Steve Stein
>
> ===
>
> Later Mike wrote:
>
> that touches on this issue. Also, an example on that page suggests
> another possibility for the encoding of peaklists that I prefer to
> those discussed so far:
>
>     <peaklist>
>       <peak mz="234.56" i="789" />
>       <peak mz="3456.43" i="2" />
>       <peak mz="3457.22" i="234" />
>     </peaklist>
>
> This would have the virtue of being highly accessible to eyeball and
> quick-and-dirty scripts as well. It would also clearly compress well.
> And it keeps the peak data within the realm of XML. It would be
> conceivable, I think, to use XSLT to create a table of peak data or
> even an SVG image of the spectrum, for example, since everything would
> be living in XML-land.
>
>> ...A standard that provides n>1 ways
>> to state the same thing is n times as difficult to implement and maintain,
>> which reduces vendor enthusiasm by a factor of n (squared?), which hinders
>> widespread adoption. ...
>
> I generally agree with this, and in particular, I suspect that if the
> specification allowed both representations, possibly most vendors
> would only produce base64 output. For this reason, if the textual
> representation is preferred, maybe the base64 alternative should be
> deprecated and marked for removal in a future version.
>
> However, I think that there is still an advantage to having the
> textual alternative in the specification, even if instrument vendors
> never produce it. It would allow those of us who prefer the textual
> format to convert to it in a standard way, in a way that
> coordinates with the mzData standard.

> From bri...@in... Fri Oct 6 11:18:17 2006
>
> If one were to pursue the ASCII course then the structured approach Mike
> presents is clearly the way to go. I still think it doesn't scale well,
> though, and can't imagine the mass spec vendors actually writing such files.
>
> To those on the thread saying "if there is a need for an eyeballable format,
> let it be part of this standard instead of Yet Another standard", I heartily
> agree. But when we talk of using XSLT to make peak tables, etc, well heck,
> that's just more software translation and isn't really eyeballing, so why
> mess with another format?
>
> But...
>
> It becomes apparent (or am I just slow to catch on?) that we may be
> discussing two different ideas - I think Mike thinks of a "peak" as a
> postprocessed idea, something coming out of a peak picking algorithm,
> while others of us think of a "peak" as an m/z pair in an unprocessed
> raw mass spec output (not deconvoluted, deisotoped, denoised,
> de-anything-ed). Both are of interest, of course, but the latter isn't
> really amenable to an ASCII representation due to its sheer bulk.
>
> So maybe what we should be looking at is two different data elements,
> each with its own representation - and ASCII is arguably the right one
> for a postprocessed peak pick list.
>
> - Brian

From le...@nc... Fri Oct 6 10:28:08 2006
Date: Fri, 6 Oct 2006 10:27:46 -0400
From: "Geer, Lewis (NIH/NLM/NCBI) [E]" <le...@nc...>
To: psi...@li...
Subject: Re: [Psidev-ms-dev] FW: Why base64?

Hi,

I guess the general experience at NCBI is to make standards as flexible
as possible while making them as explicit, easy to read, and validatable
as possible. The pain of having multiple representations within the
same standard is much less than the pain of having multiple standards,
which can happen if a particular standard is too rigid.

The "easy to read" requirement means by both machine and human -- human
readable probably being the most important because of all of the endless
debugging required when reading and writing files. It seems much more
fun writing new applications than dealing with import/export code!

Lewis
|
From: Brian P. <bri...@in...> - 2006-10-06 15:18:12
|
If one were to pursue the ASCII course then the structured approach Mike
presents is clearly the way to go. I still think it doesn't scale well,
though, and can't imagine the mass spec vendors actually writing such files.

To those on the thread saying "if there is a need for an eyeballable format,
let it be part of this standard instead of Yet Another standard", I heartily
agree. But when we talk of using XSLT to make peak tables, etc, well heck,
that's just more software translation and isn't really eyeballing, so why
mess with another format?

But...

It becomes apparent (or am I just slow to catch on?) that we may be
discussing two different ideas - I think Mike thinks of a "peak" as a
postprocessed idea, something coming out of a peak picking algorithm,
while others of us think of a "peak" as an m/z pair in an unprocessed
raw mass spec output (not deconvoluted, deisotoped, denoised,
de-anything-ed). Both are of interest, of course, but the latter isn't
really amenable to an ASCII representation due to its sheer bulk.

So maybe what we should be looking at is two different data elements,
each with its own representation - and ASCII is arguably the right one
for a postprocessed peak pick list.

- Brian
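
To put rough numbers on the bulk question, the C sketch below compares
the size of the same values written as base64-encoded 32-bit floats and
as space-separated ASCII text. The helper names (base64_len, ascii_len)
and the printf formats are assumptions of this sketch, nothing here is
taken from the mzData schema, and compression is ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Length of standard base64 output for nbytes of input, padding included:
   every 3 input bytes (or final partial group) become 4 output characters. */
size_t base64_len(size_t nbytes)
{
    return 4 * ((nbytes + 2) / 3);
}

/* Bytes needed to write n values as ASCII text with the given printf
   format (for example "%.2f" or "%.8e"), one separator after each value.
   The format is a choice of this sketch, not something a spec prescribes. */
size_t ascii_len(const float *v, size_t n, const char *fmt)
{
    size_t total = 0;
    char buf[64];

    for (size_t i = 0; i < n; i++) {
        int len = snprintf(buf, sizeof buf, fmt, (double)v[i]);
        assert(len > 0 && (size_t)len < sizeof buf);
        total += (size_t)len + 1;  /* +1 for the separating space */
    }
    return total;
}
```

For the three m/z values 234.56, 3456.43 and 3457.22 stored as 32-bit
floats, this gives 16 bytes as base64, 23 bytes as ASCII with "%.2f",
and 45 bytes with "%.8e" (nine significant digits, enough to round-trip
a 32-bit float). On these numbers, text rounded to two decimals is about
1.5 times the size of base64, and full-precision text is closer to 3
times.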
|