From: Tom B. <tb...@um...> - 2006-10-06 16:16:04
The comment was made this morning (fourth paragraph below):

> If the cost is a little more code in the parser
> to deal with one more 'choice' element (of which
> we have many), then that seems small . . .

To the contrary: a more important cost will be many applications which
say they 'support' the mzData standard, but handle only one or the other
of the two alternate data representations. This has the potential for
confusion among developers and users about what it means to support the
standard. In an ideal world, all applications would support both . . .
but in practice I fear that developers will implement only the branch
they need.

To me, the central question is one of community sociology: will it be
clearer to the community to describe mzData as one standard containing
alternatives -- or as two separate and possibly interoperable standards
with separate names? I think this is more important than the technical
issues.

I am extremely apprehensive about allowing alternate representations for
the same information in a single standard. The value of having a standard
data exchange format is to give each user the confidence that what is in
a file, or what the capabilities of an application are, matches what he
or she expects -- without checking in detail and without special-casing
the data files from different sources. Simplicity and uniformity are key.

With the greatest respect for all of the contributors, especially Randy
Julian, I have to agree with Brian Pratt and Angel Pizarro on this point:
to have ambiguity in the mzData standard at the level of allowing two
alternate representations for the same information is effectively not to
have a standard.

Tom Blackwell
University of Michigan Bioinformatics
Ann Arbor, Michigan

(I have appended Brian Pratt's and Lewis Geer's contributions from this
morning below Randy Julian's email.)

On Fri, 6 Oct 2006, Randy Julian wrote:

> In the mass spectrometry community there is a long history of building
> spectral databases which benefit from direct readability.
>
> Historically these have been plain ASCII representations including
> things like JCAMP-DX, etc. I think this list would agree that it would
> be better to use a HUPO format for a peptide database. mzData could
> provide desirable additional instrument parameter information and
> provide a consistent mechanism for dealing with MS data across the
> proteomics community. To choose a numeric representation which causes
> groups like NIST to use another format to receive and deliver data
> would be a loss.
>
> Instrument vendors are now providing exports to mzData, and I think it
> is critical that these exports be usable to submit data to mass
> spectral databases like those used by the MS community for years.
>
> If the cost is a little more code in the parser to deal with one more
> 'choice' element (of which we have many), then that seems small
> compared to the consequence of NIST not being able to use the standard
> to deliver results to the community and thus requiring us to have a
> completely different parser to read yet another MS format.
>
> Randy
>
> ===
>
> Steve wrote:
>
> ...
> In our library, for example, we want the users to see the values that
> we put there, so we use ASCII. It would be very desirable for us if the
> same were offered in the XMLs -- otherwise we will have to go
> non-standard.
> ...
>
> -Steve Stein
>
> ===
>
> Later Mike wrote:
>
> ... that touches on this issue. Also, an example on that page suggests
> another possibility for the encoding of peaklists that I prefer to
> those discussed so far:
>
>   <peaklist>
>     <peak mz="234.56" i="789" />
>     <peak mz="3456.43" i="2" />
>     <peak mz="3457.22" i="234" />
>   </peaklist>
>
> This would have the virtue of being highly accessible to eyeball and
> quick-and-dirty scripts as well. It would also clearly compress well.
> And it keeps the peak data within the realm of XML. It would be
> conceivable, I think, to use XSLT to create a table of peak data or
> even an SVG image of the spectrum, for example, since everything would
> be living in XML-land.
>
> > ... A standard that provides n>1 ways to state the same thing is n
> > times as difficult to implement and maintain, which reduces vendor
> > enthusiasm by a factor of n (squared?), which hinders widespread
> > adoption. ...
>
> I generally agree with this, and in particular, I suspect that if the
> specification allowed both representations, possibly most vendors
> would only produce base64 output. For this reason, if the textual
> representation is preferred, maybe the base64 alternative should be
> deprecated and marked for removal in a future version.
>
> However, I think that there is still an advantage to having the
> textual alternative in the specification, even if instrument vendors
> never produce it. It would allow those of us who prefer the textual
> format to convert to it in a standard way, in a way that coordinates
> with the mzData standard.
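To make the 'choice' being discussed concrete: in XML Schema terms,
allowing both representations would amount to something like the sketch
below, where an instance document carries either a base64-encoded binary
array or the textual peak list, but not both. The element and attribute
names here are illustrative only and are not taken from the actual mzData
schema.

  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <!-- Illustrative sketch: hypothetical names, not the real mzData schema. -->
    <xs:element name="peakData">
      <xs:complexType>
        <xs:choice>
          <!-- Alternative 1: a base64-encoded binary array of values -->
          <xs:element name="binaryArray">
            <xs:complexType>
              <xs:simpleContent>
                <xs:extension base="xs:base64Binary">
                  <xs:attribute name="precision" type="xs:int"    use="required"/>
                  <xs:attribute name="endian"    type="xs:string" use="required"/>
                  <xs:attribute name="length"    type="xs:int"    use="required"/>
                </xs:extension>
              </xs:simpleContent>
            </xs:complexType>
          </xs:element>
          <!-- Alternative 2: the textual peak list proposed in the thread -->
          <xs:element name="peaklist">
            <xs:complexType>
              <xs:sequence>
                <xs:element name="peak" maxOccurs="unbounded">
                  <xs:complexType>
                    <xs:attribute name="mz" type="xs:double" use="required"/>
                    <xs:attribute name="i"  type="xs:double" use="required"/>
                  </xs:complexType>
                </xs:element>
              </xs:sequence>
            </xs:complexType>
          </xs:element>
        </xs:choice>
      </xs:complexType>
    </xs:element>
  </xs:schema>

A consuming parser would dispatch on whichever child element it finds;
whether that extra branch is a small cost or a serious one is exactly the
question being debated in this thread.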
From bri...@in... Fri Oct 6 11:18:17 2006

If one were to pursue the ASCII course then the structured approach Mike
presents is clearly the way to go. I still think it doesn't scale well,
though, and can't imagine the mass spec vendors actually writing such
files.

To those on the thread saying "if there is a need for an eyeballable
format, let it be part of this standard instead of Yet Another standard",
I heartily agree. But when we talk of using XSLT to make peak tables,
etc., well heck, that's just more software translation and isn't really
eyeballing, so why mess with another format?

But...

It becomes apparent (or am I just slow to catch on?) that we may be
discussing two different ideas -- I think Mike thinks of a "peak" as a
postprocessed idea, something coming out of a peak picking algorithm,
while others of us think of a "peak" as an m/z pair in an unprocessed
raw mass spec output (not deconvoluted, deisotoped, denoised,
de-anything-ed). Both are of interest, of course, but the latter isn't
really amenable to an ASCII representation due to its sheer bulk.

So maybe what we should be looking at is two different data elements,
each with its own representation -- and ASCII is arguably the right one
for a postprocessed peak pick list.

- Brian

-----Original Message-----
From: psi...@li...
[mailto:psi...@li...] On Behalf Of Mike Coleman
Sent: Thursday, October 05, 2006 11:48 PM
To: bri...@in...
Cc: psi...@li...
Subject: Re: [Psidev-ms-dev] FW: Why base64?

On 10/5/06, Brian Pratt <bri...@in...> wrote:
> ...the unsuitability of XML for eyeballing what is essentially
> columnar data, ...

I do think "eyeballability" is important, but I also feel uneasy
placing the key spectrum data beyond the reach of XML in an XML
spectrum format. In essence, in the current version the XML encodes
spectrum metadata -- the peaks themselves become an afterthought, hidden
away in a relatively inaccessible appendix.

This would be easier to justify if this were image data, for which
there is no reasonable textual representation. But in this case there
is a trivial representation, and the code to read and write it is
probably simpler than for the base64-encoded case.

There's some discussion here

  http://c2.com/cgi/wiki?IsolateEachDatum

that touches on this issue. Also, an example on that page suggests
another possibility for the encoding of peaklists that I prefer to
those discussed so far:

  <peaklist>
    <peak mz="234.56" i="789" />
    <peak mz="3456.43" i="2" />
    <peak mz="3457.22" i="234" />
  </peaklist>

This would have the virtue of being highly accessible to eyeball and
quick-and-dirty scripts as well. It would also clearly compress well.
And it keeps the peak data within the realm of XML. It would be
conceivable, I think, to use XSLT to create a table of peak data or
even an SVG image of the spectrum, for example, since everything would
be living in XML-land.

> ... A standard that provides n>1 ways to state the same thing is n
> times as difficult to implement and maintain, which reduces vendor
> enthusiasm by a factor of n (squared?), which hinders widespread
> adoption. ...

I generally agree with this, and in particular, I suspect that if the
specification allowed both representations, possibly most vendors
would only produce base64 output. For this reason, if the textual
representation is preferred, maybe the base64 alternative should be
deprecated and marked for removal in a future version.

However, I think that there is still an advantage to having the
textual alternative in the specification, even if instrument vendors
never produce it. It would allow those of us who prefer the textual
format to convert to it in a standard way, in a way that coordinates
with the mzData standard.

Mike
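As a concrete illustration of the XSLT remark in Mike's message above, a
minimal stylesheet along the following lines would flatten the proposed
<peaklist> element into a plain tab-separated table. It assumes the element
and attribute names from his example (peaklist, peak, mz, i) and is only a
sketch, not part of any proposed schema or standard.

  <!-- Sketch: turn <peaklist><peak mz=".." i=".."/>...</peaklist> into a
       tab-separated m/z versus intensity table. -->
  <xsl:stylesheet version="1.0"
                  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="text"/>
    <xsl:template match="/peaklist">
      <xsl:text>mz&#9;intensity&#10;</xsl:text>
      <xsl:for-each select="peak">
        <xsl:value-of select="@mz"/>
        <xsl:text>&#9;</xsl:text>
        <xsl:value-of select="@i"/>
        <xsl:text>&#10;</xsl:text>
      </xsl:for-each>
    </xsl:template>
  </xsl:stylesheet>

Run against the three-peak example above, this produces a two-column
m/z versus intensity listing; an analogous stylesheet targeting SVG could
render the spectrum, which is the sense in which the peak data would stay
"within the realm of XML".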
--------------------------------------------------------------

From le...@nc... Fri Oct 6 10:28:08 2006
Date: Fri, 6 Oct 2006 10:27:46 -0400
From: "Geer, Lewis (NIH/NLM/NCBI) [E]" <le...@nc...>
To: psi...@li...
Subject: Re: [Psidev-ms-dev] FW: Why base64?

Hi,

I guess the general experience at NCBI is to make standards as flexible
as possible while making them as explicit, easy to read, and validatable
as possible. The pain of having multiple representations within the same
standard is much less than the pain of having multiple standards, which
can happen if a particular standard is too rigid.

The "easy to read" requirement means readable by both machine and human --
human readable probably being the most important because of all of the
endless debugging required when reading and writing files. It seems much
more fun writing new applications than dealing with import/export code!

Lewis
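For readers trying to picture the two alternate representations weighed in
this thread, the same three peaks from Mike's example would look roughly
as follows in each form. The element names are illustrative (loosely
modeled on the binary-array style discussed above) rather than quoted from
any schema, and the base64 payloads are elided rather than shown.

  <!-- Form 1: base64-encoded binary arrays. The m/z and intensity values
       are packed as, e.g., 32-bit little-endian floats and then
       base64-encoded, so they are not readable without decoding. -->
  <mzArrayBinary>
    <data precision="32" endian="little" length="3"> ...base64 m/z values... </data>
  </mzArrayBinary>
  <intenArrayBinary>
    <data precision="32" endian="little" length="3"> ...base64 intensity values... </data>
  </intenArrayBinary>

  <!-- Form 2: the textual peak list proposed in the thread. -->
  <peaklist>
    <peak mz="234.56" i="789" />
    <peak mz="3456.43" i="2" />
    <peak mz="3457.22" i="234" />
  </peaklist>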