From: Randy J. <rkj...@in...> - 2006-09-23 18:08:25
My thought on this is that we should generalize the "data" section to be a 'class' with a 'type'. Kent will go into this in much more detail in DC, but the basic idea, as discussed in the teleconferences, is that the instrument could be described as a process (protocols) with inputs and outputs (parameterSets). An output should be defined as a class which could take on any number of 'types', which can include all of the XSD data types if we so choose.

Generalizing like this means that we can define a protocol which takes the base64 data as an input parameter and produces a lossy text representation as an output. The parameters of this protocol and its description would tell us how the conversion was done and what the remaining precision is (significant figures, etc.). In the version of dataXML we will be discussing in DC, the input parameterSet for the above protocol could be another dataXML document with the base64-encoded spectra inside, located at a specified URL, and the protocol producing the "peaklist" could simply 'refer' to the source input rather than duplicating it. By this method, the use case where we want a human-readable peaklist available in text format, derived from an original 'raw' instrument acquisition file located in some repository somewhere, can be achieved with the 'peaklist' document containing the least amount of redundant information possible. If we get this right, an XQuery-compatible link to the 'original' data can be made, allowing almost RDF-like traversal of documents across repositories.

The cost of this flexibility is a more abstract schema (which we will review in DC) and a much heavier reliance on the ontology. The result does not look like XML, but like RDF implemented in XML. This is all hard to digest without specific examples, so those coming to DC should be prepared to work through their favorite use case to make sure it's all working the way we want. If it is, the good news is that the language bindings, and therefore the utilities, will come very fast, since the APIs can be generated directly off the base UML (with the mandatory prestidigitation).

Randy
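[Editor's note: a minimal Python sketch of what such a base64-to-peaklist conversion protocol could do. The function names, the significant-figures parameter, and the little-endian 64-bit float encoding are assumptions for illustration only; they are not part of the dataXML schema to be reviewed in DC.]

```python
# Illustrative sketch only: a "protocol" that takes base64-encoded m/z and
# intensity arrays as input parameters and produces a text peak list as
# output, recording how the conversion was done and what precision remains.
# Assumes little-endian 64-bit floats; real encodings may differ.
import base64
import struct


def decode_float_array(b64_text, fmt_char="d", byte_order="<"):
    """Decode a base64 string into a list of floats."""
    raw = base64.b64decode(b64_text)
    count = len(raw) // struct.calcsize(byte_order + fmt_char)
    return list(struct.unpack("%s%d%s" % (byte_order, count, fmt_char), raw))


def base64_to_peaklist(mz_b64, intensity_b64, sig_figs=6):
    """Return a human-readable peak list plus the conversion parameters."""
    mz = decode_float_array(mz_b64)
    intensity = decode_float_array(intensity_b64)
    lines = ["%.*g %.*g" % (sig_figs, m, sig_figs, i)
             for m, i in zip(mz, intensity)]
    params = {"significantFigures": sig_figs, "lossy": True}
    return "\n".join(lines), params
```

In the dataXML picture described above, the two base64 inputs would be resolved from the referenced source document at its URL rather than being copied into the 'peaklist' document.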
-----Original Message-----
From: psi...@li... [mailto:psi...@li...] On Behalf Of Angel Pizarro
Sent: Friday, September 22, 2006 4:37 PM
To: psi...@li...
Subject: Re: [Psidev-ms-dev] Why base64?

Jimmy Eng wrote:
> I believe base64 encoding makes more sense for some large class of applications that will hopefully be digesting these files, but I'm sure everyone can see the obvious benefits of plain text encoding of peak lists.
>
> The question I have is regarding the representation of space-delimited lists as Lewis and Randy have drawn up. Does this address the needs of Michael, Steve, and Akhilesh and others? Hopefully they'll all chime in. My concern would be that having a horizontal, space-separated list of numbers, where m/z and intensity will possibly be written in separate lists of floats and ints, doesn't really serve the notion of readability. Lots of folks are used to looking at lists of peaks as ordered in .mgf or .dta files, and I'm not sure if a horizontal list of numbers (especially if it's 2 lists, one for m/z and one for intensity) gives you that same sense of readability. I don't really see any regular use case scenarios where people would be scrolling over to the 68th m/z in the list and then somehow counting over to the location of the 68th intensity to get its value.
>
> So _if_ this really doesn't address the needs of the folks who have concerns about the base64 encoding and would like to see plain text, speak up. The last thing the format needs is more complexity in the form of another optional way of representing the data that only a handful of people will ever end up using.
>
> - Jimmy

All excellent points. Let me see if I can recap the set of arguments:

1) For high-throughput and computational tasks, base64 encoding is fast, robust and reasonable with respect to size
2) Text formats are not useful unless they are formatted in an easily digestible fashion
3) Point #2 often conflicts with point #1
4) Ambiguity in a format is universally seen as a "bad thing"

The best suggestion I could think of would be to just go ahead and officially endorse our current standard operating procedures. By this I mean, first and foremost, that the official format be restricted to binary-encoded data arrays. This is the format officially supported by hardware and software vendors. Second, that we endorse one of the /de facto/ plain text formats (MS2 or MGF) as the best way to encode plain text data, *and* (this is the important bit) the official PSI APIs provide export to the endorsed plain text format. Notice that I didn't say import, since this operation is a lossy one, as covered in other posts. Or, if we do provide import routines, they come with the large caveat that the transformation may have been lossy.

The problem I see with this is that I do not know if MS2 or MGF handle data other than MS2, or data from multiple analyzers and detectors. They also generally have a much more restricted set of annotations.

Good idea? Bad idea? Something to discuss in DC at least...

-angel
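[Editor's note: to make the "export is lossy, import needs a caveat" point concrete, here is a small self-contained sketch. It is pure illustration, not part of any proposed PSI API; the m/z values and the 6-significant-figure formatting are made up.]

```python
# Why base64 is exact while a fixed-precision text export is not.
import base64
import struct

mz = [445.120025, 1042.508934, 1999.987654]  # made-up m/z values

# Exact round trip: pack as little-endian doubles, base64-encode, decode back.
packed = struct.pack("<%dd" % len(mz), *mz)
recovered = struct.unpack("<%dd" % len(mz),
                          base64.b64decode(base64.b64encode(packed)))
assert list(recovered) == mz          # bit-for-bit identical

# Lossy round trip: format to 6 significant figures, then parse back.
as_text = ["%.6g" % m for m in mz]    # e.g. '1042.51'
reparsed = [float(t) for t in as_text]
assert reparsed != mz                 # precision is gone; an importer cannot recover it
```

That asymmetry is essentially the argument for endorsing export to a plain text format while flagging, or omitting, import.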