From: Randy J. <rkj...@in...> - 2006-09-23 18:08:25
My thought on this is that we should generalize the "data" section to be a 'class' with a 'type'. Kent will go into this in much more detail in DC, but the basic idea, as discussed in the teleconferences, is that the instrument could be described as a process (protocols) with inputs and outputs (parameterSets). An output should be defined as a class which could take on any number of 'types', which can include all of the XSD data types if we so choose.

Generalizing like this means that we can define a protocol which takes the base64 data as an input parameter and produces a lossy text representation as an output. The parameters of this protocol and its description would tell us how the conversion was done and what the remaining precision is (significant figures, etc.). In the version of dataXML we will be discussing in DC, the input parameterSet for the above protocol could be another dataXML document with the base64-encoded spectra inside, located at a specified URL, and the protocol producing the "peaklist" could simply 'refer' to the source input rather than duplicating it. By this method, the use case where we want a human-readable peaklist available in text format, derived from an original 'raw' instrument acquisition file located in some repository somewhere, can be achieved with the 'peaklist' document containing the least amount of redundant information possible. If we get this right, an XQuery-compatible link to the 'original' data can be made, allowing almost RDF-like traversal of documents across repositories.

The cost of this flexibility is a more abstract schema (which we will review in DC) and a much heavier reliance on the ontology. The result does not look like XML, but like RDF implemented in XML. This is all hard to digest without specific examples, so those coming to DC should be prepared to work through their favorite use case to make sure it's all working the way we want. If it is, the good news is that the language bindings, and therefore the utilities, will come very fast, since the APIs can be generated directly off the base UML (with the mandatory prestidigitation).

Randy
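[Editor's note: a minimal Python sketch of what such a base64-to-peaklist conversion protocol could do. The function names, the significant-figures parameter, and the little-endian 64-bit float encoding are assumptions for illustration only; they are not part of the dataXML schema to be reviewed in DC.]

```python
# Illustrative sketch only: a "protocol" that takes base64-encoded m/z and
# intensity arrays as input parameters and produces a text peak list as
# output, recording how the conversion was done and what precision remains.
# Assumes little-endian 64-bit floats; real encodings may differ.
import base64
import struct


def decode_float_array(b64_text, fmt_char="d", byte_order="<"):
    """Decode a base64 string into a list of floats."""
    raw = base64.b64decode(b64_text)
    count = len(raw) // struct.calcsize(byte_order + fmt_char)
    return list(struct.unpack("%s%d%s" % (byte_order, count, fmt_char), raw))


def base64_to_peaklist(mz_b64, intensity_b64, sig_figs=6):
    """Return a human-readable peak list plus the conversion parameters."""
    mz = decode_float_array(mz_b64)
    intensity = decode_float_array(intensity_b64)
    lines = ["%.*g %.*g" % (sig_figs, m, sig_figs, i)
             for m, i in zip(mz, intensity)]
    params = {"significantFigures": sig_figs, "lossy": True}
    return "\n".join(lines), params
```

In the dataXML picture described above, the two base64 inputs would be resolved from the referenced source document at its URL rather than being copied into the 'peaklist' document.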
-----Original Message-----
From: psi...@li... [mailto:psi...@li...] On Behalf Of Angel Pizarro
Sent: Friday, September 22, 2006 4:37 PM
To: psi...@li...
Subject: Re: [Psidev-ms-dev] Why base64?

Jimmy Eng wrote:
> I believe base64 encoding makes more sense for some large class of applications that will hopefully be digesting these files, but I'm sure everyone can see the obvious benefits of plain text encoding of peak lists.
>
> The question I have is regarding the representation of space-delimited lists as Lewis and Randy have drawn up. Does this address the needs of Michael, Steve, and Akhilesh and others? Hopefully they'll all chime in. My concern would be that having a horizontal, space-separated list of numbers, where m/z and intensity will possibly be written in separate lists of floats and ints, doesn't really serve the notion of readability. Lots of folks are used to looking at lists of peaks as ordered in .mgf or .dta files, and I'm not sure if a horizontal list of numbers (especially if it's 2 lists, one for m/z and one for intensity) gives you that same sense of readability. I don't really see any regular use case scenarios where people would be scrolling over to the 68th m/z in the list and then somehow counting over to the location of the 68th intensity to get its value.
>
> So _if_ this really doesn't address the needs of the folks who have concerns about the base64 encoding and would like to see plain text, speak up. The last thing the format needs is more complexity in the form of another optional way of representing the data that only a handful of people will ever end up using.
>
> - Jimmy

All excellent points. Let me see if I can recap the set of arguments:

1) For high-throughput and computational tasks, base64 encoding is fast, robust and reasonable with respect to size
2) Text formats are not useful unless they are formatted in an easily digestible fashion
3) Point #2 often conflicts with point #1
4) Ambiguity in a format is universally seen as a "bad thing"

The best suggestion I could think of would be to just go ahead and officially endorse our current standard operating procedures. By this I mean, first and foremost, that the official format be restricted to binary-encoded data arrays. This is the format officially supported by hardware and software vendors. Second, that we endorse one of the /de facto/ plain text formats (MS2 or MGF) as the best way to encode plain text data, *and* (this is the important bit) the official PSI APIs provide export to the endorsed plain text format. Notice that I didn't say import, since this operation is a lossy one, as covered in other posts. Or, if we do provide import routines, they come with the large caveat that the transformation may have been lossy.

The problem I see with this is that I do not know if MS2 or MGF handle data other than MS2, or data from multiple analyzers and detectors. They also generally have a much more restricted set of annotations.

Good idea? Bad idea? Something to discuss in DC at least...

-angel
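[Editor's note: to make the "export is lossy, import needs a caveat" point concrete, here is a small self-contained sketch. It is pure illustration, not part of any proposed PSI API; the m/z values and the 6-significant-figure formatting are made up.]

```python
# Why base64 is exact while a fixed-precision text export is not.
import base64
import struct

mz = [445.120025, 1042.508934, 1999.987654]  # made-up m/z values

# Exact round trip: pack as little-endian doubles, base64-encode, decode back.
packed = struct.pack("<%dd" % len(mz), *mz)
recovered = struct.unpack("<%dd" % len(mz),
                          base64.b64decode(base64.b64encode(packed)))
assert list(recovered) == mz          # bit-for-bit identical

# Lossy round trip: format to 6 significant figures, then parse back.
as_text = ["%.6g" % m for m in mz]    # e.g. '1042.51'
reparsed = [float(t) for t in as_text]
assert reparsed != mz                 # precision is gone; an importer cannot recover it
```

That asymmetry is essentially the argument for endorsing export to a plain text format while flagging, or omitting, import.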