Re: [Psidev-ms-dev] mzML 0.93 ready for first review


Hi all.

It's verging on tangential, but in my group we recently drafted 
up a schema to index multi-omics data sets in three 
(omics-specific) repositories. That meant we had to have a 
concept of an 'assaying process' (shortened to assay). For a 
microarray that was basically a single hybridisation to one 
array; some chips come with four arrays on one slide = four 
assays; technical replicates = separate arrays. Basically the 
rationale was to address the smallest (atomic, if you like) unit.

For MS (proteomics first, but also metabolomics) we settled on 
one run = one assay for inline LCMS (i.e. a run lasting many 
minutes). This automatically gives offline a cardinality of one 
assay per (previously collected) fraction run.

For MALDI (the point of all this mumbling), we settled on one 
spot (no matter how many shots) = one assay, because although 
the various spots will likely have a connection, that isn't 
guaranteed. So one plate (if one were recording plates) links to 
n assays (in reality the fact that there was a set of spots on a 
particular plate is kind of ignorable but for QC and the like).
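
A minimal Python sketch of the counting rule described above (one array, one run, or one spot = one assay). The class, function, and identifier names are hypothetical illustrations and are not taken from the schema mentioned.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Assay:
    """Smallest (atomic) unit of data acquisition."""
    technology: str  # e.g. "microarray", "LC-MS", "MALDI"
    unit_id: str     # hypothetical identifier for the atomic unit


def microarray_assays(slide_id: str, arrays_on_slide: int) -> List[Assay]:
    # One hybridisation to one array = one assay,
    # so a slide carrying four arrays yields four assays.
    return [Assay("microarray", f"{slide_id}/array-{i}")
            for i in range(1, arrays_on_slide + 1)]


def lcms_assays(run_ids: List[str]) -> List[Assay]:
    # One run = one assay; offline fractions are run separately,
    # so each fraction run becomes its own assay.
    return [Assay("LC-MS", run_id) for run_id in run_ids]


def maldi_assays(plate_id: str, shots_per_spot: Dict[str, int]) -> List[Assay]:
    # One spot = one assay, no matter how many shots were fired at it.
    return [Assay("MALDI", f"{plate_id}/{spot}") for spot in shots_per_spot]
```

For example, `maldi_assays("plateA", {"A1": 200, "A2": 50})` gives two assays, since the shot counts do not affect the result.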

Where I see a further problem is for MS 'imaging' (i.e. shooting 
lots at tissue slices to map protein distributions) and isn't 
there a MALDI-like source that isn't discrete spots? Maybe I 
misremembered that one though... In either (the latter 
admittedly imaginary) case though I suspect all one could 
robustly consider as a 'run' or assay would be the analysis of 
one set of coordinates for the laser.

Don't know if that helps much  :)

Btw although I take the point that was made about tarballs, to 
play devil's advocate for a sec, the point of some of the 
standardisation stuff is to try to get people to move away from 
informal mechanisms (like use of folder trees or zips, ad hoc 
file naming 'formalisms' etc.) to associate files (/datasets). 
If the ability to combine multiple runs/assays in one mzML file 
_is_ more trouble than it is worth, a compromise might be to 
leverage whatever can be used to uniquely ID an mzML file (have 
LSIDs died yet or what?) to x-ref one file to another (i.e. 
insert a [0..*] element somewhere near the top with an attribute 
to hold an external file ref and a sibling string attribute 
to hold a free text description)? Or, as was suggested, produce 
a wrapper schema. This should probably be a (lightweight) 
FuGE-based thing as it is all in there already (CPAS does this 
iirc, although with an earlier version of FuGE).
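
A rough Python sketch of the cross-reference idea above, using the standard-library `xml.etree.ElementTree`. The element name `relatedFile`, its `ref` and `description` attributes, and the example identifier are all assumptions made for illustration; none of them exist in the mzML schema under review.

```python
import xml.etree.ElementTree as ET


def add_related_file(root: ET.Element, ref: str, description: str) -> ET.Element:
    """Insert a hypothetical [0..*] cross-reference element near the top of the file."""
    # Place the new element after any existing cross-references,
    # assuming they all sit at the start of the root element.
    index = len(root.findall("relatedFile"))
    element = ET.Element("relatedFile", {"ref": ref, "description": description})
    root.insert(index, element)
    return element


# Stand-in document: a root element with a single run.
root = ET.Element("mzML")
ET.SubElement(root, "run", {"id": "run1"})

# "urn:example:..." is a placeholder for whatever uniquely identifies an mzML file.
add_related_file(root, "urn:example:mzml:fraction-02",
                 "next fraction from the same sample")

xml_text = ET.tostring(root, encoding="unicode")
```

Each file still holds exactly one run here; the association between files is carried by the references rather than by folder layout or file naming.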

Cheers, Chris.

Eric Deutsch wrote:
>> From: psi...@li... [mailto:psidev-ms-dev-
>> bo...@li...] On Behalf Of Matthew Chambers
>>
>> What's wrong with the schema supporting multiple runs per file and letting
>> implementers gradually add support for it?  There are many features of
>> mzML that will require substantial rewrites of the existing parser APIs.
>> Parameter groups, multiple runs, multiple precursors, and compressed
>> binary data are all major "completely predictable trouble spots."  As long
>> as the file readers develop faster than the file writers, there won't be a
>> problem. ;)  I very much doubt that writers (e.g. ReAdW) will be writing
>> multiple instrument files into one mzML file any time soon (unless
>> somebody is itching to do this without saying so?).  The parameter groups
>> and multiple precursors are more problematic, IMO, but still good
>> improvements.
> 
> If I recall correctly, the feature of multiple runs crept in at the end of the Seattle meeting last fall. Can anyone articulate a compelling use case for multiple runs per file?  I seem to recall a scenario where at least one vendor encodes multiple runs in a single (wiff??) file, but I don't know about any of that for sure. Anyone have such a case?
> 
>> I have a few comments:
>>
>> - There seems to be a timestamp on the run element now (maybe I just
>> missed it before), of type xs:dateTime.  It's an optional attribute and it
>> has an ambiguous meaning.  Why isn't this expanded into a start and stop
>> timestamp for the run?  Also, why is it optional?
> 
> I believe the intent was that the timestamp is the UT at the start of the run. We should clarify this.
> 
> Is it useful to encode the stop timestamp?
> 
> As for why optional, we imagined that in the real world this value might not be known properly. Imagine a scenario where someone is converting a legacy mzXML file to mzML. This information may not be available, sadly. It is certainly encouraged that modern converters/writers include it.
> 
>> - Most every cvParam has a "cvLabel" attribute that is "MS" but the
>> accession attribute of each cvParam seems to include the cvLabel in it
>> ("MS:xxxxxxxx").  If that is just a coincidence, I think it should be
>> changed so that it is required and the cvLabel can be eliminated.  If it's
>> not a coincidence, why is it like that?  If the parser needs to know which
>> vocabulary an accession number is from, it can parse until the colon
>> delimiter.  Alternatively, keeping cvLabel and getting rid of the "MS:" in
>> the accession attribute would allow somewhat more efficient parsing.  In
>> the alternative case, I suggest a required default cvLabel somewhere in
>> the header, similar to setting the default XML namespace.
> 
> cvLabel is really just an id to indicate which CV (as more completely defined within <cvList> above) the term comes from. It seems to be current best practice that life science CV accession numbers begin with an OBO namespace, :, and a number. But not all CVs will necessarily follow this convention as far as I'm aware.
> 
>> - I see a TODO item is giving the binaryDataArray's "dataType" attribute a
>> CV entry.  I agree with this.  But I think the values should be more
>> machine-oriented, like "float32", "float64", "int32", "uint64", etc.
> 
> You mean "float32" is preferred over "32-bit float"?
> 
>> - Parameter groups are good, especially since the spectrum headers seem to
>> have ballooned to be more flexible.  Anything that makes the file-
>> dominating spectrum elements smaller and faster to parse is nice -
>> indexing the shared parameters is a good way to do this.
>>
>> - I'd still like to see a clear definition of "run" relative to "sample"
>> and "source file."  Seems like these three are all tightly coupled.
> 
> For LC-MSn ion trap data, this is relatively straightforward and is what is depicted in the examples. A run is a series of scans, usually counted consecutively by the instrument, obtained as a sample is injected into the instrument. A sample is the biomaterial that is injected into an instrument over a run.  The source file is the one or more files from which the mzML was generated. It will usually be a single vendor-format raw file. It could be an mzXML file. It could (unfortunately) be a series of dta files.
> 
> For MALDI or gel spot processing, however, this might be quite different. We had previously entertained "analyte identifier" or "MALDI spot identifier" CV terms to allow annotating each spot.  This might make the run-based sample undefined. I think LC-MSn heavily colored our thinking during development. It would be extremely nice to have a detailed example of data where individual scans refer to different "analyte identifiers" or the like. Would someone contribute this?
> 
> Regards,
> Eric
> 
>>
>> -Matt Chambers
>>
>> Vanderbilt MSRC
>>
>>
>>
>> ________________________________
>>
>> From: psi...@li... [mailto:psidev-ms-dev-
>> bo...@li...] On Behalf Of Brian Pratt
>> Sent: Thursday, August 02, 2007 2:47 PM
>> To: psi...@li...
>> Subject: Re: [Psidev-ms-dev] mzML 0.93 ready for first review
>>
>>
>>
>> (Note: I know I'm late to the party with this comment, but I think it's
>> important)
>>
>>
>>
>> I noticed this in the todo file:
>>
>> " - Now that we're allowing multiple runs in a file, how will the index
>> look to handle this?"
>>
>>
>>
>> Better question: what will software that uses such an index look like?
>>
>>
>>
>> Answer: it won't look much like anything that currently reads mzXML and
>> mzData - including X!Tandem or anything using RAMP (TPP and others) or
>> JRAP (CPAS and others).  These programs easily deal with both mzData and
>> mzXML in their various versions by using APIs which, as it happens, assume
>> one file per run and one run per file.   Breaking this one to one
>> correspondence in mzML means you can't just slide mzML support in behind
>> the API, and of course also violates a fundamental assumption which flows
>> through the code that calls these APIs, right out to the user interface in
>> most cases.  This means extensive surgery to any program that wants to
>> read mzML properly, and my guess is that means mzML is DOA.  At a minimum
>> it becomes a completely predictable trouble spot since you can now write
>> legal mzML files that the majority of mzML readers will simply not know
>> how to handle.   They'll be OK with RunList::count == 1, but no more - so,
>> why set ourselves up for trouble?
>>
>>
>>
>> Multiple runs per file are probably useful in some cases, but if the
>> stated goal of mzML is to replace mzXML and mzData then I think this
>> feature is actually scope creep which threatens the mission and should be
>> dropped.  Let those who really want this feature come up with a wrapper
>> schema, but don't call it mzML lest you force the vast majority of mzML
>> consuming software to be broken from the start.
>>
>>
>>
>> - Brian
>>
>>
>>
>> ________________________________
>>
>> From: psi...@li... [mailto:psidev-ms-dev-
>> bo...@li...] On Behalf Of Eric Deutsch
>> Sent: Thursday, August 02, 2007 1:02 AM
>> To: len...@eb...; Jimmy Eng; lu...@eb...; Puneet Souda;
>> Joshua Tasman; Pierre-Alain Binz; Henning Hermjakob; Randy Julian; Andy
>> Jones; David Creasy; Sean L Seymour; Angel Pizarro; David Fenyo;
>> Jam...@wa...; Mike Coleman; Matthew Chambers; Helen Jenkins;
>> Philip Jones; Shofstahl, Jim; Brian Pratt; Andreas Römpp; Kent Laursen;
>> Martin Eisenacher; Fredrik Levander; Jayson Falkner; Pedrioli Patrick Gino
>> Angelo; Hans Vissers; Eric Deutsch; cl...@br...;
>> dav...@ag...; rb...@be...; psidev-ms-
>> de...@li...
>> Cc: Rolf Apweiler; Ruedi Aebersold
>> Subject: [Psidev-ms-dev] mzML 0.93 ready for first review
>>
>> Hi everyone, after considerable hard work from many people, we have a
>> prerelease of mzML (the union of mzData and mzXML) available for comment
>> by you, a major stakeholder in mzML.
>>
>> You may download a kit of material to examine at:
>>
>> http://db.systemsbiology.net/projects/PSI/mzML/mzML_beta1R1.zip
>>
>> The general mzML development page is at:
>>
>> http://psidev.info/index.php?q=node/257
>>
>> Please send feedback to:
>>
>> psi...@li...
>>
>> We ask that you respond by August 20.
>>
>> Additional releases with more information may be provided during the
>> coming month.
>>
>> The current format has been guided by these principles:
>>
>> - Keep the format simple
>>
>> - Minimize alternate ways of encoding the same information
>>
>> - Allow some flexibility for encoding new important information
>>
>> - Support the features of mzData and mzXML but not a lot more
>>
>> - But do provide clear support for SRM data
>>
>> - Finish the format soon with the resources available
>>
>> There are many enhancements that have been suggested, but the small group
>> of volunteers that have actively developed this format have opted to focus
>> on the primary goal set before us: develop a single format that the
>> vendors and current software can easily support and thereby obsolete
>> mzData and mzXML. The enhancements not considered compatible with this
>> goal will be entertained for mzML 2.0.
>>
>> We are committed to providing not just the format, but also a set of
>> working implementations, converters and readers, as well as a format
>> validator, all to ensure that mzML is a format that will be adopted
>> quickly and implemented uniformly. Prior to submission to the PSI document
>> process, the following software will implement mzML:
>>
>> - 2 or more converters from vendor formats to mzML
>>
>> - the popular reader library RAMP that currently supports mzData and mzXML
>>
>> - an mzML semantic validator that checks for correct implementation
>>
>> We hope to follow this schedule:
>>
>> 2007-08-02 Release of mzML beta1R1 to major stakeholders for comment
>>
>> 2007-08-20 Comments from major stakeholders received
>>
>> 2007-09-01 Revised mzML 1.0 submitted to PSI document process, beginning
>> 30 days internal review
>>
>> 2007-10-01 Revised mzML 1.01 begins 60 days community review
>>
>> 2007-10-06 Formal announcement that feedback is sought at HUPO world
>> congress
>>
>> 2007-12-01 Formal 60 days community review closes
>>
>> 2008-01-01 Revised mzML 1.02 officially released
>>
>> Thank you for your help! Feel free to forward this message to someone whom
>> you think should review the format at this stage.
>>
>> Regards,
>>
>> Eric
> 
> 

-- 
~~~~~~~~~~~~~~~~~~~~~~~~
  chr...@eb...
  http://mibbi.sf.net/
~~~~~~~~~~~~~~~~~~~~~~~~