Re: [Psidev-ms-dev] Nailing down NativeID

Hi Eric,

Of course. Sorry, I should have realized that the axis name concept would 
confuse matters. The axis names are just there so that a machine reading 
the format specification can associate each comma-delimited section 
(what I'm calling an "axis") with a logical name.

Thermo:
0,1 (controller 0, scan 1)
0,2
0,3
1,1 (controller 1, scan 1)

Waters:
1,0,1 (function 1, process 0, scan 1)
1,0,2
1,0,3
2,0,1 (function 2, process 0, scan 1)
2,0,2
2,0,3

WIFF:
0,1,1,2 (sample 0, period 1, cycle 1, experiment 2)
0,1,1,3
0,1,2,2
0,1,2,3
0,1,2,4
0,1,3,2
0,1,3,3
0,1,3,2
0,1,4,2
1,1,1,2
1,1,1,3

When a machine reads the WIFF definition, it will know that the fields 
mean (in order) "sample #", "period #", "cycle #", "experiment #". The 
detailed meaning of those names won't be covered by the format 
definition, but it's conceivable that we could define those names in 
detail as separate CV terms. Remember, the main idea for nativeID is to 
map a spectrum back to a source file in a way that is more intuitive 
than a simple index, so being able to use the ids to look up the 
spectrum via a native interface is important.
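
A minimal sketch of how a reader might use such a definition, in Python. The definition string is the WIFF draft quoted further down in this thread. The function names and the XSD_MINIMUM table are hypothetical, and only the two XSD integer types used in the drafts are handled (no regex values).

```python
# Draft WIFF definition string, copied from the proposal quoted below.
WIFF_DEF = (
    "sample number=xsd:nonNegativeInteger,period number=xsd:positiveInteger,"
    "cycle number=xsd:positiveInteger,experiment number=xsd:positiveInteger"
)

# Smallest legal value for each XSD integer type used in the drafts.
# (Hypothetical helper table; other XSD types are not handled here.)
XSD_MINIMUM = {"xsd:nonNegativeInteger": 0, "xsd:positiveInteger": 1}


def parse_definition(definition):
    """Split a format definition into ordered (axis name, XSD type) pairs."""
    axes = []
    for pair in definition.split(","):
        name, xsd_type = pair.split("=")
        axes.append((name.strip(), xsd_type.strip()))
    return axes


def parse_native_id(definition, native_id):
    """Map each comma-delimited field of a nativeID to its axis name."""
    axes = parse_definition(definition)
    fields = native_id.split(",")
    if len(fields) != len(axes):
        raise ValueError(
            "expected %d fields, got %d" % (len(axes), len(fields))
        )
    result = {}
    for (name, xsd_type), field in zip(axes, fields):
        value = int(field)  # raises ValueError for non-integer text
        if value < XSD_MINIMUM[xsd_type]:
            raise ValueError("%s=%d violates %s" % (name, value, xsd_type))
        result[name] = value
    return result


fields = parse_native_id(WIFF_DEF, "0,1,1,2")
print(fields["cycle number"], fields["experiment number"])
```

With the axis names attached, a reader could hand the individual values to a vendor interface rather than treating the id as an opaque string.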

I think we can safely require that nativeIDs always use all the fields, 
even if a particular axis has the same value for an entire run. For 
example, in Thermo data the controller number is almost always going to 
be the number corresponding to the MS controller (although the actual 
number is not guaranteed to be 0). For backwards compatibility with 
tools that expect Thermo ids to be scan numbers with an implicit 
assumption about the controller, it is very reasonable to require those 
tools to simply parse the id. Parsing a comma-delimited pair is far 
easier than all the other crap one must do to get proper mzML support. 
;) In particular, for you, Eric, and other TPP users, the RAMP adapter 
that pwiz uses will pass only the scan number (and make sure the 
spectrum is a mass spectrum).
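
To illustrate how little parsing a scan-number-only tool would need, here is a rough Python sketch. The function name and the expected_controller option are hypothetical, not part of pwiz, RAMP, or any existing tool. It assumes the Thermo id is the "controller,scan" pair shown in the examples above.

```python
def thermo_scan_number(native_id, expected_controller=None):
    """Return the scan number from a 'controller,scan' Thermo nativeID.

    expected_controller is a hypothetical option: when given, ids that
    belong to any other controller are rejected instead of being
    silently accepted.
    """
    # Unpacking raises ValueError unless there are exactly two fields.
    controller_text, scan_text = native_id.split(",")
    controller = int(controller_text)
    scan = int(scan_text)
    if expected_controller is not None and controller != expected_controller:
        raise ValueError(
            "id %r is from controller %d, not %d"
            % (native_id, controller, expected_controller)
        )
    return scan


print(thermo_scan_number("0,3"))
```

Since the MS controller number is not guaranteed to be 0, a tool like this would have to be told which controller to accept, or accept any.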

-Matt

Eric Deutsch wrote:
> Hi Matt, thanks, this looks well thought out, although I'm not sure I
> fully understand the syntax you're proposing. Can you provide one or two
> examples of each type?
>
> Thanks!
> Eric
>
>
>> -----Original Message-----
>> From: psi...@li... [mailto:psidev-ms-dev-bo...@li...] On Behalf Of Matthew Chambers
>> Sent: Tuesday, July 22, 2008 3:15 PM
>> To: Mass spectrometry standard development
>> Subject: [Psidev-ms-dev] Nailing down NativeID
>>
>> Hi all,
>>
>> I think it's overdue that we get this part of mzML formally specified -
>> at least for the vendors and generic formats. I am proposing a draft of
>> nativeID formats, the place to put the formats in the specification
>> documents, and to have mzML instance documents explicitly reference the
>> format they are using. This explicit reference should be required for
>> semantic validation, but I'd also recommend that mzML readers that don't
>> find (or that ignore) the specified nativeID format term simply treat the
>> nativeID as a free string (rendering it pretty useless, but at least
>> there would be a defined way to handle it). The terms would be placed in
>> the fileContent element to define the format for all nativeIDs in the
>> file.
>>
>> I propose that the nativeID formats become CV terms, and that the term
>> definitions define the formats unambiguously in a machine-readable way
>> that a semantic validator can use to validate the nativeIDs.  I will
>> list my format drafts in OBO format. Each specific native format
>> definition is a comma-delimited list of key-value pairs, where the key
>> is the axis name (e.g. "scan number") and the value specifies the format
>> of the axis in one of two ways:
>> 1) a Perl-style regular expression that can provide semantic/logical
>> choices for strings (e.g. "controller type" can be either "MS" or "PDA"
>> or "UV" etc.)
>> 2) an XSD type that can specify unrestricted strings or a numeric type
>> (possibly with semantic restrictions)
>>
>> I didn't actually need to use a regex for any of the formats below, but
>> I can see their usefulness. For example, they would be needed if I'm
>> wrong about Xcalibur and it makes more sense for Thermo spectra to use
>> controller names instead of controller numbers.
>>
>> Obviously the syntax of the format definitions is flexible if people
>> have better ideas (ideally one that could combine the power of regex and
>> XSD; "infinite cosmic power, itty bitty living space!").
>>
>> [Term]
>> id: MS:x
>> name: native spectrum identifier
>> def: "References a spectrum in a native (non-mzML) spectrum source
>> according to a strict format. The format is dependent on the type of the
>> spectrum source." [PSI:MS]
>> is_a: MS:1000524 ! data file content
>>
>> [Term]
>> id: MS:x
>> name: native chromatogram identifier
>> def: "References a chromatogram in a native (non-mzML) chromatogram
>> source according to a strict format. The format is dependent on the type
>> of the chromatogram source." [PSI:MS]
>> is_a: MS:1000524 ! data file content
>> ! note: I don't have any instances of native chromatogram identifiers,
>> but I can conceive of the possibilities!
>>
>> [Term]
>> id: MS:x
>> name: Thermo RAW spectrum identifier
>> def: "controller type=xsd:nonNegativeInteger,scan number=xsd:positiveInteger" [PSI:MS]
>> is_a: MS:x ! native spectrum identifier
>> ! note to Jim: apparently, Xcalibur can handle multiple controllers of
>> the same type, so is a choice between strings still appropriate?
>>
>> [Term]
>> id: MS:x
>> name: Waters RAW spectrum identifier
>> def: "function number=xsd:positiveInteger,process number=xsd:nonNegativeInteger,scan number=xsd:positiveInteger" [PSI:MS]
>> is_a: MS:x ! native spectrum identifier
>> ! note: is process number ever non-zero?
>>
>> [Term]
>> id: MS:x
>> name: WIFF spectrum identifier
>> def: "sample number=xsd:nonNegativeInteger,period number=xsd:positiveInteger,cycle number=xsd:positiveInteger,experiment number=xsd:positiveInteger" [PSI:MS]
>> is_a: MS:x ! native spectrum identifier
>>
>> [Term]
>> id: MS:x
>> name: ABI Oracle database spectrum identifier
>> def: "" [PSI:MS]
>> is_a: MS:x ! native spectrum identifier
>> ! note: need expertise here; alternatively, we could lump these spectra
>> in with DTA/PKL nativeIDs (see below) when they are extracted to T2Ds
>>
>> [Term]
>> id: MS:x
>> name: Bruker spectrum identifier
>> def: "" [PSI:MS]
>> is_a: MS:x ! native spectrum identifier
>> ! note: need expertise here. AFAIK, each Bruker YEP/BAF/FID spectrum is
>> natively a single file, so that seems to make nativeID irrelevant and
>> sourceFile[Ref] critical
>>
>> [Term]
>> id: MS:x
>> name: Shimadzu spectrum identifier
>> def: "" [PSI:MS]
>> is_a: MS:x ! native spectrum identifier
>> ! note: need expertise here
>>
>> [Term]
>> id: MS:x
>> name: MGF spectrum identifier
>> def: "index=xsd:nonNegativeInteger" [PSI:MS]
>> is_a: MS:x ! native spectrum identifier
>> ! note: TITLE attributes are optional, so the index into the file is the
>> only reliable source (TITLE can be used for the string id if present)
>>
>> [Term]
>> id: MS:x
>> name: mzData/mzXML/MS2 spectrum identifier
>> def: "scan number=xsd:positiveInteger" [PSI:MS]
>> is_a: MS:x ! native spectrum identifier
>>
>> [Term]
>> id: MS:x
>> name: PKL/DTA spectrum identifier
>> def: "" [PSI:MS]
>> is_a: MS:x ! native spectrum identifier
>> ! note: like Bruker, a PKL or DTA could be standalone so AFAIK the only
>> way to reliably reference it is via sourceFileRef