Re: [Psidev-ms-dev] Nailing down NativeID

I prefer nativeIDs without the labels. Labels work better and can be 
verbose in the arbitrary string 'id'; nativeID is provided primarily for 
machine readability and guaranteed formatting so to me it just makes 
more sense to "KISS" (keep it small and simple). :)

Since the two types of ids co-exist, human interpretation of the 
nativeID is not an issue.

This is good discussion though, we just need more of it - even if it's a 
simple assent to the proposal (or the alternatives). :)

Thanks,
-Matt

Darren Kessner wrote:
> I think Fredrik has good points, and I like his idea of using short  
> labels.
>
> An alternative to consider is 3-4 letter abbreviations (using Matt's  
> examples):
>
> Thermo:
> "con0 scan1"
> "scan2"
>
> Waters:
> "fun1 proc0 scan1"
>
> WIFF:
> "sam0 per1 cyc1 exp2"
>
>
> Darren
>
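A minimal sketch of how a reader might split the abbreviation-labelled IDs from the examples above into named fields. The function name `parse_labeled_id` is hypothetical, and the sketch assumes each token is a lowercase label followed directly by a decimal number, with tokens separated by spaces; nothing in the thread fixes those rules.

```python
import re

# Assumption: one token = lowercase label + decimal number, e.g. "scan1".
_TOKEN = re.compile(r"([a-z]+)(\d+)")

def parse_labeled_id(native_id):
    """Split an ID such as "fun1 proc0 scan1" into {label: number}."""
    fields = {}
    for token in native_id.split():
        match = _TOKEN.fullmatch(token)
        if match is None:
            raise ValueError(f"malformed token: {token!r}")
        label, number = match.groups()
        if label in fields:
            raise ValueError(f"duplicate label: {label!r}")
        fields[label] = int(number)
    return fields
```

Because every field carries its own label, this style tolerates omitted axes ("scan2" parses the same way as "con0 scan1") at the cost of longer IDs.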
>
> On Sep 18, 2008, at 12:18 PM, Fredrik Levander wrote:
>
>   
>> Hi Matt,
>>
>> I agree that the Native ID is a very important feature of the format and
>> that it needs to be settled. Your solution is elegant, I can see two
>> disadvantages though:
>> 1) It is not straightforward to interpret the nativeID by visual
>> inspection, since you need to look in the CV to find out what order the
>> numbers are in.
>> 2) If the number in one axis is unknown or irrelevant for the setup, it
>> could be a problem to have it as required. One could imagine just
>> specifying an empty field instead of a number in that situation though.
>>
>> An alternative is to have reserved characters in the native id:
>> S = scan
>> F = function
>> C = controller
>> P = process
>> Cy (or maybe Y) = Cycle
>> E = Experiment
>> Pe = Period
>> Other reserved letters can be added as needed.
>>
>> Then one can specify these as required for the instrumental setup.
>> Scan 1 would be "S1"
>> Function 1, Scan 1 would be "F1S1" or "S1F1" or "S1,F1", the latter if
>> comma separation is wanted.
>> If a certain order of the axes is wanted this can be imposed by regex.
>> A problem with this solution could be if an axis needs to contain
>> letters instead of numbers, but it is doable, at least with comma
>> separation.
>>
>> A combination of the CV approach and initiating letters could maybe also
>> be an alternative:
>>
>> [Term]
>> id: MS:x
>> name: Waters RAW spectrum identifier
>> def: "F:function number=xsd:positiveInteger (optional),P:process
>> number=xsd:nonNegativeInteger (optional),S:scan number=xsd:positiveInteger"
>>
>> Valid nativeIDs are: "F1,S1" and "F1,P1,S1", but not "F1"
>>
>> It would be good to have some input on what is required to report for
>> the rest of the vendor instruments too, but I think the nativeID format
>> should be settled soon.
>>
>> Fredrik
>>
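The combined Waters definition above (optional F, optional P, required S, comma separated, in that order) could be enforced with a single regular expression. This is only a sketch: the names `WATERS_ID` and `parse_waters_id` are made up for illustration, and the pattern accepts plain decimal digits without sign or leading zeros on positive fields, which is narrower than the full XSD lexical space.

```python
import re

# Assumed reading of the definition: F and P are optional, S is required,
# and the order F, P, S is fixed.
WATERS_ID = re.compile(
    r"(?:F(?P<function>[1-9]\d*),)?"   # function number, positive, optional
    r"(?:P(?P<process>\d+),)?"         # process number, non-negative, optional
    r"S(?P<scan>[1-9]\d*)"             # scan number, positive, required
)

def parse_waters_id(native_id):
    """Return the present fields as integers, or None if the ID is invalid."""
    match = WATERS_ID.fullmatch(native_id)
    if match is None:
        return None
    return {name: int(value)
            for name, value in match.groupdict().items()
            if value is not None}
```

With this pattern "F1,S1" and "F1,P1,S1" are accepted and "F1" is rejected, matching the examples in the message; "S1,F1" is also rejected because the pattern imposes an axis order.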
>> Matthew Chambers wrote:
>>     
>>> It's been 4 months since we released the format and we still can't point
>>> implementors to documentation specifying what nativeIDs must look like.
>>> Can we please comment on my proposal or get other proposals to discuss?
>>> I am not averse to initially leaving out the terms that I couldn't come
>>> up with well-defined formats for (Bruker, PKL, ABI Oracle, Shimadzu).
>>>
>>> -Matt
>>>
>>>
>>> -------- Original Message --------
>>> Subject: 	Re: [Psidev-ms-dev] Nailing down NativeID
>>> Date: 	Tue, 22 Jul 2008 21:28:34 -0500
>>> From: 	Matt Chambers <mat...@va...>
>>> Reply-To: 	Mass spectrometry standard development
>>> <psi...@li...>
>>> To: 	Mass spectrometry standard development
>>> <psi...@li...>
>>> References: 	<488...@va...>
>>> <5BE...@he...>
>>>
>>>
>>>
>>> Hi Eric,
>>>
>>> Of course, sorry I should have realized that the axis name concept would
>>> confuse matters. The axis names are just there so that a machine reading
>>> the format specification can associate each comma delimited section
>>> (what I'm calling an "axis") with a logical name.
>>>
>>> Thermo:
>>> 0,1 (controller 0, scan 1)
>>> 0,2
>>> 0,3
>>> 1,1 (controller 1, scan 1)
>>>
>>> Waters:
>>> 1,0,1 (function 1, process 0, scan 1)
>>> 1,0,2
>>> 1,0,3
>>> 2,0,1 (function 2, process 0, scan 1)
>>> 2,0,2
>>> 2,0,3
>>>
>>> WIFF:
>>> 0,1,1,2 (sample 0, period 1, cycle 1, experiment 2)
>>> 0,1,1,3
>>> 0,1,2,2
>>> 0,1,2,3
>>> 0,1,2,4
>>> 0,1,3,2
>>> 0,1,3,3
>>> 0,1,3,2
>>> 0,1,4,2
>>> 1,1,1,2
>>> 1,1,1,3
>>>
>>> When a machine reads the WIFF definition, it will know that the fields
>>> mean (in order) "sample #", "period #", "cycle #", "experiment #". The
>>> detailed meaning of those names won't be covered by the format
>>> definition, but it's conceivable that we define those names in detail as
>>> separate CV terms. Remember the main idea for nativeID is to map a
>>> spectrum back to a source file in a way that is more intuitive than a
>>> simple index, so being able to use them to look up the spectrum via a
>>> native interface is important.
>>>
>>> I think we can safely require that the nativeIDs always use all the
>>> fields even if for an entire run all of a particular axis has the same
>>> value. For example, in Thermo data the controller number is almost
>>> always going to be the number corresponding with the MS controller
>>> (although the actual number is not guaranteed to be 0). For backwards
>>> compatibility with tools which expect Thermo ids to be scan numbers with
>>> an implicit assumption about the controller, it is very reasonable to
>>> require those tools to simply parse the id. Parsing a comma-delimited
>>> pair is far easier than all the other crap one must do to get proper
>>> mzML support. ;) In particular for you Eric and other TPP users, the
>>> RAMP adapter that pwiz uses will pass only the scan number (and make
>>> sure the spectrum is a mass spectrum).
>>>
>>> -Matt
>>>
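To illustrate how little parsing the unlabelled, comma-delimited form needs, here is a rough sketch that maps each field to its axis name, using the field orders from the Thermo, Waters and WIFF examples above. The `AXES` table and both function names are assumptions for illustration; in the proposal the axis order would come from the CV term, not from a hard-coded table.

```python
# Axis names per source, in the order used by the examples in this message.
AXES = {
    "Thermo": ("controller", "scan"),
    "Waters": ("function", "process", "scan"),
    "WIFF": ("sample", "period", "cycle", "experiment"),
}

def parse_positional_id(source, native_id):
    """Map "0,1,1,2"-style IDs to {axis name: number} for a given source."""
    names = AXES[source]
    parts = native_id.split(",")
    if len(parts) != len(names):
        raise ValueError(
            f"{source} nativeID needs {len(names)} fields, got {len(parts)}")
    return dict(zip(names, (int(part) for part in parts)))

def thermo_scan_number(native_id):
    """What a scan-number-only tool would extract from a Thermo nativeID."""
    return parse_positional_id("Thermo", native_id)["scan"]
```

A tool that only understands scan numbers could call something like `thermo_scan_number("0,2")` and ignore the controller, which is the backwards-compatibility path described above.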
>>>
>>> Eric Deutsch wrote:
>>>
>>>       
>>>> Hi Matt, thanks, this looks well thought out, although I'm not sure I
>>>> fully understand the syntax you're proposing. Can you provide one or two
>>>> examples of each type?
>>>>
>>>> Thanks!
>>>> Eric
>>>>
>>>>
>>>>
>>>>
>>>>         
>>>>> -----Original Message-----
>>>>> From: psi...@li... [mailto:psidev-ms-dev-bo...@li...] On Behalf Of Matthew Chambers
>>>>> Sent: Tuesday, July 22, 2008 3:15 PM
>>>>> To: Mass spectrometry standard development
>>>>> Subject: [Psidev-ms-dev] Nailing down NativeID
>>>>>
>>>>> Hi all,
>>>>>
>>>>> I think it's overdue that we get this part of mzML formally specified -
>>>>> at least for the vendors and generic formats. I am proposing a draft of
>>>>> nativeID formats, the place to put the formats in the specification
>>>>> documents, and to have mzML instance documents explicitly reference the
>>>>> format they are using. This explicit reference should be required for
>>>>> semantic validation, but I'd also recommend that mzML readers that don't
>>>>> find or ignore the nativeID format term specified simply treat the
>>>>> nativeID as a free string (rendering it pretty useless, but at least
>>>>> there would be a defined way to handle it). The terms would be placed in
>>>>> the fileContent element to define the format for all nativeIDs in the
>>>>> file.
>>>>>
>>>>> I propose that the nativeID formats become CV terms, and that the term
>>>>> definitions define the formats unambiguously in a machine-readable way
>>>>> that a semantic validator can use to validate the nativeIDs. I will
>>>>> list my format drafts in OBO format. Each specific native format
>>>>> definition is a comma-delimited list of key-value pairs, where the key
>>>>> is the axis name (e.g. "scan number") and the value specifies the format
>>>>> of the axis in one of two ways:
>>>>> 1) a Perl-style regular expression that can provide semantic/logical
>>>>> choices for strings (e.g. "controller type" can be either "MS" or "PDA"
>>>>> or "UV" etc.)
>>>>> 2) an XSD type that can specify unrestricted strings or a numeric type
>>>>> (possibly with semantic restrictions)
>>>>>
>>>>> I didn't actually need to use a regex for any of the formats below, but
>>>>> I can see their usefulness. For example, they would be needed if I'm
>>>>> wrong about Xcalibur and it makes more sense for Thermo spectra to use
>>>>> controller names instead of controller numbers.
>>>>>
>>>>> Obviously the syntax of the format definitions is flexible if people
>>>>> have better ideas (ideally one that could combine the power of regex and
>>>>> XSD; "infinite cosmic power, itty bitty living space!").
>>>>>
>>>>> [Term]
>>>>> id: MS:x
>>>>> name: native spectrum identifier
>>>>> def: "References a spectrum in a native (non-mzML) spectrum source
>>>>> according to a strict format. The format is dependent on the type of the
>>>>> spectra source." [PSI:MS]
>>>>> is_a: MS:1000524 ! data file content
>>>>>
>>>>> [Term]
>>>>> id: MS:x
>>>>> name: native chromatogram identifier
>>>>> def: "References a chromatogram in a native (non-mzML) chromatogram
>>>>> source according to a strict format. The format is dependent on the type
>>>>> of the chromatogram source." [PSI:MS]
>>>>> is_a: MS:1000524 ! data file content
>>>>> ! note: I don't have any instances of native chromatogram identifiers,
>>>>> but I can conceive of the possibilities!
>>>>>
>>>>> [Term]
>>>>> id: MS:x
>>>>> name: Thermo RAW spectrum identifier
>>>>> def: "controller type=xsd:nonNegativeInteger,scan
>>>>> number=xsd:positiveInteger" [PSI:MS]
>>>>> is_a: MS:x ! native spectrum identifier
>>>>> ! note to Jim: apparently, Xcalibur can handle multiple controllers of
>>>>> the same type, so is a choice between strings still appropriate?
>>>>>
>>>>> [Term]
>>>>> id: MS:x
>>>>> name: Waters RAW spectrum identifier
>>>>> def: "function number=xsd:positiveInteger,process
>>>>> number=xsd:nonNegativeInteger,scan number=xsd:positiveInteger" [PSI:MS]
>>>>> is_a: MS:x ! native spectrum identifier
>>>>> ! note: is process number ever non-zero?
>>>>>
>>>>> [Term]
>>>>> id: MS:x
>>>>> name: WIFF spectrum identifier
>>>>> def: "sample number=xsd:nonNegativeInteger,period
>>>>> number=xsd:positiveInteger,cycle number=xsd:positiveInteger,experiment
>>>>> number=xsd:positiveInteger" [PSI:MS]
>>>>> is_a: MS:x ! native spectrum identifier
>>>>>
>>>>> [Term]
>>>>> id: MS:x
>>>>> name: ABI Oracle database spectrum identifier
>>>>> def: "" [PSI:MS]
>>>>> is_a: MS:x ! native spectrum identifier
>>>>> ! note: need expertise here; alternatively, we could lump these spectra
>>>>> in with DTA/PKL nativeIDs (see below) when they are extracted to T2Ds
>>>>>
>>>>> [Term]
>>>>> id: MS:x
>>>>> name: Bruker spectrum identifier
>>>>> def: "" [PSI:MS]
>>>>> is_a: MS:x ! native spectrum identifier
>>>>> ! note: need expertise here. AFAIK, each Bruker YEP/BAF/FID spectrum is
>>>>> natively a single file, so that seems to make nativeID irrelevant and
>>>>> sourceFile[Ref] critical
>>>>>
>>>>> [Term]
>>>>> id: MS:x
>>>>> name: Shimadzu spectrum identifier
>>>>> def: "" [PSI:MS]
>>>>> is_a: MS:x ! native spectrum identifier
>>>>> ! note: need expertise here
>>>>>
>>>>> [Term]
>>>>> id: MS:x
>>>>> name: MGF spectrum identifier
>>>>> def: "index=xsd:nonNegativeInteger" [PSI:MS]
>>>>> is_a: MS:x ! native spectrum identifier
>>>>> ! note: TITLE attributes are optional, so the index into the file is the
>>>>> only reliable source (TITLE can be used for the string id if present)
>>>>>
>>>>> [Term]
>>>>> id: MS:x
>>>>> name: mzData/mzXML/MS2 spectrum identifier
>>>>> def: "scan number=xsd:positiveInteger" [PSI:MS]
>>>>> is_a: MS:x ! native spectrum identifier
>>>>>
>>>>> [Term]
>>>>> id: MS:x
>>>>> name: PKL/DTA spectrum identifier
>>>>> def: "" [PSI:MS]
>>>>> is_a: MS:x ! native spectrum identifier
>>>>> ! note: like Bruker, a PKL or DTA could be standalone so AFAIK the only
>>>>> way to reliably reference it is via sourceFileRef
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>           
>
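As a rough illustration of the machine-readable `def` strings proposed in the original message, the sketch below splits a definition into (axis name, type) pairs and checks a comma-delimited nativeID against it. Everything here is an assumption layered on the draft: the function names are invented, only the two XSD types that appear in the drafts are handled, each is approximated by a plain-digit pattern rather than the full XSD lexical rules, and regex-valued axes (which could themselves contain commas) are not supported.

```python
import re

# Simplified stand-ins for the two XSD types used in the draft terms.
XSD_PATTERNS = {
    "xsd:positiveInteger": re.compile(r"[1-9]\d*"),
    "xsd:nonNegativeInteger": re.compile(r"\d+"),
}

def parse_format_def(definition):
    """Split "axis=type,axis=type" into a list of (axis name, type) pairs."""
    axes = []
    for pair in definition.split(","):
        name, _, type_name = pair.partition("=")
        axes.append((name.strip(), type_name.strip()))
    return axes

def is_valid_native_id(definition, native_id):
    """Check field count and per-field type; unknown types raise KeyError."""
    axes = parse_format_def(definition)
    fields = native_id.split(",")
    if len(fields) != len(axes):
        return False
    return all(
        XSD_PATTERNS[type_name].fullmatch(field) is not None
        for (_, type_name), field in zip(axes, fields)
    )

# The Waters draft definition from the message, joined onto one line.
WATERS_DEF = ("function number=xsd:positiveInteger,"
              "process number=xsd:nonNegativeInteger,"
              "scan number=xsd:positiveInteger")
```

Under these assumptions `is_valid_native_id(WATERS_DEF, "1,0,1")` accepts the Waters example given earlier in the thread, while an ID with a missing field or a zero function number is rejected.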