From: <cod...@go...> - 2008-11-27 14:20:22
|
Comment #38 on issue 42 by matthew....@vanderbilt.edu: Issues with the CV http://code.google.com/p/psi-pi/issues/detail?id=42 The extra mgf metadata seems like a good reason to go with nativeID instead of ID. In mzML, the ID is totally arbitrary, but the nativeID is not. So if you're working from a non-mzML file, it's perfectly reasonable to use nativeID but not really ID. The basic nativeID for an MGF is the 0-based index into the file. If the title attribute has been written in a way that the reader can parse back to a vendor's nativeID, that's a sensible alternative. The other attributes are pretty messy IMO because they're either not required to be unique or they may encode scans or RTs from multiple acquisitions. I suggest them as userParams. -- You received this message because you are listed in the owner or CC fields of this issue, or because you starred this issue. You may adjust your issue notification preferences at: http://code.google.com/hosting/settings |
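Editorial sketch of the "0-based index into the file" idea from comment #38. It assumes an MGF file in which each spectrum is delimited by BEGIN IONS / END IONS lines and header fields are written as KEY=value; the function name `iter_mgf_spectra` is hypothetical and not part of any PSI tool.

```python
def iter_mgf_spectra(lines):
    """Yield (zero_based_index, headers) for each BEGIN IONS ... END IONS block.

    The index depends only on the position of the spectrum in the file, so it
    exists even when TITLE, SCANS and RTINSECONDS are all absent.
    """
    index = 0
    headers = None  # None means "currently outside a spectrum block"
    for raw in lines:
        line = raw.strip()
        if line == "BEGIN IONS":
            headers = {}
        elif line == "END IONS":
            if headers is not None:
                yield index, headers
                index += 1
            headers = None
        elif headers is not None and "=" in line:
            # Header lines look like KEY=value; peak lines contain no '='.
            key, _, value = line.partition("=")
            headers[key.strip().upper()] = value.strip()


sample = """BEGIN IONS
TITLE=first
PEPMASS=500.1
100.0 10
END IONS
BEGIN IONS
RTINSECONDS=12.5
200.0 5
END IONS
"""

for idx, hdr in iter_mgf_spectra(sample.splitlines()):
    print(idx, hdr)
```

The second spectrum has no TITLE at all, yet it still receives index 1, which is the property the comment relies on.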
From: <cod...@go...> - 2008-11-27 15:17:39
|
Comment #39 on issue 42 by dcreasy: Issues with the CV http://code.google.com/p/psi-pi/issues/detail?id=42 Interesting points Matt, and useful to have feedback from your mzML experience. When the input to the search is an mzML file, our spectrumID attribute is the mzML spectrum 'id'. This is 'easy' and a majority of people agreed at an earlier meeting that if you want further information like retention time, you need to go back to the mzML file. When the input to the search engine is an mgf file, things are not so easy, because different people use the title, scans and rtinseconds fields in different ways. Also, as you say, there is no guarantee that any of these are unique. In a case where someone has provided say the rtinseconds, but not a title, it would be useful to report this and to make it clear which of the possible values is being reported. Using a zero-based index into the MGF isn't an option for the general purpose program that takes a Mascot (.dat) results file and converts it to an analysisXML file because it doesn't have the mgf file and doesn't know what the offset is. btw, in case it's not clear, we don't currently have a nativeID attribute for the <SpectrumIdentificationResult>. A common use case might be that someone has an analysisXML document originating from an mgf search and thinks a result looks 'interesting'. They then want to go back to the original 'raw' data to look at it. Ideally, this should take as few steps as possible. The only safe spectrumID value for the Mascot converter is the Mascot query number (this is not what the examples use at the moment). So, the user needs the Mascot (.dat) results file to then find the title/scan/rtinseconds and from that can determine the scan number in the raw data. Seems like a long way round to me and requires that they also have the .dat file. We are trying to not use userParams too much in analysisXML because we are keen to make the most of the cv validation tools. 
So, I realise it's far from ideal, but I think what I'm proposing makes the best of a far from ideal situation. Or maybe I'm missing something? |
From: <cod...@go...> - 2008-11-27 15:26:35
|
Comment #40 on issue 42 by eisenachM: Issues with the CV http://code.google.com/p/psi-pi/issues/detail?id=42 Changed the obo according to comments 32-35. (For comment 33, I removed the " if a significant and fragment includes RKNQ." etc. from the defline.) (For comment 27b, I added a term "number of unmatched peaks".) TODOs left: - Newt.obo - add 2nd parent for Paragon scores (Sean) - decide comments 37 / 38 |
From: <cod...@go...> - 2008-11-27 15:30:37
|
Comment #41 on issue 42 by eisenachM: Issues with the CV http://code.google.com/p/psi-pi/issues/detail?id=42 With OBOEdit 1.101 I cannot edit the obo file (OBOEdit does not show its GUI, but the process is running and locks the file). The cause is the two "relationship: has_units: UO:0000XXX ! name" lines added in revision 273. I had to delete them in a text editor, edit with OBOEdit, then add them again with a text editor :-( |
From: <cod...@go...> - 2008-11-27 22:09:03
|
Comment #42 on issue 42 by matthew....@vanderbilt.edu: Issues with the CV http://code.google.com/p/psi-pi/issues/detail?id=42 Hi David, sorry for the long reply to follow... RE: spectrumID reference I'm aware of the decision to use mzML's id as the spectrumID but I'm bringing the point back up because the issue of non-mzML inputs was not discussed at the time (AFAIK). I do not see the justification for using the id instead of the nativeID when the latter must always exist for any input format whereas the former only makes sense from an actual mzML file. RE: MGF ids Having CV terms for various format attributes is not a terrible thing, but I worry because the scope is potentially much bigger than MGF->DAT->analysisXML. All of the non-mzML input formats that could potentially be used to generate an intermediate search result format and then converted to analysisXML will more often than not have this problem. Trying to account for the various transformations of the identifiers that could happen from this translation seems like a lost cause to me. The exception would be very specific pipelines where the inputs and outputs are tightly controlled and in those cases, userParams seem more appropriate than cvParams. Even in the case of MGF->DAT->analysisXML, some of your MGF inputs may be completely lacking in title, rt, and scan attributes, because they're all optional, so without an index it's all screwed! :( Just think of the combinations: modern vendor formats: Thermo RAW, Waters RAW, WIFF, YEP, BAF, FID, MassHunter, Shimadzu open formats: mzML, mzXML, mzData, MGF, DTA, MS[12], PKL, search result formats: pepXML, SQT, OUT, SRF, DAT, X! Tandem As I understand it, your specific use case is: take existing DAT files that were searched from MGFs with (unique?) title/RT/scan attributes and convert to analysisXML in a way that a generic reader can directly go back to the MGF data. 
The generic version of that use case is: take existing search results in any format that were searched from any spectra format and convert to analysisXML in a way that a generic reader can directly go back to the data in the input spectra format. Supporting the specific use case and not the generic one makes me cringe a bit, which is why I chimed in on the issue. Can't users just re-search their data and output directly to analysisXML with the index attribute intact? :P |
From: <cod...@go...> - 2008-11-28 09:32:13
|
Comment #43 on issue 42 by dcreasy: Issues with the CV http://code.google.com/p/psi-pi/issues/detail?id=42 Hi Matt, Surprised to get any reply from you at all yesterday ;) You are right, we kind of sidestepped the issue of non-mzML/mzData input formats at the time, so it's important to hammer it out now. And yes, you are quite right, we should try and support things as generally as possible. Incidentally, my proposal for the CV wasn't quite as narrow as you suggest (MGF->DAT->analysisXML); it could be any engine or format (or no intermediate file) in the middle. I actually had a slightly different use case in mind - it wasn't for a generic automated pipeline (which as you say is impossible to do reliably), but more for 'manual' inspection. So, if someone sees something 'interesting' they stand a chance of finding the original data manually with as few intermediate files as possible. However, you've got me thinking... suppose someone was writing a pipeline. They had MGF files consistently generated by software 'X' and they were using 3 different search engines that output analysisXML files. They would surely rather that the identifiers in the analysisXML files were of a consistent format using CV rather than differing formats using user params? I guess I'm not keen on the nativeID idea for the MGF because I couldn't see how it could be implemented retrospectively without the MGF files. Requiring that people re-search their data seems a little harsh. Also, there's bound to be a time period before mzML is widely supported. For .pkl and concatenated .dta files, there is no option, so I've not proposed any CV. For single dta files, there is no need for CV because it's multiple files and we have the filename. Excuse my ignorance, but what's MS[12]? Is there a guaranteed unique ID for mzXML? David |
From: <cod...@go...> - 2008-11-28 13:49:05
|
Comment #44 on issue 42 by eisenachM: Issues with the CV http://code.google.com/p/psi-pi/issues/detail?id=42 We specify the source file from which the AnalysisXML file was created and the Spectra_Data location. For our MPC use case both are URIs to database locations, which I created like this: <SourceFile id="SF1" location="proteinscape://www.medizinisches-proteom-center.de/PSServer/Project/Sample/Separation_1D_LC/Fraction_X/SpectraData/Results1"/> <SpectraData id="LCMALDI_spectra" location="proteinscape://www.medizinisches-proteom-center.de/PSServer/Project/Sample/Separation_1D_LC/Fraction_X"/> 1.) Was it intended like that? 2.) I suggest renaming <SourceFile> to <SourceData> (in obo, too). |
From: <cod...@go...> - 2008-11-28 14:19:08
|
Comment #45 on issue 42 by matthew....@vanderbilt.edu: Issues with the CV http://code.google.com/p/psi-pi/issues/detail?id=42 OK, I can see the argument of skipping the MGF and going straight to the native spectrum or spectra in a controlled manner, but I'll make a slightly more generic proposal. For example, many input formats may provide retention time information as an alternative way of identifying a scan, so that should be a generic concept. We already have it in the PSI-MS CV of course: the "scan time" term. So...now we are back to the original discussion about including mzML attributes in analysisXML (the other MGF attributes also probably belong in the MS CV, perhaps under an "alternative identifier" category). I'm not sure if your use case came up at the time though - I seem to recall it was mainly considered as a way of forwarding commonly-used attributes and not as an alternative identifier. So I would support the alternative identifier approach for TITLE, SCANS, and RAWSCANS (I could not find any documentation on the last one?), but not the retention time(s). That should re-use the existing term, multiple times if necessary. Also consider the use case of running a search straight from a native format. In such a case, the nativeID is well defined (and can be adopted now, without using mzML as an intermediate if that is not yet desired), the spectrumID is not. I think this use case is perfectly legitimate too, we do it often with MyriMatch which can read whatever formats pwiz can (currently Thermo RAW w/ Xcalibur, Waters RAW w/ MassLynx, Bruker/Agilent YEP, Bruker BAF/FID). And when we need to go back and view a spectrum in either the raw data or in an associated mzML, the best bet is to use the nativeID because it's well defined in either case. |
From: <cod...@go...> - 2008-11-28 14:23:10
|
Comment #46 on issue 42 by matthew....@vanderbilt.edu: Issues with the CV http://code.google.com/p/psi-pi/issues/detail?id=42 MS2 is essentially concatenated DTAs with added metadata to avoid the problem you mentioned about concatenated DTAs and PKLs. :) MS1 is equivalent for MS1 data. mzXML's scan attribute (xsd:nonZeroInteger) is required to be unique but obviously it can't always be used to track down the original nativeID easily. There is a nativeScanOrigin element meant to be able to do that, but it's not used frequently. |
From: <cod...@go...> - 2008-11-28 15:12:23
|
Comment #47 on issue 42 by eisenachM: Issues with the CV http://code.google.com/p/psi-pi/issues/detail?id=42 MPC_example.axml is an example using a protein decoy approach. It lists ALL proteins of a ProteinDetectionAnalysis and reports the "local FDR" for each protein in the score-sorted list. [BTW: In the CV we also have "local FDR" for peptides and "pep:global FDR" and "prot:global FDR", all as result values of an analysis.] We need an INPUT parameter "prot:FDR threshold" or "pep:FDR threshold" (probably in the branch "search input details"/"quality estimation method"), if we want to report only the proteins below a specified FDR. |
From: <cod...@go...> - 2008-11-28 15:49:29
|
Comment #48 on issue 42 by dcreasy: Issues with the CV http://code.google.com/p/psi-pi/issues/detail?id=42 OK, I assume that we can now agree on adding: mgf title, mgf scans, mgf rawscans. (The rawscans will be in the next release of Mascot, so isn't documented yet.) Yes, I totally agree that the retention time (possibly multiple times as you say) should be a generic term. For better or worse, we decided after much discussion not to share CV with mzML, so we'd need to have our own term for retention time. In fact, it looks as though we already have PI:00114 (although this can't currently be used because it's not in the mapping file). Even though we aren't sharing CV, we should at least use the same name and description as in mzML:
[Term]
id: MS:1000016
name: scan time
def: "The time that an analyzer started a scan, relative to the start of the run." [PSI:MS]
xref: value-type:xsd\:float "The allowed value-type for this CV term."
is_a: MS:1000503 ! scan attribute
relationship: has_units UO:0000003 ! time unit
Any objections? We'd not considered the use case of running a search straight from a native format... However, are you doing some sort of peak detection? Are you merging together any spectra - i.e. will there be multiple nativeIDs for each spectrum? Might the same nativeID be used for multiple spectra? (I notice that nativeIDs don't have to be unique in mzML.) -David |
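Editorial sketch: the [Term] stanza quoted in comment #48 uses the OBO flat-file layout of one `tag: value` pair per line, with an optional trailing `! comment`. The snippet below shows one way such a stanza could be read into a dictionary; `parse_obo_term` is a hypothetical helper, not an OBOEdit or PSI API, and it ignores escaping rules beyond what this one stanza needs.

```python
SCAN_TIME_STANZA = r'''[Term]
id: MS:1000016
name: scan time
def: "The time that an analyzer started a scan, relative to the start of the run." [PSI:MS]
xref: value-type:xsd\:float "The allowed value-type for this CV term."
is_a: MS:1000503 ! scan attribute
relationship: has_units UO:0000003 ! time unit
'''


def parse_obo_term(stanza: str) -> dict:
    """Parse one [Term] stanza into {tag: [values]}.

    A trailing ' ! comment' is dropped. Values are kept as lists because tags
    such as is_a and relationship may legitimately repeat.
    """
    term = {}
    for raw in stanza.splitlines():
        line = raw.strip()
        if not line or line == "[Term]":
            continue
        tag, _, value = line.partition(":")   # split at the first colon only
        value = value.split(" ! ")[0].strip()
        term.setdefault(tag.strip(), []).append(value)
    return term


term = parse_obo_term(SCAN_TIME_STANZA)
print(term["id"], term["name"], term["relationship"])
```

Reading the stanza this way makes it easy to compare the name and definition of a PSI-PI term against its mzML counterpart, which is the consistency David asks for.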
From: <cod...@go...> - 2008-11-28 16:27:39
|
Comment #49 on issue 42 by matthew....@vanderbilt.edu: Issues with the CV http://code.google.com/p/psi-pi/issues/detail?id=42 Can you point me to the discussion about (not) sharing CV? That seems a bit crazy to me (and contrary to the PSI CV guidelines?). I'm sure there are reasons though, I just want to see them. :) All of these terms are also things that would potentially be in an mzML file created from MGF (just like the Thermo filter line may be included from Thermo files), so that's why I suggested they all go in the MS CV. MGF is after all a generic MS format, not necessarily specific to proteomics even. :) NativeIDs in mzML must be unique. You just had to bring up merged spectra didn't you? ;) It gets pretty painful and hazy when the original acquisitions and their merged forms are kept in the same file. There's 2 issues there: 1) support representing both the merged spectra and the separate acquisitions as independent spectra? or only support one or the other 2) if yes to 1, and nativeID must be unique, there are several possible solutions: a) just taking the first acquisition's nativeID won't be unique, so we extend the nativeID syntax to support either ranges (Thermo: "controller=0 scan=[2,10]") or lists of nativeIDs ("controller=0 scan=2,controller=0 scan=15,controller=0 scan=50") or perhaps a combination of both b) use a special convention for nativeIDs of merged spectra that indicates to a semantic validator that the nativeID is irrelevant and only the acquisitionList is important; e.g. nativeID="merged" (since nativeID is string and not xsd:ID, it won't be invalid syntax) Really there's no nativeID for a merged spectrum, so anything we come up with is a workaround. Finally, several vendor formats allow peak picking straight out of their API, namely Thermo, ABI, and Bruker. So for these formats MyriMatch works straight off by just asking for the centroids. 
For other formats, we don't have an external peak picker yet (in ProteoWizard) but we will "Real Soon Now." And no, when reading straight from the vendor file we don't merge, so nativeIDs are direct. |
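Editorial sketch of the nativeID forms discussed in comment #49. It assumes a nativeID is a space-separated run of key=value tokens (as in the Thermo examples above) and that a merged spectrum would carry a comma-separated list of such nativeIDs, which is option (a) in the comment and only a proposal, not an adopted syntax. Both function names are hypothetical.

```python
def parse_native_id(native_id: str) -> dict:
    """Split a single nativeID such as 'controller=0 scan=2' into its fields."""
    fields = {}
    for token in native_id.split():
        key, sep, value = token.partition("=")
        if not sep or not key or not value:
            raise ValueError(f"not a key=value token: {token!r}")
        fields[key] = value
    return fields


def parse_merged_native_ids(value: str) -> list:
    """Parse the proposed comma-separated list form for merged spectra."""
    return [parse_native_id(part) for part in value.split(",")]


print(parse_native_id("controller=0 scan=2"))
print(parse_merged_native_ids(
    "controller=0 scan=2,controller=0 scan=15,controller=0 scan=50"))
```

The sketch deliberately does not handle the range form `scan=[2,10]`: because that form also contains a comma, a plain split on commas cannot distinguish it from the list form, and `parse_merged_native_ids` raises ValueError on it. Any syntax that combines ranges and lists would need a different delimiter or a real grammar.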
From: <cod...@go...> - 2008-11-28 17:43:24
|
Comment #50 on issue 42 by dcreasy: Issues with the CV http://code.google.com/p/psi-pi/issues/detail?id=42 Um. I can't find the relevant minutes that describe why we aren't importing the MS CV... I recollect that part of the discussion was along the lines that the structure would be different between the two and this could become a logistical nightmare. Also, that the mzML CV is not yet stable and trying to get the two to work together in a timely manner wasn't considered to be feasible. I'm no expert with CV, so am probably not the best person to answer this. Guess we (both groups) may have to defend this decision. btw, there's no constraint for nativeID being unique in the mzML schema so I assumed it didn't have to be unique. I think this is starting to get a little beyond the scope of analysisXML. 'All' we want is to be able to say which spectrum a result relates to, and we can realistically only report back whatever is fed into the search engine. Is the following 'good enough' for all cases (even if we aren't 100% happy with it)? The spectrumID attribute in analysisXML instance documents must be unique. spectrumID:
- for mzML files, must be the <spectrum id>, which is enforced as unique in the mzML schema
- for mzData files, the <spectrum id> value; should be unique, but not enforced in the mzData schema
- for mzXML, the scan attribute; should be unique, but not enforced in the mzXML schema?
- other files: any unique value, possibly generated by the search engine.
Add the following optional CV terms:
- scan time (maxOccurs="unbounded" for merged spectra)
- nativeID (maxOccurs="unbounded" for merged spectra)
- mgf title
- mgf scans
- mgf rawscans
So MyriMatch would presumably report the nativeID. Does this sound reasonable? |
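Editorial sketch of the per-format rule proposed in comment #50. It assumes the input spectrum has already been read into a plain dictionary; the keys `"id"` and `"scan"` and the function `choose_spectrum_id` are hypothetical names chosen for illustration, and the fallback simply uses a counter supplied by the search engine (for example a query number).

```python
def choose_spectrum_id(file_format: str, record: dict, engine_counter: int) -> str:
    """Pick the spectrumID value according to the input file format.

    mzML / mzData : the spectrum id value
    mzXML         : the scan attribute
    anything else : a unique value generated by the search engine
    """
    fmt = file_format.lower()
    if fmt in ("mzml", "mzdata"):
        return str(record["id"])
    if fmt == "mzxml":
        return str(record["scan"])
    # MGF, PKL, concatenated DTA, ...: nothing in the file is guaranteed unique.
    return str(engine_counter)


print(choose_spectrum_id("mzML", {"id": "S12"}, 1))
print(choose_spectrum_id("mzXML", {"scan": 42}, 1))
print(choose_spectrum_id("MGF", {"title": "not guaranteed unique"}, 7))
```

The optional CV terms in the proposal (scan time, nativeID, mgf title, mgf scans, mgf rawscans) would be reported alongside this value rather than replace it.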
From: <cod...@go...> - 2008-11-28 18:10:28
|
Comment #51 on issue 42 by matthew....@vanderbilt.edu: Issues with the CV http://code.google.com/p/psi-pi/issues/detail?id=42 We didn't have any XSD gurus in the mzML group or they didn't chime in so the uniqueness of nativeID is not XSD-derived, it's in the specification docs and the semantic validators enforce it (actually they are much more than just unique, their format is strictly defined depending on the source file). I presume this is the related XSD uniqueness code, what does it mean in plain English? :)
<xsd:element name="SpectrumIdentificationList" type="psi-pi:PSI-PI.analysis.search.SpectrumIdentificationListType" abstract="false" substitutionGroup="psi-pi:AnalysisResultList">
  <xsd:unique name="PK_COMPOSITE_SpecRef">
    <xsd:selector xpath="./*"/>
    <xsd:field xpath="@spectrumID"/>
    <xsd:field xpath="@SpectraData_ref"/>
  </xsd:unique>
</xsd:element>
I don't understand the hesitation to use nativeID which already has the "if mzML it means this, if mzData it means this, if mzXML it means this, if MGF it means this, etc." logic defined. That way implementers can use the same nativeID parsing code for both standards. |
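Editorial sketch answering the "plain English" question in comment #51. An xsd:unique constraint with selector `./*` and two fields says that, among the direct children of one SpectrumIdentificationList, no two children may share the same (spectrumID, SpectraData_ref) pair; children that lack either attribute are not checked. The function below spells that out with Python's standard `xml.etree.ElementTree`; the function name and the sample document are illustrative assumptions, not taken from the schema.

```python
import xml.etree.ElementTree as ET


def duplicate_spectrum_refs(list_element):
    """Return (spectrumID, SpectraData_ref) pairs that occur more than once
    among the direct children of a SpectrumIdentificationList element."""
    seen = set()
    duplicates = []
    for child in list_element:  # corresponds to the selector "./*"
        key = (child.get("spectrumID"), child.get("SpectraData_ref"))
        if None in key:
            continue  # xsd:unique skips elements missing any field
        if key in seen:
            duplicates.append(key)
        seen.add(key)
    return duplicates


doc = ET.fromstring(
    "<SpectrumIdentificationList>"
    '<SpectrumIdentificationResult spectrumID="1" SpectraData_ref="SD1"/>'
    '<SpectrumIdentificationResult spectrumID="1" SpectraData_ref="SD2"/>'
    '<SpectrumIdentificationResult spectrumID="1" SpectraData_ref="SD1"/>'
    "</SpectrumIdentificationList>"
)
print(duplicate_spectrum_refs(doc))
```

Note that the same spectrumID under two different SpectraData_ref values is allowed; only the combination has to be unique, and only within a single list.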
From: <cod...@go...> - 2008-11-28 18:14:28
|
Comment #52 on issue 42 by matthew....@vanderbilt.edu: Issues with the CV http://code.google.com/p/psi-pi/issues/detail?id=42 I meant nativeID instead of spectrumID to facilitate analysisXML output from non-mzML input. I still agree with scan time (my preference is not duplicating terms between CVs), and mgf title/scans/rawscans. |
From: <cod...@go...> - 2008-11-28 19:16:39
|
Comment #53 on issue 42 by dcreasy: Issues with the CV http://code.google.com/p/psi-pi/issues/detail?id=42 Hi Matt, The nativeID enforcement in mzML sounds good to me. I'm very sorry, but I don't actually understand what you are proposing: a) To change the name of the attribute from spectrumID -> nativeID b) For mzML, to reference the nativeID rather than the id c) Add a nativeID attribute as well as a spectrumID d) ? Or some combination of the above! Perhaps it's just getting too late for me ;) David |
From: <cod...@go...> - 2008-11-28 19:29:41
|
Comment #54 on issue 42 by matthew....@vanderbilt.edu: Issues with the CV http://code.google.com/p/psi-pi/issues/detail?id=42 Both a & b to emphasize the fact that the nativeID is defined no matter what the format of the source file is. Also, just like mzML, you would define that format at the top of the file, although it doesn't appear there is an analysisXML equivalent to "fileContent/fileDescription" in mzML. The nativeID formats are defined in the mzML CV and the terms map to that top header to define the nativeID format for every spectrum in the file: see CV terms starting at MS:1000767 in http://psidev.cvs.sourceforge.net/*checkout*/psidev/psi/psi-ms/mzML/controlledVocabulary/psi-ms.obo |
From: <cod...@go...> - 2008-11-30 19:08:07
|
Comment #55 on issue 42 by dcreasy: Issues with the CV http://code.google.com/p/psi-pi/issues/detail?id=42 Um... in discussions with members of the mzML group it's always been agreed that this is what we will be using. It was last agreed and documented that we use the id at a teleconference (which Eric also attended) on 2nd October: http://psidev.info/index.php?q=node/374 As you said above: "Really there's no nativeID for a merged spectrum, so anything we come up with is a workaround." Am I missing something - are you suggesting that search engines should not rely on the mzML id value and store this in output files? The mzML schema documentation says, for the id: <xs:documentation>A unique identifier for this spectrum. It should be expected that external files may use this identifier together with the mzML filename or accession to reference a particular spectrum.</xs:documentation> Do you think that this is incorrect? Also, to change the term from spectrumID -> nativeID would, I think, be confusing. The term makes perfect sense in the context of an mzML document, but for analysisXML it could easily imply something native to the search engine rather than one of its input files? btw, the file formats of all the input files (spectra, fasta, search engine outputs) are all defined in the analysisXML documents (search on <pf:fileFormat>) so I'm not sure what you mean. (Ah... I see that a couple of the examples seem to be missing these - hopefully they will get corrected soon. Thanks for pointing this out.) David |
From: <cod...@go...> - 2008-11-30 19:36:25
|
Comment #56 on issue 42 by matthew....@vanderbilt.edu: Issues with the CV http://code.google.com/p/psi-pi/issues/detail?id=42 Yes, I was at that meeting too. :) The one (important, IMO) use case we did not consider at that time is output of analysisXML without a corresponding mzML document. In such a case, the mzML arbitrary id does not exist, but the nativeID does. This fact convinces me that nativeID is a better reference than the arbitrary id. The change of attribute name to nativeID is not so critical, but I think the risk of confusing the spectrumID with the id attribute when it actually points to the nativeID attribute is worse than the risk of confusing the nativeID attribute with some property of the search engine. I think the documentation for the nativeID attribute can easily make it clear what it's supposed to reference, especially since it's on a spectrum-centric element; you can copy it from the mzML schema (although I think this documentation could be improved upon): <xs:documentation>The native identifier for the spectrum, used by the acquisition software.</xs:documentation> It's good to know about the header information. The nativeID (or whatever it's called in analysisXML) format term would go in the spectra input definition as a CV Param required by the mapping file. |
From: Jones, A. <And...@li...> - 2008-12-01 13:49:06
|
Hi all, The issues list is getting a bit messy with essentially a mailing list discussion so I'll shift the discussion back here :-) There are two points up for discussion. 1) Use of identifiers for input spectra 2) CV terms shared between psi-ms and psi-pi
> Comment 54 by matthew....@vanderbilt.edu, Nov 28 (2 days ago)
> Both a & b to emphasize the fact that the nativeID is defined no matter what the format of the source file is. Also, just like mzML, you would define that format at the top of the file, although it doesn't appear there is an analysisXML equivalent to "fileContent/fileDescription" in mzML. The nativeID formats are defined in the mzML CV and the terms map to that top header to define the nativeID format for every spectrum in the file: see CV terms starting at MS:1000767 in http://psidev.cvs.sourceforge.net/*checkout*/psidev/psi/psi-ms/mzML/controlledVocabulary/psi-ms.obo
In terms of 1) I've worked through Matt's argument and I'm in general agreement that we would like to use the same system for identifying the input spectrum - these CV terms have only been added relatively recently. I did not realise that the nativeID attribute had been specified to this level, since there is no documentation about this in the XSD or mzML specification document. I don't think we should change the name of the attribute though, since nativeID makes sense for an element called <Spectrum> in mzML but not for an element <SpectrumIdentificationResult> in analysisXML. For referencing mzML spectra, I'm still not sure which attribute we should choose to reference since the "true" (and guaranteed unique) spectrum identifier in mzML is actually the ID attribute. I can envisage a case where instruments output mzML directly and the nativeID is not implemented sensibly. The xs:ID datatype on "ID" guarantees that these will always be unique whatever changes happen to documentation in the future or whatever tools are used to create the file. 
So I agree with Matt but I don't want to change the schema :-) I'm happy to add something to the documentation specifying how different identifiers should be implemented, following the rules in the psi-ms CV. In terms of 2), we had made a decision in the past that we would simply create terms as we need them in PSI-PI, rather than worrying if they should be common between psi-ms and psi-pi and trying to coordinate updates across groups. If a term is present in psi-ms with the exact meaning that we want (taking into account its position in the hierarchy), I think we should just use it and update the mapping file to reference it. Are there many terms from psi-ms that we want to use? I am working on the spec document today and would like to get all issues tidied up ASAP... Cheers Andy |
From: Matt C. <mat...@va...> - 2008-12-01 14:22:44
|
Jones, Andy wrote: > Hi all, > > The issues list is getting a bit messy with essentially a mailing > list discussion so I'll shift the discussion back here :-) > > There are two points up for discussion. > > 1) Use of identifiers for input spectra 2) CV terms shared between > psi-ms and psi-pi > > In terms of 1) I've worked through Matt's argument and I'm in general > agreement that we would like to use the same system for identifying > the input spectrum - these CV terms have only been added relatively > recently. I did not realise that the nativeID attribute had been > specified to this level, since there is no documentation about this > is in the XSD or mzML specification document. > > I don't think we should change the name of the attribute though, > since nativeID makes sense for an element called <Spectrum> in mzML > but not for an element <SpectrumIdentificationResult> in analysisXML. > For referencing mzML spectra, I'm still not sure which attribute we > should choose to reference since the "true" (and guaranteed unique) > spectrum identifier in mzML is actually the ID attribute. I can > envisage a case where instruments output mzML directly and the > nativeID is not implemented sensibly. The xs:ID datatype on "ID" > guarantees that these will always be unique whatever changes happen > to documentation in the future or whatever tools are used to create > the file. I contest the term "guaranteed unique" since the one doing the guaranteeing is the schema and there is no guarantee that somebody runs their output through a schema validator. 
:) If you take the validation step to the semantic validator (which is what the standard demands), the nativeID term is also guaranteed to be unique (and must be "implemented sensibly"), and as David suggested earlier, it should be possible to add a uniqueness constraint to the nativeID attribute in the schema even though it is xsd:string (but uniqueness is not so helpful when the actual form of a Thermo RAW id must be: "controller=xsd:positiveInteger scan=xsd:nonZeroInteger"). The name of the attribute doesn't bother me, but I don't understand your reasoning for not changing it. :) > So I agree with Matt but I don't want to change the schema :-) I'm > happy to add something to the documentation specifying how different > identifiers should be implemented, following the rules in the psi-ms > CV. If the attribute name doesn't change, only the xsd documentation needs to be updated to reflect which attribute the spectrumID points to and that it can be used even if the input spectra file is not mzML! > In terms of 2), we had made a decision in the past that we would > simply create terms as we need them in PSI-PI, rather than worrying > if they should be common between psi-ms and psi-pi and trying to > coordinate updates across groups. If a term is present in psi-ms with > the exact meaning that we want (taking into account its position in > the hierarchy), I think we should just use it and update the mapping > file to reference it. Are there many terms from psi-ms that we want > to use? It's looking like scan time (aka retention time) will be useful in analysisXML as an "alternative identifier" for the special use case of converting existing search results to analysisXML where a reliable nativeID to the original vendor format has been lost. Presumably, even in this use case a nativeID could be provided to point back to a spectrum in the search engine's immediate spectra input file (i.e. MGF). 
If not even that is possible, either spectrumID has to be optional or the use case is rather suspect. :) Additionally, if your "spectrumID" attribute matches the "nativeID" attribute in mzML, the mapping file must require one of the nativeID format terms in the file header: the specific place is TBD in analysisXML, in mzML it's mapped to the fileDescription element. Remember, nativeID is always available from any input spectra file, so there's no problem requiring it as long as decent references to the input spectra are maintained. The scan time as an "alternative identifier" issue makes me wonder if a "scan time native spectrum identifier" term is called for. It still wouldn't solve all of the problems with David's use case (i.e. if the MGF was missing RTINSECONDS attributes), but it seems potentially useful. -Matt |
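The Thermo-style nativeID form quoted in this message is a space-separated list of key=value pairs. As a rough sketch only (Python; the function name parse_native_id is made up for illustration and is not part of any PSI tool or schema), such a string could be split into fields as shown below. It deliberately does not check the integer ranges mentioned in the thread, since those were still being discussed.

```python
def parse_native_id(native_id):
    """Split a 'key=value key=value' nativeID string into a dict of strings.

    Hypothetical helper for illustration; it only checks the key=value shape.
    """
    fields = {}
    for token in native_id.split():
        key, sep, value = token.partition("=")
        if not sep or not key or not value:
            raise ValueError("not a key=value token: %r" % token)
        fields[key] = value
    return fields


# Example nativeID in the Thermo-style form discussed above (values invented).
print(parse_native_id("controller=0 scan=15"))
```

Keeping the values as strings avoids committing to a particular integer constraint per key; a stricter reader could validate each field against whichever nativeID format term the file header declares.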
From: Jones, A. <And...@li...> - 2008-12-01 14:50:54
|
Hi Matt, Consider the following pipeline: mgf --> mzML --> SearchEngine --> analysisXML Having thought about this some more, I'm fairly sure that we want to reference the ID attribute rather than nativeID. The nativeID is intended to identify the source spectrum prior to conversion to mzML format, i.e. it does not strictly identify the data represented in the file. The input to analysisXML is the mzML-formatted spectrum, not the source mgf file. If we reference the nativeID, this implies that the input to the SearchEngine was the mgf representation of the spectrum. It's a minor point that makes no difference for most XML implementations but the mgf-formatted spectrum and the mzML-formatted spectrum are different objects. If a database implements this, it will be much simpler to have a chain of inputs and outputs with distinct IDs, reflecting the processing that has happened at each stage. From a database/LIMS or file tracking point of view, this could be significant I think. > If the attribute name doesn't change, only the xsd documentation needs > to be updated to reflect which attribute the spectrumID points to and > that it can be used even if the input spectra file is not mzML! Agreed, the documentation of the attribute does need to be improved. I prefer to have attribute names that reflect their relationship to the parent element; I think spectrumID is clear about what it refers to for SpectrumIdentificationResult. > Additionally, if your "spectrumID" attribute matches the "nativeID" > attribute in mzML, the mapping file must require one of the nativeID > format terms in the file header: the specific place is TBD in > analysisXML, in mzML it's mapped to the fileDescription element. > Remember, nativeID is always available from any input spectra file, so > there's no problem requiring it as long as decent references to the > input spectra are maintained. I'll take a look at the mzML mapping file and see what we need to do.
Cheers Andy |
From: Matthew C. <mat...@va...> - 2008-12-01 16:04:02
|
The nativeID is intended to refer to the closest-to-native format that can be interpreted by the machine. In your pipeline, the mgf is the closest-to-native format, so yes that nativeID would and should be preserved throughout the pipeline. Your database use case cannot use mzML ids because xsd:IDs are unique within a file, not across files. You do not have any kind of guarantee that your ids will be distinct between two mzML files, not to mention the fact that non-mzML files don't even HAVE an id. Consider the pipeline: mgf --> SearchEngine --> analysisXML What do you use for spectrumID? :) -Matt |
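For the mgf --> SearchEngine --> analysisXML case, one option raised earlier in the thread is the 0-based position of each spectrum in the MGF file. A minimal sketch in Python, assuming spectra are delimited by BEGIN IONS / END IONS lines and that TITLE is optional; the function name and the returned tuple layout are invented here for illustration and do not come from any search engine or PSI tool:

```python
def index_mgf_spectra(mgf_text):
    """Return (zero_based_index, title_or_None) for each BEGIN IONS block."""
    spectra = []
    in_block = False
    title = None
    for raw_line in mgf_text.splitlines():
        line = raw_line.strip()
        if line == "BEGIN IONS":
            in_block = True
            title = None
        elif line == "END IONS" and in_block:
            # The index is simply the count of spectra seen so far.
            spectra.append((len(spectra), title))
            in_block = False
        elif in_block and line.startswith("TITLE="):
            title = line[len("TITLE="):]
    return spectra


# Invented two-spectrum MGF fragment; the second spectrum has no TITLE.
sample = """BEGIN IONS
TITLE=first spectrum
PEPMASS=500.25
100.1 20
END IONS
BEGIN IONS
PEPMASS=612.30
150.2 35
END IONS
"""

for index, title in index_mgf_spectra(sample):
    print(index, title)
```

An index like this only means something to a reader that has the same MGF file at hand, which is the limitation already noted for a converter that works from the results file alone.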
From: Jones, A. <And...@li...> - 2008-12-01 17:16:14
|
Hi Matt, It's a question of identifying objects as they are traced throughout a process. In AnalysisXML we are not representing the spectrum object, we are making an explicit reference to the spectrum as it is represented in mzML. The nativeID is preserved, since it is referenced as the input to the process that converts mgf to mzML. > HAVE an id. Consider the pipeline: mgf --> SearchEngine --> analysisXML > What do you use for spectrumID? :) In this case, we use the identifier as specified by the "native ID" system you've defined because the input to the search was the mgf spectrum object. I agree that this system looks good and we will use it for each of the vendor-specific formats - in effect I want to add one more mapping for mzML, mapping to the mzML ID ;-) Remember, we're not trying to represent the spectrum in analysisXML, all we are saying is what spectrum did the search engine take as input. I view the conversion of an mgf spectrum to an mzML spectrum as a process that has changed the spectrum object. As such, the nativeID in mzML references the input to the conversion process and the mzML ID attribute references the (output) spectrum as it is in the file. Correct use of the identifiers maintains this trace. > Your database use case cannot use > mzML ids because xsd:IDs are unique within a file, not across files. This is true but is solved easily by prefixing all identifiers with a unique string (e.g. the file URL). The problem is worse for nativeID because this cannot be done - the mgf version of the spectrum and the mzML version of the spectrum are fundamentally different (possibly even have different precisions) so they need different identifiers. If we re-use the native identifier this implies the input to the search engine was the mgf file, which was not the case... Cheers Andy > -----Original Message----- > From: Matthew Chambers [mailto:mat...@va...] > Sent: 01 December 2008 16:03 > To: psi...@li... 
> Subject: Re: [Psidev-pi-dev] Issue 42 in psi-pi: Issues with the CV > > The nativeID is intended to refer to the closest-to-native format that > can be interpreted by the machine. In your pipeline, the mgf is the > closest-to-native format, so yes that nativeID would and should be > preserved throughout the pipeline. Your database use case cannot use > mzML ids because xsd:IDs are unique within a file, not across files. You > do not have any kind of guarantee that your ids will be distinct between > two mzML files, not to mention the fact that non-mzML files don't even > HAVE an id. Consider the pipeline: mgf --> SearchEngine --> analysisXML > What do you use for spectrumID? :) > > -Matt > > > Jones, Andy wrote: > > Hi Matt, > > > > Consider the following pipeline mgf --> mzML --> SearchEngine --> > > analysisXML > > > > Having thought about this some more, I'm fairly sure that we want to > > reference the ID attribute rather than nativeID. The nativeID is > > intended to identify the source spectrum prior to conversion to mzML > > format i.e. it does not strictly identify the data represented in the > > file. The input to analysisXML is the mzML-formatted spectrum, not > > the source mgf file. If we reference the nativeID, this implies that > > the input to the SearchEngine was the mgf representation of the > > spectrum. It's a minor point that makes no difference for most XML > > implementations but the mgf formatted spectrum and the mzML formatted > > spectrum are different objects. If a database implements this, it > > will be much simpler to have a chain of inputs and outputs with > > distinct IDs, reflecting the processing that has happened at each > > stage. From a database/LIMS or file tracking point of view, this > > could be significant I think. > > > > Cheers Andy > > > -----Original Message----- From: Matt Chambers > > > [mailto:mat...@va...] Sent: 01 December 2008 > > > 14:23 To: psi...@li... 
Subject: Re: > > > [Psidev-pi-dev] Issue 42 in psi-pi: Issues with the CV > > > > > > Jones, Andy wrote: > > >> Hi all, > > >> > > >> The issues list is getting a bit messy with essentially a mailing > > >> list discussion so I'll shift the discussion back here :-) > > >> > > >> There are two points up for discussion. > > >> > > >> 1) Use of identifiers for input spectra 2) CV terms shared > > >> between psi-ms and psi-pi > > >> > > >> In terms of 1) I've worked through Matt's argument and I'm in > > >> general agreement that we would like to use the same system for > > >> identifying the input spectrum - these CV terms have only been > > >> added relatively recently. I did not realise that the nativeID > > >> attribute had been specified to this level, since there is no > > >> documentation about this is in the XSD or mzML specification > > >> document. > > >> > > >> I don't think we should change the name of the attribute though, > > >> since nativeID makes sense for an element called <Spectrum> in > > >> mzML but not for an element <SpectrumIdentificationResult> in > > >> analysisXML. For referencing mzML spectra, I'm still not sure > > >> which attribute we should choose to reference since the "true" > > >> (and guaranteed unique) spectrum identifier in mzML is actually > > >> the ID attribute. I can envisage a case where instruments output > > >> mzML directly and the nativeID is not implemented sensibly. The > > >> xs:ID datatype on "ID" guarantees that these will always be > > >> unique whatever changes happen to documentation in the future or > > >> whatever tools are used to create the file. > > > I contest the term "guaranteed unique" since the one doing the > > > guaranteeing is the schema and there is no guarantee that somebody > > > runs their output through a schema validator. 
:) If you take the > > > validation step to the semantic validator (which is what the > > > standard demands), the nativeID term is also guaranteed to be > > > unique (and must be "implemented sensibly"), and as David suggested > > > earlier, it should be possible to add a uniqueness constraint to > > > the nativeID attribute in the schema even though it is xsd:string > > > (but uniqueness is not so helpful when the actual form of a Thermo > > > RAW id must be: "controller=xsd:positiveInteger > > > scan=xsd:nonZeroInteger"). The name of the attribute doesn't bother > > > me, but I don't understand your reasoning for not changing it. :) > > > > > > > > >> So I agree with Matt but I don't want to change the schema :-) > > >> I'm happy to add something to the documentation specifying how > > >> different identifiers should be implemented, following the rules > > >> in the psi-ms CV. > > > If the attribute name doesn't change, only the xsd documentation > > > needs to be updated to reflect which attribute the spectrumID > > > points to and that it can be used even if the input spectra file is > > > not mzML! > > > > > > > > >> In terms of 2), we had made a decision in the past that we would > > >> simply create terms as we need them in PSI-PI, rather than > > >> worrying if they should be common between psi-ms and psi-pi and > > >> trying to coordinate updates across groups. If a term is present > > >> in psi-ms with the exact meaning that we want (taking into > > >> account its position in the hierarchy), I think we should just > > >> use it and update the mapping file to reference it. Are there > > >> many terms from psi-ms that we want to use? > > > It's looking like scan time (aka retention time) will be useful in > > > analysisXML as an "alternative identifier" for the special use case > > > of converting existing search results to analysisXML where a > > > reliable nativeID to the original vendor format has been lost. 
> > > Presumably, even in this use case a nativeID could be provided to > > > point back to a spectrum in the search engine's immediate spectra > > > input file (i.e. MGF). If not even that is possible, either > > > spectrumID has to be optional or the use case is rather suspect. :) > > > > > > > > > Additionally, if your "spectrumID" attribute matches the "nativeID" > > > attribute in mzML, the mapping file must require one of the > > > nativeID format terms in the file header: the specific place is TBD > > > in analysisXML, in mzML it's mapped to the fileDescription element. > > > Remember, nativeID is always available from any input spectra > > > file, so there's no problem requiring it as long as decent > > > references to the input spectra are maintained. > > > > > > The scan time as an "alternative identifier" issue makes me wonder > > > if a "scan time native spectrum identifier" term is called for. It > > > still wouldn't solve all of the problems with David's use case > > > (i.e. if the MGF was missing RTINSECONDS attributes), but it seems > > > potentially useful. > > > > > > -Matt > > > > > > > > >> I am working on the spec document today and would like to get all > > >> issues tidied up ASAP... Cheers Andy > > >> > > >> > > >> > > >> > > >> > > >> > > >>> -----Original Message----- From: cod...@go... > > >>> [mailto:cod...@go...] Sent: 30 November 2008 > > >>> 19:36 To: psi...@li... Subject: > > >>> [Psidev-pi-dev] Issue 42 in psi-pi: Issues with the CV > > >>> > > >>> > > >>> Comment #56 on issue 42 by matthew....@vanderbilt.edu: Issues > > >>> with the CV http://code.google.com/p/psi-pi/issues/detail?id=42 > > >>> > > >>> > > >>> Yes, I was at that meeting too. :) The one (important, IMO) use > > >>> case we did not consider at that time is output of analysisXML > > >>> without a corresponding mzML document. In such a case, the > > >>> mzML arbitrary id does not exist, but the nativeID does. 
> > >>> This fact convinces me that nativeID is a better reference than the
> > >>> arbitrary id.
> > >>>
> > >>> The change of attribute name to nativeID is not so critical,
> > >>> but I think the risk of confusing the spectrumID with the id
> > >>> attribute when it actually points to the nativeID attribute is
> > >>> worse than the risk of confusing the nativeID attribute with
> > >>> some property of the search engine. I think the documentation
> > >>> for the nativeID attribute can easily make it clear what it's
> > >>> supposed to reference, especially since it's on a
> > >>> spectrum-centric element; you can copy it from the mzML schema
> > >>> (although I think this documentation could be improved upon):
> > >>> <xs:documentation>The native identifier for the spectrum, used
> > >>> by the acquisition software.</xs:documentation>
> > >>>
> > >>> It's good to know about the header information. The nativeID
> > >>> (or whatever it's called in analysisXML) format term would go
> > >>> in the spectra input definition as a CV Param required by the
> > >>> mapping file.
>
> -------------------------------------------------------------------------
> This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
> Build the coolest Linux based applications with Moblin SDK & win great prizes
> Grand prize is a trip for two to an Open Source event anywhere in the world
> http://moblin-contest.org/redirect.php?banner_id=100&url=/
> _______________________________________________
> Psidev-pi-dev mailing list
> Psi...@li...
> https://lists.sourceforge.net/lists/listinfo/psidev-pi-dev |
From: Matthew C. <mat...@va...> - 2008-12-01 17:40:14
|
As you say, the file URL in the analysisXML header must be combined with
the spectrum identifier to link back to the input spectrum. The nativeID
works no matter what the format of the input file was. The mzML id only
works for mzML input. Consider the following chain of input file
processing:

file:///raw/source1.raw
file:///processed/source1.mzXML
file:///searched/source1.analysisXML

The analysisXML header would point to file:///processed/source1.mzXML as
the input file and would use mzXML's nativeID, which is:
"scan=xsd:nonNegativeInteger"

Now substitute mzML for mzXML. The nativeID can now use the true native
format for Thermo RAW:
"controller=xsd:nonNegativeInteger scan=xsd:positiveInteger"

There is no difference in file tracking; only the fidelity of the
nativeID has improved. The nativeID works transparently in mzML: if the
analysisXML header points to an mzML file and uses the nativeID, that's
just as effective at finding a spectrum as the arbitrary id. It just so
happens that the preserved nativeID format allows machines to easily look
up the spectrum in the raw data as well as the processed data despite the
fact that - as you say - they're different "objects." As long as your
database maintains both a processing prefix/URL and the nativeID, it'll
be golden.

> If we re-use the native identifier this implies the input to the
> search engine was the mgf file, which was not the case...

As above, you must maintain an identifier for the file that the spectrum
identifier pertains to, and that identifier explicitly says what the
input to the search was. What implication is possible given that
information?

-Matt

Jones, Andy wrote:
> Hi Matt,
>
> It's a question of identifying objects as they are traced throughout
> a process. In AnalysisXML we are not representing the spectrum
> object, we are making an explicit reference to the spectrum as it is
> represented in mzML.
> The nativeID is preserved, since it is
> referenced as the input to the process that converts mgf to mzML.
>
> > HAVE an id. Consider the pipeline: mgf --> SearchEngine -->
> > analysisXML What do you use for spectrumID? :)
>
> In this case, we use the identifier as specified by the "native ID"
> system you've defined because the input to the search was the mgf
> spectrum object. I agree that this system looks good and we will use
> it for each of the vendor-specific formats - in effect I want to add
> one more mapping for mzML, mapping to the mzML ID ;-) Remember, we're
> not trying to represent the spectrum in analysisXML, all we are
> saying is what spectrum did the search engine take as input.
>
> I view the conversion of an mgf spectrum to an mzML spectrum as a
> process that has changed the spectrum object. As such, the nativeID
> in mzML references the input to the conversion process and the mzML
> ID attribute references the (output) spectrum as it is in the file.
> Correct use of the identifiers maintains this trace.
>
> > Your database use case cannot use mzML ids because xsd:IDs are
> > unique within a file, not across files.
>
> This is true but is solved easily by prefixing all identifiers with a
> unique string (e.g. the file URL). The problem is worse for nativeID
> because this cannot be done - the mgf version of the spectrum and the
> mzML version of the spectrum are fundamentally different (possibly
> even have different precisions) so they need different identifiers.
> If we re-use the native identifier this implies the input to the
> search engine was the mgf file, which was not the case...
>
> Cheers Andy
>
> > -----Original Message-----
> > From: Matthew Chambers [mailto:mat...@va...]
> > Sent: 01 December 2008 16:03
> > To: psi...@li...
> > Subject: Re: [Psidev-pi-dev] Issue 42 in psi-pi: Issues with the CV
> >
> > The nativeID is intended to refer to the closest-to-native format
> > that can be interpreted by the machine.
> > In your pipeline, the mgf
> > is the closest-to-native format, so yes that nativeID would and
> > should be preserved throughout the pipeline. Your database use
> > case cannot use mzML ids because xsd:IDs are unique within a file,
> > not across files. You do not have any kind of guarantee that your
> > ids will be distinct between two mzML files, not to mention the
> > fact that non-mzML files don't even HAVE an id. Consider the
> > pipeline: mgf --> SearchEngine --> analysisXML What do you use for
> > spectrumID? :)
> >
> > -Matt
> >
> > Jones, Andy wrote:
> >> Hi Matt,
> >>
> >> Consider the following pipeline
> >> mgf --> mzML --> SearchEngine --> analysisXML
> >>
> >> Having thought about this some more, I'm fairly sure that we want
> >> to reference the ID attribute rather than nativeID. The nativeID
> >> is intended to identify the source spectrum prior to conversion
> >> to mzML format i.e. it does not strictly identify the data
> >> represented in the file. The input to analysisXML is the
> >> mzML-formatted spectrum, not the source mgf file. If we reference
> >> the nativeID, this implies that the input to the SearchEngine was
> >> the mgf representation of the spectrum. It's a minor point that
> >> makes no difference for most XML implementations but the mgf
> >> formatted spectrum and the mzML formatted spectrum are different
> >> objects. If a database implements this, it will be much simpler
> >> to have a chain of inputs and outputs with distinct IDs,
> >> reflecting the processing that has happened at each stage. From a
> >> database/LIMS or file tracking point of view, this could be
> >> significant I think.
> >>
> >> Cheers Andy
> >>> -----Original Message-----
> >>> From: Matt Chambers [mailto:mat...@va...]
> >>> Sent: 01 December 2008 14:23
> >>> To: psi...@li...
> >>> Subject: Re: [Psidev-pi-dev] Issue 42 in psi-pi: Issues with the CV
> >>>
> >>> Jones, Andy wrote:
> >>>> Hi all,
> >>>>
> >>>> The issues list is getting a bit messy with essentially a
> >>>> mailing list discussion so I'll shift the discussion back
> >>>> here :-)
> >>>>
> >>>> There are two points up for discussion.
> >>>>
> >>>> 1) Use of identifiers for input spectra
> >>>> 2) CV terms shared between psi-ms and psi-pi
> >>>>
> >>>> In terms of 1) I've worked through Matt's argument and I'm in
> >>>> general agreement that we would like to use the same system
> >>>> for identifying the input spectrum - these CV terms have only
> >>>> been added relatively recently. I did not realise that the
> >>>> nativeID attribute had been specified to this level, since
> >>>> there is no documentation about this in the XSD or mzML
> >>>> specification document.
> >>>>
> >>>> I don't think we should change the name of the attribute
> >>>> though, since nativeID makes sense for an element called
> >>>> <Spectrum> in mzML but not for an element
> >>>> <SpectrumIdentificationResult> in analysisXML. For
> >>>> referencing mzML spectra, I'm still not sure which attribute
> >>>> we should choose to reference since the "true" (and
> >>>> guaranteed unique) spectrum identifier in mzML is actually
> >>>> the ID attribute. I can envisage a case where instruments
> >>>> output mzML directly and the nativeID is not implemented
> >>>> sensibly. The xs:ID datatype on "ID" guarantees that these
> >>>> will always be unique whatever changes happen to
> >>>> documentation in the future or whatever tools are used to
> >>>> create the file.
> >>> I contest the term "guaranteed unique" since the one doing the
> >>> guaranteeing is the schema and there is no guarantee that
> >>> somebody runs their output through a schema validator.
> >>> :) If
> >>> you take the validation step to the semantic validator (which
> >>> is what the standard demands), the nativeID term is also
> >>> guaranteed to be unique (and must be "implemented sensibly"),
> >>> and as David suggested earlier, it should be possible to add a
> >>> uniqueness constraint to the nativeID attribute in the schema
> >>> even though it is xsd:string (but uniqueness is not so helpful
> >>> when the actual form of a Thermo RAW id must be:
> >>> "controller=xsd:positiveInteger scan=xsd:nonZeroInteger"). The
> >>> name of the attribute doesn't bother me, but I don't understand
> >>> your reasoning for not changing it. :)
> >>>
> >>>> So I agree with Matt but I don't want to change the schema
> >>>> :-) I'm happy to add something to the documentation
> >>>> specifying how different identifiers should be implemented,
> >>>> following the rules in the psi-ms CV.
> >>> If the attribute name doesn't change, only the xsd
> >>> documentation needs to be updated to reflect which attribute
> >>> the spectrumID points to and that it can be used even if the
> >>> input spectra file is not mzML!
> >>>
> >>>> In terms of 2), we had made a decision in the past that we
> >>>> would simply create terms as we need them in PSI-PI, rather
> >>>> than worrying if they should be common between psi-ms and
> >>>> psi-pi and trying to coordinate updates across groups. If a
> >>>> term is present in psi-ms with the exact meaning that we want
> >>>> (taking into account its position in the hierarchy), I think
> >>>> we should just use it and update the mapping file to
> >>>> reference it. Are there many terms from psi-ms that we want
> >>>> to use?
> >>> It's looking like scan time (aka retention time) will be useful
> >>> in analysisXML as an "alternative identifier" for the special
> >>> use case of converting existing search results to analysisXML
> >>> where a reliable nativeID to the original vendor format has
> >>> been lost.
> >>> Presumably, even in this use case a nativeID could
> >>> be provided to point back to a spectrum in the search engine's
> >>> immediate spectra input file (i.e. MGF). If not even that is
> >>> possible, either spectrumID has to be optional or the use case
> >>> is rather suspect. :)
> >>>
> >>> Additionally, if your "spectrumID" attribute matches the
> >>> "nativeID" attribute in mzML, the mapping file must require one
> >>> of the nativeID format terms in the file header: the specific
> >>> place is TBD in analysisXML, in mzML it's mapped to the
> >>> fileDescription element. Remember, nativeID is always available
> >>> from any input spectra file, so there's no problem requiring it
> >>> as long as decent references to the input spectra are
> >>> maintained.
> >>>
> >>> The scan time as an "alternative identifier" issue makes me
> >>> wonder if a "scan time native spectrum identifier" term is
> >>> called for. It still wouldn't solve all of the problems with
> >>> David's use case (i.e. if the MGF was missing RTINSECONDS
> >>> attributes), but it seems potentially useful.
> >>>
> >>> -Matt
> >>>
> >>>> I am working on the spec document today and would like to get
> >>>> all issues tidied up ASAP...
> >>>> Cheers Andy
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: cod...@go... [mailto:cod...@go...]
> >>>>> Sent: 30 November 2008 19:36
> >>>>> To: psi...@li...
> >>>>> Subject: [Psidev-pi-dev] Issue 42 in psi-pi: Issues with the CV
> >>>>>
> >>>>> Comment #56 on issue 42 by matthew....@vanderbilt.edu:
> >>>>> Issues with the CV
> >>>>> http://code.google.com/p/psi-pi/issues/detail?id=42
> >>>>>
> >>>>> Yes, I was at that meeting too. :) The one (important, IMO)
> >>>>> use case we did not consider at that time is output of
> >>>>> analysisXML without a corresponding mzML document. In such
> >>>>> a case, the mzML arbitrary id does not exist, but the
> >>>>> nativeID does.
> >>>>> This fact convinces me that nativeID is a
> >>>>> better reference than the arbitrary id.
> >>>>>
> >>>>> The change of attribute name to nativeID is not so
> >>>>> critical, but I think the risk of confusing the spectrumID
> >>>>> with the id attribute when it actually points to the
> >>>>> nativeID attribute is worse than the risk of confusing the
> >>>>> nativeID attribute with some property of the search engine.
> >>>>> I think the documentation for the nativeID attribute can
> >>>>> easily make it clear what it's supposed to reference,
> >>>>> especially since it's on a spectrum-centric element; you
> >>>>> can copy it from the mzML schema (although I think this
> >>>>> documentation could be improved upon):
> >>>>> <xs:documentation>The native identifier for the spectrum,
> >>>>> used by the acquisition software.</xs:documentation>
> >>>>>
> >>>>> It's good to know about the header information. The
> >>>>> nativeID (or whatever it's called in analysisXML) format
> >>>>> term would go in the spectra input definition as a CV Param
> >>>>> required by the mapping file.
|
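The linkage argued for in the message above (input file URL from the analysisXML header, plus a nativeID in a declared format) can be sketched roughly as follows. This is only an illustration under stated assumptions: the names `NATIVE_ID_FORMATS`, `SpectrumRef`, `parse_native_id` and `make_ref` are hypothetical and belong to no PSI schema or library, and the two patterns are simplified readings of the formats quoted in the message ("scan=xsd:nonNegativeInteger" for mzXML, "controller=xsd:nonNegativeInteger scan=xsd:positiveInteger" for Thermo RAW) that ignore optional signs.

```python
import re
from typing import Dict, NamedTuple

# Hypothetical format table. The patterns are a simplified reading of the two
# nativeID formats quoted in the message; optional "+" signs are not handled.
NATIVE_ID_FORMATS = {
    "mzXML": re.compile(r"scan=(?P<scan>[0-9]+)"),
    "Thermo RAW": re.compile(
        r"controller=(?P<controller>[0-9]+) scan=(?P<scan>0*[1-9][0-9]*)"
    ),
}


class SpectrumRef(NamedTuple):
    """Input file URL plus nativeID: the pair that locates one spectrum."""
    file_url: str
    native_id: str


def parse_native_id(fmt: str, native_id: str) -> Dict[str, int]:
    """Split a nativeID into integer fields, rejecting malformed strings."""
    match = NATIVE_ID_FORMATS[fmt].fullmatch(native_id)
    if match is None:
        raise ValueError(f"{native_id!r} is not a valid {fmt} nativeID")
    return {name: int(value) for name, value in match.groupdict().items()}


def make_ref(file_url: str, fmt: str, native_id: str) -> SpectrumRef:
    """Validate the nativeID, then pair it with the file it pertains to."""
    parse_native_id(fmt, native_id)
    return SpectrumRef(file_url, native_id)


if __name__ == "__main__":
    a = make_ref("file:///processed/source1.mzXML", "mzXML", "scan=7")
    b = make_ref("file:///processed/source2.mzXML", "mzXML", "scan=7")
    # Same nativeID, different input files: the references stay distinct.
    print(a != b)
    print(parse_native_id("Thermo RAW", "controller=0 scan=42"))
```

Keying on the (file URL, nativeID) pair rather than on the nativeID alone is what keeps identical scan numbers from different input files apart, which is the point made about maintaining a processing prefix/URL alongside the nativeID.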