|
From: Matthew C. <mat...@va...> - 2007-06-19 13:40:13
|
I really like the new name for the format. I read the meeting minutes and such and it seems like you all have gotten the best of both formats (and then some) into the new format; I will be happy to see mzData and mzXML go away. But one thing I've missed a lot in mzData (even though I think it's a better format because of the flat spectra list) is an index to quickly access a given scan number. I know indexes in XML are not pretty or simple, but I really think having one is the difference between having to load the whole file into memory (or continually parse the file to find the desired scan number(s)) and merely jumping right to the correct point in the file to start parsing. For spectra files which are a hundred megabytes or more, and especially when reading them over a network drive, that's a very bad proposition. On a related note, is there any guarantee in mzML (or mzData for that matter) that the spectrum IDs or scan numbers are given in ascending order? The latter guarantee would at least make the absence of an index more tolerable when looking for some range of scan numbers. Thanks, Matt Chambers |
|
From: Mike C. <tu...@gm...> - 2007-06-19 15:30:17
|
On 6/19/07, Matthew Chambers <mat...@va...> wrote: > On a related note, is there any guarantee in mzML (or mzData > for that matter) that the spectrum IDs or scan numbers are given in > ascending order? This is a good question. I haven't read the spec closely, but if the answer isn't in there, it ought to be. Along those lines, are IDs and scan numbers even guaranteed to be unique within a file? (I hope the answer will be "yes".) > But one thing I've missed a lot in mzData (even though I think it's a > better format because of the flat spectra list) is an index to quickly > access a given scan number. I'm torn on this myself. On the one hand, adding *any* redundant information seems to go against the basic idea of just representing the experimental data. On the other hand, it *would* make some operations more convenient. Random access reads become easier, altering the file becomes harder, and something like XSLT transformations probably become impossible (I'm not an XSLT fan anyway). One point to consider: do we think that all of the various producers (and transformers) of these files will be capable of producing correct (bug-free) indices? If they're not *always* correct, or if you have to validate the file before you trust it, you're basically having to recreate the index anyway. If that's so, maybe it should just be left out of the mzML file altogether. It looks like indices are currently stored in a separate, optional file. This seems like a good compromise. It's worth noting that these arguments also apply to the other redundant information in the file (counts and checksums, for example). I wouldn't mind seeing those also moved to a separate file. If they're left in, maybe something should be said about what's supposed to happen when the redundant information is inconsistent. Mike |
|
From: Matthew C. <mat...@va...> - 2007-06-19 15:46:25
|
> On 6/19/07, Matthew Chambers <mat...@va...> wrote: > > On a related note, is there any guarantee in mzML (or mzData > > for that matter) that the spectrum IDs or scan numbers are given in > > ascending order? > > This is a good question. I haven't read the spec closely, but if the > answer isn't in there, it ought to be. Along those lines, are IDs and > scan numbers even guaranteed to be unique within a file? (I hope the > answer will be "yes".) > I think IDs are definitely unique within a file, and scan numbers will almost always be unique within a spectra source (multiple spectra sources can be in a single mzML file though). In our software, I use "SourceFileName.ScanNum.ChargeState" as a unique identifier so that spectra from multiple sources can be loaded into the same data structure. > > But one thing I've missed a lot in mzData (even though I think it's a > > better format because of the flat spectra list) is an index to quickly > > access a given scan number. > > I'm torn on this myself. On the one hand, adding *any* redundant > information seems to go against the basic idea of just representing > the experimental data. On the other hand, it *would* make some > operations more convenient. Random access reads become easier, > altering the file becomes harder, and something like XSLT > transformations probably become impossible (I'm not an XSLT fan > anyway). I'm not sure an index counts as redundant. It's more like metadata (i.e. I don't think you could call the index at the end of a textbook redundant!). We already store plenty of metadata, because otherwise we'd have real trouble reinterpreting the data's meaning. In fact, XML is by definition loaded with metadata. :) Random access reads wouldn't just become easier - ease of coding is not the issue to me. I just want random access to not be a computational nightmare due to an excess of XML parsing. 
> One point to consider: do we think that all of the various producers
> (and transformers) of these files will be capable of producing correct
> (bug-free) indices? If they're not *always* correct, or if you have
> to validate the file before you trust it, you're basically having to
> recreate the index anyway. If that's so, maybe it should just be left
> out of the mzML file altogether.
>
> It looks like indices are currently stored in a separate, optional
> file. This seems like a good compromise.

From what I've read of the minutes, and I may not have gotten the full picture on this discussion, the issue of indexes is set aside at the moment. I have to say that I dislike the idea of having a separate index on principle, but if it would really make consistency with the main file more feasible, then it's an acceptable compromise. For example, the optional index file could store a SHA-1 hash of the main mzML file. Software could test whether to trust the optional index by whether its stored hash matches a new hash on the main file. I don't think it's unreasonable to (re)generate the index whenever the main file is altered (and of course store the new hash).

The optional index file method also means that the software doesn't have to skip to the end of the file to read the index.

Unfortunately, indexing is just a necessary evil in a world of large datasets.

-Matt Chambers |
|
From: Eric D. <ede...@sy...> - 2007-06-19 18:01:14
|
Hi everyone, thank you for the good discussion. Here is what I take away from it (colored with my understanding of the prevailing opinion on the various topics):

- Separate index/metadata files will be avoided
- The mzML index will be *optional* as a wrapper schema (with actual index at the end of the file) as currently in mzXML
- The validator will enforce that scan numbers are in ascending order, but not necessarily without gaps
- The validator will enforce that scan numbers and identifiers must be unique within a run (but there could be multiple runs in a file)
- Regarding *always* correct indexes, users of mzXML have been using indexes for years with no reports of problems that I'm aware of. Obviously if the file is altered in any way, the index should be regenerated. There are (for mzXML) / will be (for mzML) index checkers to make sure all is well along with reindexing functionality if the index is bad.
- It should be a requirement for any reading software that uses the index (all readers are required to be tolerant of the presence of the wrapper schema index, but are not required to use it) to do some basic checking that the result is correct. E.g. if scan number 17500 is desired and the index is used to jump to that location, it is straightforward and necessary to ensure that the first tag read is indeed <spectrum scan_number="17500">. If it is not, the software is free to do anything except continue as if it didn't know better (e.g., stop with error, revert to sequential read, or try to regenerate the index and retry).
- While index/data mismatch is a potential source of problem, it has been our experience that problems are rare and the benefits huge.

Regards,
Eric |
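The reader-side check described in the summary above (jump to the indexed offset, confirm the first tag, otherwise fall back) can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the index is taken to be a plain dict of scan number to byte offset, and the `scan_number` attribute follows the example tag in the message rather than any finalized schema. The fallback reads the whole stream for brevity, which a real reader of large files would avoid.

```python
import io
import re


def read_spectrum_at(stream, offset, scan_number):
    """Seek to an indexed offset and confirm the tag found there matches."""
    stream.seek(offset)
    head = stream.read(256)  # assumed to be enough bytes for the opening tag
    match = re.match(rb'<spectrum\b[^>]*\bscan_number="(\d+)"', head)
    if match is None or int(match.group(1)) != scan_number:
        raise ValueError("index entry does not point at the requested scan")
    return offset


def find_spectrum(stream, index, scan_number):
    """Trust the index only after checking it; otherwise scan sequentially."""
    try:
        return read_spectrum_at(stream, index[scan_number], scan_number)
    except (KeyError, ValueError):
        # Fallback: sequential search (reads everything, for brevity only).
        stream.seek(0)
        data = stream.read()
        pattern = rb'<spectrum\b[^>]*\bscan_number="%d"' % scan_number
        match = re.search(pattern, data)
        if match is None:
            raise LookupError("scan %d not found" % scan_number)
        return match.start()
```

Stopping with an error instead of falling back would be equally consistent with the summary; the point is only that a mismatch is never silently ignored.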
|
From: Matthew C. <mat...@va...> - 2007-06-19 18:27:22
|
> Hi everyone, thank you for the good discussion, here is what I take away > from this discussion (colored with my understanding of the prevailing > opinion on the various topics): > > - Separate index/metadata files will be avoided > - The mzML index will be *optional* as a wrapper schema (with actual > index at the end of the file) as currently in mzXML This works either way for me. It makes it tricky to get to the index, but it's not any trickier than managing two files, especially if one is optional. However, unless I am missing something, is it not much harder to use a hash to check if the index is valid (i.e. that the main file has not been altered) if the index is included in the main file? Even if the hash is written as the last thing, that change alone would cause the next hash to be different, would it not? > - The validator will enforce that scan numbers are in ascending order, > but not necessarily without gaps > - The validator will enforce that scan numbers and identifiers must be > unique within a run (but there could be multiple runs in a file) I'm confused about the difference between identifiers and scan numbers. Since a mzML file can have more than one spectra source (e.g. multiple RAW files), scan numbers could only be unique within a run, as you say, but I would expect that the "SpectrumID" identifier, if it is different from the scan number, should be unique to the whole file. What is the reasoning behind the SpectrumID identifier being unique only to a run, or am I misunderstanding? What is the purpose of having a separate SpectrumID identifier anyway? > - Regarding *always* correct indexes, users of mzXML have been using > indexes for years with no reports of problems that I'm aware. Obviously > if the file is altered in any way, the index should be regenerated. > There are (for mzXML) / will be (for mzML) index checkers to make sure > all is well along with reindexing functionality if the index is bad. 
> - It should be a requirement for any reading software that uses the
> index (all readers are required to be tolerant of the presence of the
> wrapper schema index, but are not required to use it) to do some basic
> checking that the result is correct. E.g. if scan number 17500 is
> desired and the index is used to jump to that location, it is
> straightforward and necessary to ensure that the first tag read is
> indeed <spectrum scan_number="17500">. If it is not, the software is
> free to do anything except continue as if it didn't know better (e.g.,
> stop with error, revert to sequential read, or try to regenerate the
> index and retry).

This is complicated by multiple sources being in a single mzML file. Will the index follow the same structure as the main section so that when you are looking for a scan number in some source, you first traverse into the source, and then look for the indexed spectrum to get its offset?

    <source name="someSourceName">
      <spectrum scan="15">
        ...
      </spectrum>
    </source>
    <index>
      <indexedSource name="someSourceName" offset="0">
        <indexedSpectrum scan="15" offset="33">
          ...
        </indexedSpectrum>
      </indexedSource>
    </index>

> - While index/data mismatch is a potential source of problem, it has
> been our experience that problems are rare and the benefits huge.

Agreed.

Regards,
Matt Chambers |
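To make the nested-index idea above concrete, here is a minimal sketch of how a reader might turn such an index into a lookup keyed by source name and scan number. The element and attribute names (`indexedSource`, `indexedSpectrum`, `scan`, `offset`) are taken from the example in the message, which is itself only a proposal; the function name and the second source in the sample data are hypothetical.

```python
import xml.etree.ElementTree as ET

# Sample index in the shape proposed in the message (not a finalized schema).
INDEX_XML = """
<index>
  <indexedSource name="someSourceName" offset="0">
    <indexedSpectrum scan="15" offset="33"/>
  </indexedSource>
  <indexedSource name="otherSourceName" offset="500">
    <indexedSpectrum scan="15" offset="533"/>
  </indexedSource>
</index>
"""


def load_index(index_xml):
    """Map (source name, scan number) to the spectrum's byte offset."""
    root = ET.fromstring(index_xml)
    offsets = {}
    for source in root.findall("indexedSource"):
        name = source.get("name")
        for spectrum in source.findall("indexedSpectrum"):
            key = (name, int(spectrum.get("scan")))
            offsets[key] = int(spectrum.get("offset"))
    return offsets


offsets = load_index(INDEX_XML)
```

Keying on the pair keeps scan 15 of one source distinct from scan 15 of another, which is the ambiguity the message raises.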
|
From: Eric D. <ede...@sy...> - 2007-06-20 06:07:13
|
> From: psi...@li... [mailto:psidev-ms-dev-
>
> > Hi everyone, thank you for the good discussion, here is what I take away
> > from this discussion (colored with my understanding of the prevailing
> > opinion on the various topics):
> >
> > - Separate index/metadata files will be avoided
> > - The mzML index will be *optional* as a wrapper schema (with actual
> > index at the end of the file) as currently in mzXML
>
> This works either way for me. It makes it tricky to get to the index, but
> it's not any trickier than managing two files, especially if one is
> optional. However, unless I am missing something, is it not much harder to
> use a hash to check if the index is valid (i.e. that the main file has not
> been altered) if the index is included in the main file? Even if the hash
> is written as the last thing, that change alone would cause the next hash to
> be different, would it not?

I believe that the file/index integrity checker for mzXML computes the checksum only until the start of the index. So the index is not part of the checksum and the checksum is part of the index, so there's no conflict. It does mean that a checksum computed for the entire file will not match the checksum of the non-index part. So this is slightly harder in that one can't use an OS binary to compute the checksum, but rather one needs a custom program that is aware of this arrangement. This already exists for mzXML and will be easy to port for mzML.

> > - The validator will enforce that scan numbers are in ascending order,
> > but not necessarily without gaps
> > - The validator will enforce that scan numbers and identifiers must be
> > unique within a run (but there could be multiple runs in a file)
>
> I'm confused about the difference between identifiers and scan numbers.
> Since a mzML file can have more than one spectra source (e.g. multiple RAW
> files), scan numbers could only be unique within a run, as you say, but I
> would expect that the "SpectrumID" identifier, if it is different from the
> scan number, should be unique to the whole file. What is the reasoning

You are correct, my error.

> behind the SpectrumID identifier being unique only to a run, or am I
> misunderstanding? What is the purpose of having a separate SpectrumID
> identifier anyway?

To allow LSIDs for individual spectra or some other non-integer IDs if desired.

> > - Regarding *always* correct indexes, users of mzXML have been using
> > indexes for years with no reports of problems that I'm aware. Obviously
> > if the file is altered in any way, the index should be regenerated.
> > There are (for mzXML) / will be (for mzML) index checkers to make sure
> > all is well along with reindexing functionality if the index is bad.
> > - It should be a requirement for any reading software that uses the
> > index (all readers are required to be tolerant of the presence of the
> > wrapper schema index, but are not required to use it) to do some basic
> > checking that the result is correct. E.g. if scan number 17500 is
> > desired and the index is used to jump to that location, it is
> > straightforward and necessary to ensure that the first tag read is
> > indeed <spectrum scan_number="17500">. If it is not, the software is
> > free to do anything except continue as if it didn't know better (e.g.,
> > stop with error, revert to sequential read, or try to regenerate the
> > index and retry).
>
> This is complicated by multiple sources being in a single mzML file. Will
> the index follow the same structure as the main section so that when you are
> looking for a scan number in some source, you first traverse into the
> source, and then look for the indexed spectrum to get its offset?
>
> <source name="someSourceName">
>   <spectrum scan="15">
>     ...
>   </spectrum>
> </source>
> <index>
>   <indexedSource name="someSourceName" offset="0">
>     <indexedSpectrum scan="15" offset="33">
>       ...
>     </indexedSpectrum>
>   </indexedSource>
> </index>

This is a very good point that has slipped notice, I think. Thanks for pointing it out, we should think about this more carefully.

> > - While index/data mismatch is a potential source of problem, it has
> > been our experience that problems are rare and the benefits huge.
>
> Agreed.
>
> Regards,
> Matt Chambers |
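The checksum arrangement described in this reply (hash everything before the index, store the hash inside the index) can be sketched roughly as below. This is not the mzXML checker itself: the function name is hypothetical, the `<index>` tag is borrowed from the example in the thread, and SHA-1 is used only because it was the hash suggested earlier in the discussion.

```python
import hashlib


def checksum_before_index(data, index_tag=b"<index>"):
    """Hash only the bytes that precede the trailing index block."""
    start = data.find(index_tag)
    if start == -1:
        start = len(data)  # no index present: hash the whole file
    return hashlib.sha1(data[:start]).hexdigest()


body = b'<mzML><spectrum scan_number="1"/></mzML>'
with_index = body + b"<index><offset>6</offset></index>"

# Rewriting the index (including a checksum stored inside it) leaves the
# digest unchanged, because the index bytes are never hashed.
digest = checksum_before_index(with_index)
```

As the reply notes, the result will not match what a generic OS checksum tool reports for the whole file, so both writer and reader need to agree on this convention.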
|
From: Mike C. <tu...@gm...> - 2007-06-19 19:29:43
|
On 6/19/07, Eric Deutsch <ede...@sy...> wrote:
> - While index/data mismatch is a potential source of problem, it has
> been our experience that problems are rare and the benefits huge.

Just to be clear, I'm not arguing against indexing in general (which would be silly), but rather just questioning whether it makes sense to include indices in (or alongside) mzML files.

From a programming perspective, this seems like an implementation detail. One can imagine that many consumers of these files either have no use for an index or else are easily capable of simply generating an index of their own. Furthermore, applications will often have more information about the specific sort of index that would be best.

As you note, if an index is included in the mzML file it can be checked for sanity. And, in fact, proper engineering requires this. If a program generates the index itself, it can afford to be somewhat trusting, but if the index is generated elsewhere, it really needs to be quite paranoid, which requires extra code.

So the worry would be that this feature, which is intended to make life simpler, might end up actually making things more difficult for implementers (both producers and consumers) and bloat the mzML files to boot.

Mike |
|
From: Joshua T. <jt...@sy...> - 2007-06-19 19:42:14
|
Hi Mike,

Not to throw any coals on the fire, but just contributing my experience as a software developer here in the Aebersold Lab (ISB):

All of our mzXML processing software (the TPP, etc.) relies on pre-computed indexes. Yes, it's a matter of trust, but we find it much more efficient to calculate this index once, at the time the mzXML file is created. As Eric has mentioned, in practice we don't find many (any?) errors with this, and the mzXML files actually store a checksum of themselves up to the index, which can be used to give some assurance that the index data is appropriate (I'm sure XML purists are groaning, but it works.)

You may be coming from a much more stringent background, such as trying to provide regulatory compliance. As far as I can tell from today's discussions, the index will be optional, and your more stringent programs will be free to generate index-less files, ignore previously generated indexes, rewrite them to a new file, or recalculate them yourself in your own programs.

I still don't think text files are the best way to store large binary arrays (versus, for example, the netCDF format), but we've found XML with indexes to be a reasonable and useful compromise for keeping all the data in one human-readable file.

Hope this helps,
Josh |
|
From: Brian P. <bri...@in...> - 2007-06-20 15:28:04
|
Note also that the RAMP library (which will presumably be extended to deal with the new format) automagically deals with missing or broken indexes. You can use its index API and it will just generate an index if needed.

Funny how this topic flares up every few months. Patrick's original move back in mzXML to make it an optional wrapper schema was a brilliant compromise - just don't use the index if it offends you.

Brian Pratt
Insilicos |
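Regenerating an index when one is missing or broken, as mentioned above, amounts to one sequential pass that records where each spectrum starts. The sketch below is a generic illustration and is not RAMP's API: the function name is hypothetical, and the `scan_number` attribute follows the example tag used earlier in the thread.

```python
import re

# Opening tag of a spectrum, capturing its scan number.
SPECTRUM_TAG = re.compile(rb'<spectrum\b[^>]*\bscan_number="(\d+)"')


def build_index(data):
    """Map each scan number to the byte offset of its opening tag."""
    return {int(m.group(1)): m.start() for m in SPECTRUM_TAG.finditer(data)}
```

A production reader would stream the file rather than hold it in memory, but the resulting mapping is the same.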
|
From: Mike C. <tu...@gm...> - 2007-06-20 19:40:21
|
On 6/20/07, Brian Pratt <bri...@in...> wrote: > Funny how this topic flares up every few months. "The most exciting phrase to hear in science, the one that heralds new discoveries, is not 'Eureka!' but 'That's funny...'." - Isaac Asimov :-) |
|
From: Randy J. <rkj...@in...> - 2007-06-21 15:36:49
Attachments:
MRM_proposal.mzML0.91.xml
mzML0.91a.xsd
|
At ASMS several people talked to me about MRM representation in mzML. Looking at the schema, it appears that there is a way to do this using the current elements - but maybe not the way we originally thought of using them. For some time now I have been encoding MRM experiments in mzData 1.05 by using the supplemental data vector combined with the intensity element to store transitions as chromatograms. This is not documented in the specification, but if you leave the MZ vector empty, fill the intensity array with the Y-axis and then put the time axis in a supplemental data vector, it is pretty easy to parse. In the proposed mzData 1.1, I replaced this hack with an actual chromatogram element, but this too is problematic and did not make the cut for mzML. The problem stems from the fact that each MS/MS transition in an MRM-type measurement is a unique experiment and needs to be described as fully as possible. Even though we usually view (and even store in the native file) a set of transitions as a 'spectrum' they are really histograms with complex annotation on the 'x-axis'. I would like to suggest that we use the parameterGroup to store the details of each transition and then reference these within the binary vector as allowed by the 0.91 schema. It means that there is no x-axis for the 'spectrum', so we will probably want to define a way of recognizing this representation. 
For a quantitation experiment of 5 analytes the paramGroupList might look like this:

<paramGroupList>
  <paramGroup id="MRMSettings">
    <cvParam cvLabel="PSI-MS" accession="PSI-MS:1000036" name="ScanMode" value="Selected Reaction Monitoring"/>
    <cvParam cvLabel="PSI-MS" accession="PSI-MS:1000044" name="Activation Method" value="CID"/>
    <cvParam cvLabel="PSI-MS" accession="PSI-MS:1000037" name="Collision Energy" value="25"/>
    <cvParam cvLabel="PSI-MS" accession="PSI-MS:1000037" name="Collision Energy Units" value="eV"/>
  </paramGroup>
  <paramGroup id="Transition_1">
    <cvParam cvLabel="PSI-MS" accession="PSI-MS:1000037" name="Polarity" value="positive"/>
    <cvParam cvLabel="PSI-MS" accession="PSI-MS:1000340" name="Precursor Ion" value="289.5"/>
    <cvParam cvLabel="PSI-MS" accession="PSI-MS:1000342" name="Product Ion" value="97.2"/>
  </paramGroup>
  <paramGroup id="Transition_2">
    <cvParam cvLabel="PSI-MS" accession="PSI-MS:1000037" name="Polarity" value="positive"/>
    <cvParam cvLabel="PSI-MS" accession="PSI-MS:1000340" name="Precursor Ion" value="287.2"/>
    <cvParam cvLabel="PSI-MS" accession="PSI-MS:1000342" name="Product Ion" value="97.0"/>
  </paramGroup>
  <paramGroup id="Transition_3">
    <cvParam cvLabel="PSI-MS" accession="PSI-MS:1000037" name="Polarity" value="positive"/>
    <cvParam cvLabel="PSI-MS" accession="PSI-MS:1000340" name="Precursor Ion" value="195.5"/>
    <cvParam cvLabel="PSI-MS" accession="PSI-MS:1000342" name="Product Ion" value="138.1"/>
  </paramGroup>
  <paramGroup id="Transition_4">
    <cvParam cvLabel="PSI-MS" accession="PSI-MS:1000037" name="Polarity" value="negative"/>
    <cvParam cvLabel="PSI-MS" accession="PSI-MS:1000340" name="Precursor Ion" value="205.0"/>
    <cvParam cvLabel="PSI-MS" accession="PSI-MS:1000342" name="Product Ion" value="161.0"/>
  </paramGroup>
  <paramGroup id="Transition_5">
    <cvParam cvLabel="PSI-MS" accession="PSI-MS:1000037" name="Polarity" value="negative"/>
    <cvParam cvLabel="PSI-MS" accession="PSI-MS:1000340" name="Precursor Ion" value="269.0"/>
    <cvParam cvLabel="PSI-MS" accession="PSI-MS:1000342" name="Product Ion" value="145.2"/>
  </paramGroup>
</paramGroupList>

Different instrument types could store different representations of the MRM settings (including MS^n descriptions using cvParams). For high MRM count experiments (hundreds or thousands of analytes) you could group parameters to further reduce replication. An MRM acquisition could then look like this:

<spectrum id="1" scanNumber="1">
  <spectrumHeader>
    <cvParam cvLabel="PSI-MS" accession="PSI-MS:1000038" name="Time" value="0.0"/>
    <cvParam cvLabel="PSI-MS" accession="PSI-MS:1000038" name="Minutes" value=""/>
    <acquisitionList spectrumType="histogram" methodOfCombination="none" count="1">
      <acquisition acqNumber="1" spectrumRef="" sourceFileRef=""/>
    </acquisitionList>
    <instrumentSettings instrumentRef="TSQ Quantum Ultra">
      <paramGroupRef ref="MRMSettings"/>
    </instrumentSettings>
  </spectrumHeader>
  <binaryData precision="32" compressionType="none" length="3" encodedLength="16">
    <cvParam cvLabel="PSI-MS" accession="PSI-MS:9999920" name="DataArrayContentType" value="histogram"/>
    <paramGroupRef ref="Transition_1"/>
    <paramGroupRef ref="Transition_2"/>
    <paramGroupRef ref="Transition_3"/>
    <binary>mMmXQybwI0QuW01E</binary>
  </binaryData>
</spectrum>

The DataArrayContentType suggested in the 'tiny' examples the group developed could be used to indicate the meaning of the paramGroupRefs. The only change to the schema would be to order the sequence in binaryData so that the cvParam could come before the paramGroupRef (or so that the order is not checked; sequences are ordered...). I've attached the small change in the element ordering needed to validate the example.
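To make the "annotated histogram" reading concrete, here is a minimal sketch of how a consumer might decode the binary element from the example and pair each value with the paramGroupRef in the same position. The function name is hypothetical, and little-endian byte order is an assumption: the example states precision="32" but gives no byte order.

```python
import base64
import struct


def decode_histogram(b64_text, transition_refs, precision=32, byte_order="<"):
    """Pair each decoded intensity with the paramGroupRef at the same position.

    byte_order="<" (little-endian) is an assumption; the example does not say.
    """
    raw = base64.b64decode(b64_text)
    code = "f" if precision == 32 else "d"
    count = len(raw) // struct.calcsize(byte_order + code)
    values = struct.unpack(f"{byte_order}{count}{code}", raw)
    if count != len(transition_refs):
        raise ValueError("number of values does not match number of paramGroupRefs")
    return dict(zip(transition_refs, values))


# Values taken from the <binaryData> element in the example above.
refs = ["Transition_1", "Transition_2", "Transition_3"]
intensities = decode_histogram("mMmXQybwI0QuW01E", refs)

for ref, value in intensities.items():
    print(ref, round(value, 2))
```

Under the little-endian assumption the three values come out at roughly 303.6, 655.8 and 821.4, one per referenced transition. The length check mirrors the length="3" attribute: a reader has no x-axis to fall back on, so a count mismatch between values and references has to be treated as an error.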
I've attached the full example file (edited from the examples on the PSI site and validated with the attached schema) which shows time-dependent changes in which transitions are being monitored. To obtain the MRM chromatogram on which to perform peak picking, etc., you would plot the intensity against time for the specific transition (which is what we do in the fixed MRM drug quantitation experiments). Any thoughts on the use of the paramGroupRef in this fashion, or the idea of creating a new data type which is essentially an annotated histogram? Thanks, Randy Randall K Julian, Jr. Ph.D. CEO Indigo BioSystems (317) 536-2736 x101 (317) 306-5447 mobile www.indigobio.com |
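The chromatogram reconstruction described in this message (intensity against time for one transition, collected across acquisitions) might be sketched as below. The data layout, the function name, and all numeric values are made up for illustration; they are not taken from the attached example file.

```python
from collections import defaultdict


def extract_chromatograms(acquisitions):
    """Collect (time, intensity) points per transition across all acquisitions.

    Each acquisition is a (time_in_minutes, {transition_ref: intensity}) pair,
    so the set of monitored transitions may change over time.
    """
    chromatograms = defaultdict(list)
    for time, by_transition in sorted(acquisitions, key=lambda item: item[0]):
        for ref, intensity in by_transition.items():
            chromatograms[ref].append((time, intensity))
    return dict(chromatograms)


# Hypothetical acquisitions: Transition_4 replaces Transition_1 partway through.
acquisitions = [
    (0.00, {"Transition_1": 300.0, "Transition_2": 650.0}),
    (0.05, {"Transition_1": 420.0, "Transition_2": 610.0}),
    (0.10, {"Transition_2": 580.0, "Transition_4": 95.0}),
]

chromatograms = extract_chromatograms(acquisitions)
for ref, points in chromatograms.items():
    print(ref, points)
```

Because each point carries its own time stamp, a transition that is only monitored during part of the run simply yields a shorter trace, which is the time-dependent case the full example file is said to cover.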