### Email Archive: psidev-ms-dev (read-only)

[Psidev-ms-dev] mzML comments From: Wilfred H Tang - 2008-08-27 05:58
```
* When m/z vs. intensity data is written out in profile mode, it is
  pretty common to see a LARGE majority of the intensities to be zero.
  Given the preponderance of zero intensities, a space-efficient way to
  write the data out would be to specify a point spacing in the m/z
  dimension and then write out a (m/z, intensity) pair only if the
  intensity is non-zero. (Call this method 1.) The alternative, less
  space-efficient way would be to write out all of the (m/z, intensity)
  data pairs even though most of them have zero intensities and hence
  are not all that interesting. (Call this method 2.) For method 1 to
  work well, there must be a way to specify a m/z point spacing. Is
  there a way to do this currently? Furthermore, the program reading in
  the mzML must understand that the m/z point spacing implicitly
  requires reconstruction of all the zero-intensity data pairs;
  otherwise, for example, a mass spectrum plot would look funny. A
  further complication for method 1 is that the m/z point spacing may
  not necessarily be a constant. For example, for the AB/Sciex QSTAR
  instrument, the m/z spacing is proportional to the square root of m/z,
  and this is a natural consequence of this being a TOF instrument.

* The element should accept as a subelement. Banning as a subelement
  should not lead to also banning as a subelement.

* The validator expects elements to appear in a certain order. This is
  due to the usage of xs:sequence in the XSD file. All deviations from
  the specified order are marked as errors, and I don't think that this
  is really the desired behavior. There's nothing intrinsic to XML that
  makes restricting order desirable, and in most cases for mzML, there
  is absolutely nothing to be gained by restricting order.

* The validator doesn't appear to recognize at all - i.e., any time is
  put into the mzML, the validator gives an error. This may possibly be
  related to the previous point, but I tried putting in all possible
  locations, and nothing seemed to work.

* For the element, the cvParam mapping rule "MUST supply a *child* term
  of MS:1000561 (data file checksum type) one or more times" should be
  deleted. The checksum of the SOURCE data file seems to be completely
  irrelevant.

* There is a mistake somewhere in the rules regarding the specification
  of mass analyzer. There are numerous instrument types that have
  multiple mass analyzers, but the validator rejects any instrument
  that contains more than one mass analyzer. Currently, only one
  subelement is allowed under , and the element is only allowed to have
  one child mass analyzer type CV term.

* There are serious problems with the CV terms under
  scan-->scanning method and spectrum-->spectrum type. There is partial
  duplication of analogous terms (example of analogous terms: "SIM
  spectrum" <===> "selected ion monitoring") between the two
  categories. While it's not clear that the duplication of analogous
  terms is desirable, as pointed out in a previous email thread, in any
  case, there should be either no duplication or full duplication;
  partial duplication is obviously flat-out wrong. Is it possible to
  devise a plan to resolve this? I don't think it will take too much
  time and effort to work things out, but the current state of these CV
  terms is unworkable.

* Somewhat related to the previous point, I suggest that the CV terms
  "full scan" and "zoom scan" under scan-->scanning method be re-named.
  The reason for doing a zoom scan is to scan more slowly over a
  smaller, zoomed m/z range (without sacrificing time) in order to
  obtain a spectrum with improved resolution. A name more descriptive
  of the purpose would be improved resolution or enhanced resolution.
  "Full scan" does not convey all that much information, so it could
  actually be removed.

Thanks,
Wilfred
```
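
Method 1 from the message above could be sketched roughly as follows, for the simple case of a constant m/z spacing. The function names `encode_sparse` and `decode_sparse` and the idea of passing `start`, `step` and `count` alongside the pairs are assumptions made for illustration; nothing like this is defined in mzML.

```python
from typing import List, Tuple


def encode_sparse(intensities: List[float], start: float, step: float) -> List[Tuple[float, float]]:
    """Keep only the (m/z, intensity) pairs whose intensity is non-zero.

    The m/z of grid point i is assumed to be start + i * step.
    """
    return [(start + i * step, y) for i, y in enumerate(intensities) if y != 0]


def decode_sparse(pairs: List[Tuple[float, float]], start: float, step: float, count: int):
    """Rebuild the full profile, restoring the implicit zero-intensity points."""
    dense = [0.0] * count
    for mz, y in pairs:
        # Map the stored m/z back to its position on the regular grid.
        index = round((mz - start) / step)
        dense[index] = y
    mz_axis = [start + i * step for i in range(count)]
    return mz_axis, dense
```

A reader that skipped the `decode_sparse` step would plot straight lines between distant non-zero points, which is the "funny" plot the message warns about. For a TOF instrument the constant `step` would have to be replaced by a spacing function, which is the complication raised at the end of the first bullet.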

Re: [Psidev-ms-dev] mzML comments From: Fredrik Levander - 2008-08-27 07:53
```
Hi Wilfred,

some comments to some of your comments:

Wilfred H Tang wrote:
>
> * When m/z vs. intensity data is written out in profile mode, it is
> pretty common to see a LARGE majority of the intensities to be zero.
> Given the preponderance of zero intensities, a space-efficient way to
> write the data out would be to specify a point spacing in the m/z
> dimension and then write out a (m/z, intensity) pair only if the
> intensity is non-zero. (Call this method 1.) The alternative, less
> space-efficient way would be to write out all of the (m/z, intensity)
> data pairs even though most of them have zero intensities and hence
> are not all that interesting. (Call this method 2.) For method 1 to
> work well, there must be a way to specify a m/z point spacing. Is
> there a way to do this currently? Furthermore, the program reading in
> the mzML must understand that the m/z point spacing implicitly
> requires reconstruction of all the zero-intensity data pairs;
> otherwise, for example, a mass spectrum plot would look funny. A
> further complication for method 1 is that the m/z point spacing may
> not necessarily be a constant. For example, for the AB/Sciex QSTAR
> instrument, the m/z spacing is proportional to the square root of m/z,
> and this is a natural consequence of this being a TOF instrument.

There is a method 3 which efficiently reduces space for profile spectra
which contain a lot of zeros. All data points with zero intensity that
are surrounded by data points of zero intensity can be left out. If you
have the following arrays:

int: 1 5 1 0 0 0 0 0 1 6
m/z: 1 2 3 4 5 6 7 8 9 10

These can be reduced to:

int: 1 5 1 0 0 1 6
m/z: 1 2 3 4 8 9 10

This is ok to do in mzML.

On the other hand, it would be very useful to have a way to specify the
m/z spacing, since it can be quite tricky to get this for TOF data,
especially when a calibration function has been applied over the square
root spaced m/z values, so that they are no longer spaced exactly
proportional to the square root of m/z. Probably the initial spacing and
polynomial calibration functions could be specified using CV terms, just
that such terms are not in the CV (yet). Suggestions for this would be
welcome.

> * The validator expects elements to appear in a certain order. This is
> due to the usage of xs:sequence in the XSD file. All deviations from
> the specified order are marked as errors, and I don't think that this
> is really the desired behavior. There's nothing intrinsic to XML that
> makes restricting order desirable, and in most cases for mzML, there
> is absolutely nothing to be gained by restricting order.

I think the order gets quite important since parsers will not be able to
load most files into memory. SAX/StAX parsing is needed due to the large
size of mzML files. Things get easier when parsing the files if we know
that referencableParamGroups and other referencable things are found in
the beginning of the file.

>
> * The validator doesn't appear to recognize at all - i.e.,
> any time is put into the mzML, the validator gives an
> error. This may possibly be related to the previous point, but I tried
> putting in all possible locations, and nothing seemed to
> work.

The userParams are simply ignored by the semantic validator (or at least
are supposed to be). On the other hand, the xsd specifies that for a
given element cvParams must come before userParams. I don't think this
is a problem. If we were allowed to write a mixture of cvParams and
userParams in a block of data, we could not be sure which are related
anyway due to the unordered nature of XML. The tiny1 example and also
the peak list example files contain userParams and validate.

>
> * For the element, the cvParam mapping rule "MUST supply
> a *child* term of MS:1000561 (data file checksum type) one or more
> times" should be deleted. The checksum of the SOURCE data file seems
> to be completely irrelevant.

I also agreed on this previously, but was convinced after discussions
that this is important for the integrity of data. The file checksum of
the source file is irrelevant when looking at spectra in the file, but
very important for traceability of data, and this is also a key role of
mzML. But in some cases it is not workable to retrieve the checksum of a
source file, if it was several steps upstream in the analysis for
example, and not available to a converter. I guess just specifying
'unknown' as checksum value is OK, the requirement for the CV term just
points out that one really should try to specify the checksum value if
possible.

>
> * There is a mistake somewhere in the rules regarding the
> specification of mass analyzer. There are numerous instrument types
> that have multiple mass analyzers, but the validator rejects any
> instrument that contains more than one mass analyzer. Currently, only
> one subelement is allowed under , and the
> element is only allowed to have one child mass analyzer
> type CV term.

You can indeed have several analyzer elements in your componentList,
see:
http://trac.thep.lu.se/trac/fp6-prodac/browser/trunk/mzML/plgs_example.mzML
at lines 29-44.

Regards
Fredrik
```
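
The "method 3" reduction described in this reply can be sketched in a few lines. The function name `trim_zero_runs` is hypothetical, and the handling of the first and last points (a missing neighbour is treated as zero) is an assumption, since the message does not say what should happen at the ends of the array.

```python
def trim_zero_runs(mz, intensity):
    """Drop zero-intensity points whose neighbours on both sides are also zero.

    Zeros directly next to a non-zero point are kept, so each profile
    peak still starts and ends on the baseline.
    """
    n = len(intensity)
    kept_mz, kept_int = [], []
    for i in range(n):
        # Assumption: outside the array the signal is taken to be zero.
        left = intensity[i - 1] if i > 0 else 0
        right = intensity[i + 1] if i < n - 1 else 0
        if intensity[i] != 0 or left != 0 or right != 0:
            kept_mz.append(mz[i])
            kept_int.append(intensity[i])
    return kept_mz, kept_int
```

Applied to the arrays from the message (`int: 1 5 1 0 0 0 0 0 1 6`, `m/z: 1` to `10`), this keeps the zeros at m/z 4 and 8 and drops those at 5, 6 and 7, reproducing the reduced arrays shown above.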

Re: [Psidev-ms-dev] mzML comments From: Chris Allen - 2008-08-27 10:28
```
Fredrik Levander wrote:
> Wilfred H Tang wrote:
>> For method 1 to
>> work well, there must be a way to specify a m/z point spacing. Is
>> there a way to do this currently? Furthermore, the program reading in
>> the mzML must understand that the m/z point spacing implicitly
>> requires reconstruction of all the zero-intensity data pairs;
>> otherwise, for example, a mass spectrum plot would look funny.

Not only that, with profile data points missing it makes it very
difficult (if not impossible) to fit the mass scale. Then you have to
look at alternatives like regridding the data.

>> A further complication for method 1 is that the m/z point spacing may
>> not necessarily be a constant. For example, for the AB/Sciex QSTAR
>> instrument, the m/z spacing is proportional to the square root of m/z,
>> and this is a natural consequence of this being a TOF instrument.

> There is a method 3 which efficiently reduces space for profile spectra
> which contain a lot of zeros. All data points with zero intensity that
> are surrounded by data points of zero intensity can be left out. If you
> have the following arrays:
> int: 1 5 1 0 0 0 0 0 1 6
> m/z: 1 2 3 4 5 6 7 8 9 10
> These can be reduced to:
> int: 1 5 1 0 0 1 6
> m/z: 1 2 3 4 8 9 10
> This is ok to do in mzML.

If that's OK in mzML, there really should be a CV term to say "profile
with points missing" otherwise you have to look through the data to try
and figure out if the spectrum is complete or not.

> On the other hand, it would be very useful with a way to specify the m/z
> spacing, since it can be quite tricky to get this for TOF data,
> especially when a calibration function have been applied over the square
> root spaced m/z values, so that they are no longer spaced exactly
> proportional to the square root of m/z. Probably the initial spacing and
> polynomial calibration functions could be specified using CV terms, just
> that such terms are not in the CV (yet). Suggestions for this would be
> welcome.

Agreed, but I suspect many instrument vendors will be unwilling to
divulge their calibration functions/constants.

Regards,
Chris
```
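
Without a CV term, "looking through the data" to decide whether a profile is complete might look something like the sketch below. It assumes a nearly constant m/z spacing and flags any step that is clearly larger than the typical one; the function name `find_gaps` and the `tolerance` factor of 1.5 are illustrative choices, not anything defined by mzML.

```python
from statistics import median


def find_gaps(mz, tolerance=1.5):
    """Return (left, right) m/z pairs whose spacing is unusually large.

    A step counts as a gap when it exceeds `tolerance` times the median
    step, which assumes that most points are still on the regular grid.
    """
    steps = [b - a for a, b in zip(mz, mz[1:])]
    if not steps:
        return []
    typical = median(steps)
    return [(a, b) for a, b, step in zip(mz, mz[1:], steps) if step > tolerance * typical]
```

For the reduced m/z array from the quoted example this reports a single gap between 4 and 8, while the original array yields no gaps. For TOF data, where the spacing grows with the square root of m/z, the same check would have to be run on a transformed axis, and a heavily thinned spectrum could make the median itself unreliable, which is part of the argument for an explicit CV term.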

Re: [Psidev-ms-dev] mzML comments From: Fredrik Levander - 2008-08-27 12:58
```
> If that's OK in mzML, there really should be a CV term to say "profile
> with points missing" otherwise you have to look through the data to try
> and figure out if the spectrum is complete or not.
>
>

Yes, this was actually discussed in a thread in April:
http://sourceforge.net/mailarchive/message.php?msg_id=48070F90.60609%40matrixscience.com

In the last post David Creasy came up with the following list:

centroided
profile
linear
quadratic
higher order polynomial
FT
other
contiguous
missing zeros (thresholded)
multiple calibration coefficients
multiple segments (stitched psd)

I don't know if this discussion continued elsewhere, but anyway it would
be nice to be able to have a CV term group for m/z spacing: linear /
quadratic etc, and also possibilities for annotating contiguous /
missing zeros. ms-mapping rules could then be added for 'm/z spacing',
etc, for profile spectra.

Fredrik
```
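
To make the "linear" and "quadratic" spacing categories concrete, here is a small sketch of how a reader might regenerate an m/z axis from a few parameters. Mapping "quadratic" to an axis whose square root is linear in the point index is an assumption based on the square-root spacing mentioned earlier in the thread for TOF data; the function and parameter names are hypothetical, and no such CV terms existed at the time of this message.

```python
import math


def linear_axis(start, step, count):
    """m/z grid with constant spacing: m_i = start + i * step."""
    return [start + i * step for i in range(count)]


def quadratic_axis(start, sqrt_step, count):
    """m/z grid in which sqrt(m/z) grows linearly with the point index.

    m_i = (sqrt(start) + i * sqrt_step) ** 2, so the spacing between
    neighbouring points is roughly 2 * sqrt_step * sqrt(m_i).
    """
    root = math.sqrt(start)
    return [(root + i * sqrt_step) ** 2 for i in range(count)]
```

A calibration polynomial applied on top of such a grid would break the exact square-root relationship, which is why the thread also lists "higher order polynomial" and "multiple calibration coefficients" as separate cases.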

Re: [Psidev-ms-dev] mzML comments From: Angel Pizarro - 2008-08-27 13:20
```
On Wed, Aug 27, 2008 at 1:58 AM, Wilfred H Tang <TangWH@...> wrote:

>
> * When m/z vs. intensity data is written out in profile mode, it is pretty
> common to see a LARGE majority of the intensities to be zero. Given the
> preponderance of zero intensities, a space-efficient way to write the data
> out would be to specify a point spacing in the m/z dimension and then write
> out a (m/z, intensity) pair only if the intensity is non-zero. (Call this
> method 1.) The alternative, less space-efficient way would be to write out
> all of the (m/z, intensity) data pairs even though most of them have zero
> intensities and hence are not all that interesting. (Call this method 2.)
> For method 1 to work well, there must be a way to specify a m/z point
> spacing. Is there a way to do this currently? Furthermore, the program
> reading in the mzML must understand that the m/z point spacing implicitly
> requires reconstruction of all the zero-intensity data pairs; otherwise, for
> example, a mass spectrum plot would look funny. A further complication for
> method 1 is that the m/z point spacing may not necessarily be a constant.
> For example, for the AB/Sciex QSTAR instrument, the m/z spacing is
> proportional to the square root of m/z, and this is a natural consequence of
> this being a TOF instrument.
>

Two points:
1) compression should make file size concerns (relatively) moot.
2) including the zeros makes reading in documents (and translation to/from
other formats) much easier.

I just don't see the cost/benefit ratio being in favor of adding this sort
of complexity to the standard.

>
> * The validator expects elements to appear in a certain order. This is due
> to the usage of xs:sequence in the XSD file. All deviations from the
> specified order are marked as errors, and I don't think that this is really
> the desired behavior. There's nothing intrinsic to XML that makes
> restricting order desirable, and in most cases for mzML, there is absolutely
> nothing to be gained by restricting order.
>

Three points:
1) Order of elements is intrinsic to XML and always has been. It makes XPath
traversal and implementation possible.
2) As Fredrik mentions in a later email, SAX parsing strategies are made
easier due to the current ordering of elements.
3) Ordering of elements makes writing mzML documents easier.

Cheers,
angel
```
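
The first point, that compression largely removes the size penalty of writing out the zeros, can be checked with a quick experiment. The sketch below packs an intensity array as little-endian 64-bit floats and compresses it with zlib; the helper name `packed_sizes` is hypothetical, and the base64 step that mzML applies to binary arrays is left out for brevity.

```python
import struct
import zlib


def packed_sizes(values):
    """Return (raw_bytes, zlib_bytes) for a list of floats packed as float64."""
    raw = struct.pack(f"<{len(values)}d", *values)
    return len(raw), len(zlib.compress(raw))


# A mostly-zero profile: one small peak surrounded by long zero runs.
profile = [0.0] * 1000 + [1.0, 5.0, 1.0] + [0.0] * 1000
raw_size, compressed_size = packed_sizes(profile)
print(raw_size, compressed_size)
```

The long zero runs compress to a small fraction of their raw size. The matching m/z array does not contain zeros, so it would compress far less well, and that is where an explicit spacing description would still save space.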

Re: [Psidev-ms-dev] mzML comments From: Matthew Chambers - 2008-08-27 14:41
```
Chris Allen wrote:
> Fredrik Levander wrote:
>
>> Wilfred H Tang wrote:
>>
>>> For method 1 to
>>> work well, there must be a way to specify a m/z point spacing. Is
>>> there a way to do this currently? Furthermore, the program reading in
>>> the mzML must understand that the m/z point spacing implicitly
>>> requires reconstruction of all the zero-intensity data pairs;
>>> otherwise, for example, a mass spectrum plot would look funny.
>>>
> Not only that, with profile data points missing it makes it very
> difficult (if not impossible) to fit the mass scale. Then you have to
> look at alternatives like regridding the data.
>
It's not clear to me what you mean here. If you're referring to the
scan range, that is specified in .

>>> A further complication for method 1 is that the m/z point spacing may
>>> not necessarily be a constant. For example, for the AB/Sciex QSTAR
>>> instrument, the m/z spacing is proportional to the square root of m/z,
>>> and this is a natural consequence of this being a TOF instrument.
>>>
>> There is a method 3 which efficiently reduces space for profile spectra
>> which contain a lot of zeros. All data points with zero intensity that
>> are surrounded by data points of zero intensity can be left out. If you
>> have the following arrays:
>> int: 1 5 1 0 0 0 0 0 1 6
>> m/z: 1 2 3 4 5 6 7 8 9 10
>> These can be reduced to:
>> int: 1 5 1 0 0 1 6
>> m/z: 1 2 3 4 8 9 10
>> This is ok to do in mzML.
>>
> If that's OK in mzML, there really should be a CV term to say "profile
> with points missing" otherwise you have to look through the data to try
> and figure out if the spectrum is complete or not.
>
If profile data is not contiguous, it must have gone through some data
processing to do the zero thresholding (typically resulting in output
like method 3 which preserves the integrity of each profile). That
processing would be represented in dataProcessing:

[Term]
id: MS:1000594
name: low intensity data point removal
def: "The removal of very low intensity data points that are likely to be spurious noise rather than real signal." [PSI:MS]

[Term]
id: MS:1000629
name: low intensity threshold
def: "Threshold below which some action is taken." [PSI:MS]

If users desire the unprocessed data, that is the job of the converter,
or a "data unprocessor" :)
```

[Psidev-ms-dev] multiple mass analyzers, documentation From: Wilfred H Tang - 2008-08-27 16:58
```
> > * There is a mistake somewhere in the rules regarding the
> > specification of mass analyzer. There are numerous instrument types
> > that have multiple mass analyzers, but the validator rejects any
> > instrument that contains more than one mass analyzer. Currently, only
> > one subelement is allowed under , and the
> > element is only allowed to have one child mass analyzer
> > type CV term.

> You can indeed have several analyzer elements in your componentList, see:
> http://trac.thep.lu.se/trac/fp6-prodac/browser/trunk/mzML/plgs_example.mzML
> at line 29-44.

Yes, I see from the example how that is supposed to work. So it appears
that the documentation
(http://www.sbeams.org/tmp/mzML1.0.0.html#analyzer) is wrong?

Element
Definition: List with the different components used in the mass
spectrometer. At least one source, one mass analyzer and one detector
need to be specified.

Attributes:
Attribute Name   Data Type              Use       Definition
count            xs:nonNegativeInteger  required  The number of components in this list.

Subelements:
Subelement Name  minOccurs  maxOccurs  Definition
source           1          1          A source component.
analyzer         1          1          A mass analyzer (or mass filter) component.
detector         1          1          A detector component.
```

Re: [Psidev-ms-dev] multiple mass analyzers, documentation From: Wilfred H Tang - 2008-08-27 21:16
```
More to the point, the XSD + ms-mapping.xml file say one thing, while
the HTML documentation says another thing. Why is there a discrepancy?
And is there consensus as to which is correct?

Thanks,
Wilfred


Wilfred H Tang
Sent by: psidev-ms-dev-bounces@...
08/27/2008 09:58 AM
Please respond to Mass spectrometry standard development
To Mass spectrometry standard development
cc
Subject [Psidev-ms-dev] multiple mass analyzers, documentation

> > * There is a mistake somewhere in the rules regarding the
> > specification of mass analyzer. There are numerous instrument types
> > that have multiple mass analyzers, but the validator rejects any
> > instrument that contains more than one mass analyzer. Currently, only
> > one subelement is allowed under , and the
> > element is only allowed to have one child mass analyzer
> > type CV term.

> You can indeed have several analyzer elements in your componentList, see:
> http://trac.thep.lu.se/trac/fp6-prodac/browser/trunk/mzML/plgs_example.mzML
> at line 29-44.

Yes, I see from the example how that is supposed to work. So it appears
that the documentation
(http://www.sbeams.org/tmp/mzML1.0.0.html#analyzer) is wrong?

Element
Definition: List with the different components used in the mass
spectrometer. At least one source, one mass analyzer and one detector
need to be specified.

Attributes:
Attribute Name   Data Type              Use       Definition
count            xs:nonNegativeInteger  required  The number of components in this list.

Subelements:
Subelement Name  minOccurs  maxOccurs  Definition
source           1          1          A source component.
analyzer         1          1          A mass analyzer (or mass filter) component.
detector         1          1          A detector component.

-------------------------------------------------------------------------
This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
Build the coolest Linux based applications with Moblin SDK & win great prizes
Grand prize is a trip for two to an Open Source event anywhere in the world
http://moblin-contest.org/redirect.php?banner_id=100&url=/
_______________________________________________
Psidev-ms-dev mailing list
Psidev-ms-dev@...
https://lists.sourceforge.net/lists/listinfo/psidev-ms-dev
```

Re: [Psidev-ms-dev] multiple mass analyzers, documentation From: Lennart Martens - 2008-08-28 10:44
```
Hi Wilfred,

The instrument configuration is a rather complex bit of XML Schema
engineering. The configuration has a list of components (at least 3
components must be specified) and component is an abstract type for
source, analyzer and detector. This allows the mixing of any number of
components in any order, as long as there are a minimum of three (a
source, an analyzer, and a detector).

As you've pointed out, the documentation thus contains some errors.
We'll fix these.

Finally, when in doubt, always consult the XML Schema -- it is the basic
definition of an mzML file. The semantic validator then simply adds an
additional layer on top of the schema.

Cheers,

lnnrt.

Wilfred H Tang wrote:
>
> > > * There is a mistake somewhere in the rules regarding the
> > > specification of mass analyzer. There are numerous instrument types
> > > that have multiple mass analyzers, but the validator rejects any
> > > instrument that contains more than one mass analyzer. Currently, only
> > > one subelement is allowed under , and the
> > > element is only allowed to have one child mass analyzer
> > > type CV term.
>
> > You can indeed have several analyzer elements in your componentList, see:
> > http://trac.thep.lu.se/trac/fp6-prodac/browser/trunk/mzML/plgs_example.mzML
> > at line 29-44.
>
> Yes, I see from the example how that is supposed to work. So it appears
> that the documentation
> (http://www.sbeams.org/tmp/mzML1.0.0.html#analyzer) is wrong?
>
> Element
> Definition: List with the different components used in the mass
> spectrometer. At least one source, one mass analyzer and one detector
> need to be specified.
>
> Attributes:
> Attribute Name   Data Type              Use       Definition
> count            xs:nonNegativeInteger  required  The number of components in this list.
>
> Subelements:
> Subelement Name  minOccurs  maxOccurs  Definition
> source           1          1          A source component.
> analyzer         1          1          A mass analyzer (or mass filter) component.
> detector         1          1          A detector component.
>
```
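
The rule described in this reply (any number of components in any order, with at least one source, one analyzer and one detector) could be checked along these lines. This is only a sketch under stated assumptions: the function name `check_component_list` is hypothetical, the example fragment is made up and ignores the XML namespace a real mzML file would carry, and actual validation should be done against the XML Schema, as the reply recommends.

```python
import xml.etree.ElementTree as ET

REQUIRED_KINDS = ("source", "analyzer", "detector")

# Hypothetical fragment: one source, two analyzers, one detector.
EXAMPLE = """
<componentList count="4">
  <source/>
  <analyzer/>
  <analyzer/>
  <detector/>
</componentList>
""".strip()


def check_component_list(xml_text):
    """Return a list of problems found in a componentList fragment."""
    root = ET.fromstring(xml_text)
    kinds = [child.tag for child in root]
    problems = [f"missing {kind}" for kind in REQUIRED_KINDS if kind not in kinds]
    # The documentation quoted above describes count as the number of components.
    if root.get("count") != str(len(kinds)):
        problems.append("count does not match number of components")
    return problems


print(check_component_list(EXAMPLE))
```

The fragment with two analyzers passes the check, which is the behaviour the thread concludes is intended, in contrast to the `maxOccurs` of 1 shown in the HTML documentation table.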