[Psidev-ms-dev] MzData comments

While working on the implementation of mzData and mzAnalysis in the
Proteios database project (www.proteios.org), I came up with some questions
on mzData that some of you may be interested in and have comments on. As I
got some excellent answers from Randall Julian, these are posted too, along
with some new comments (maybe a bit messy):

At 14:04 2005-12-16, Randy Julian wrote:
>Fredrik,
>
>Thanks for your note - I have some comments mixed in below:
>
>
>Fredrik Levander wrote:
>
>>
>>We are a little bit confused on how to use mzData 1.05 when a peak list 
>>is processed multiple times with different software. We do peak 
>>extraction of MALDI spectra with one software module and then filtering 
>>and recalibration with another. The question is how to handle this with 
>>mzData. One approach would be to generate one mzData for the first step 
>>and another for the second step. This way the 
>>description.admin.sourceFile for the second mzData could be the first 
>>mzData and the 'software' element in each is the software that was used 
>>in that processing step. This should be ok. However, if we do not keep 
>>the intermediate peak lists there should be multiple 
>>description.dataProcessing elements which is not allowed in the mzData. I 
>>therefore suggest that multiple dataProcessing elements should be 
>>allowed, or is there any other solution on how to treat this?
>I will look at the idea of making the dataProcessing elements unbounded, 
>but the way we imagined doing what you are suggesting is to use the 
>supplemental data section.  Depending on what you want to be considered 
>the primary spectrum, you can store one spectrum in the mz/Inten vectors, 
>and all other spectra as supplemental data.  The description of how the 
>supplemental data were created is supposed to be handled by the 
>supplemental data description.  This part has not been tested that much 
>and the original use case was for storing time-domain or frequency-domain 
>data which were used to compute a mass spectrum.  I would like to work 
>with you on this so that if there is some problem with your use case we 
>can correct mzData for the 1.1 release.
Storing intermediate spectra as supplemental data should work fine. 
However, I would guess that most search engines and other software would 
only handle the peak lists that come in the intenArrayBinary and 
mzArrayBinary elements. If one would like to use peak lists that are in the 
supplementary data section this may be less straightforward.
Making 'dataProcessing' unbounded would still be useful for the use case 
when several software modules are used in a pipe without any need for 
storing intermediate data. It is still of interest to store all the 
parameters from the different modules, and I think that the dataProcessing 
section would be the most logical place.
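
To make the pipeline use case concrete, here is a minimal Python sketch of
what a description with several dataProcessing elements could look like.
The child layout (software with name and version, processingMethod with
userParam entries) is my reading of the schema and should be checked
against it; the module names and parameters are hypothetical. Note that the
output would not validate against mzData 1.05, which allows only one
dataProcessing element.

```python
import xml.etree.ElementTree as ET


def build_description(steps):
    """Build a <description> element holding one <dataProcessing> per step.

    Assumption: each step is a dict with 'name', 'version' and 'params'.
    """
    description = ET.Element("description")
    for step in steps:
        processing = ET.SubElement(description, "dataProcessing")
        software = ET.SubElement(processing, "software")
        ET.SubElement(software, "name").text = step["name"]
        ET.SubElement(software, "version").text = step["version"]
        # Keep the parameters of this module next to the software that used them.
        method = ET.SubElement(processing, "processingMethod")
        for key, value in step["params"].items():
            ET.SubElement(method, "userParam", name=key, value=str(value))
    return description


# Hypothetical two-module pipeline: peak extraction, then recalibration.
steps = [
    {"name": "PeakExtractor", "version": "1.0",
     "params": {"signalToNoise": 3}},
    {"name": "Recalibrator", "version": "0.2",
     "params": {"calibration": "internal"}},
]

description = build_description(steps)
xml_text = ET.tostring(description, encoding="unicode")
print(xml_text)
```

The order of the elements records the order of the processing steps, so no
intermediate peak list has to be kept to document the pipeline.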

>>I have had a look at the mzData 1.10 draft and I do not think I get how 
>>the chromatogram info is going to be used. As I see it the mzData could 
>>either contain raw data, in which case chromatogram data do not have to 
>>be there since it is contained in the spectra, or peak lists. If the data 
>>is peak lists, every 'spectrum' is actually a peak list which may have 
>>been generated from the summing of neighboring spectra, maybe in 
>>combination with peak extraction. In this case chromatographic 
>>information for that specific peak list, or rather for individual peaks 
>>could be interesting to keep for quantitation purposes. However, then the 
>>chromatogram info should belong to individual spectrum elements, and not 
>>to the entire mzData as is currently suggested. As it is now, the 
>>chromatogramList is placed at the top of the structure, so I wonder 
>>what it is supposed to contain. Is it for intensity traces of 
>>individual masses over an entire LC-MS run?
>The chromatogram element was intended to answer two use cases: 1) single- 
>and multiple-reaction-monitoring experiments in which the output from the 
>instrument is more rightly considered single channel than spectral; 2) 
>convenience for those who wanted to store TIC, BP or Extracted-Ion 
>chromatograms (the case you mentioned).
>
>The TIC, BP and XIC can always be computed, so we left it out of 1.05, but 
>the MRM experiments now being developed for targeted proteomics are a 
>problem for 1.05.  In principle, you could store a single mz value and a 
>single intensity for thousands of 'spectra', but this is a waste compared 
>to storing them as a chromatogram.
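
As an illustration of the point that the TIC, BP and XIC can always be
computed from the spectra, here is a minimal Python sketch. It assumes each
spectrum has already been decoded into a tuple of (retention time, m/z
list, intensity list); that in-memory layout and the function names are my
own assumptions and are not part of mzData.

```python
def total_ion_chromatogram(spectra):
    """Sum all intensities in each spectrum (TIC)."""
    return [(rt, sum(intensities, 0.0)) for rt, _mzs, intensities in spectra]


def base_peak_chromatogram(spectra):
    """Take the most intense peak of each spectrum (BP)."""
    return [(rt, max(intensities, default=0.0))
            for rt, _mzs, intensities in spectra]


def extracted_ion_chromatogram(spectra, target_mz, tolerance):
    """Sum intensities within target_mz +/- tolerance per spectrum (XIC)."""
    trace = []
    for rt, mzs, intensities in spectra:
        total = sum(
            (inten for mz, inten in zip(mzs, intensities)
             if abs(mz - target_mz) <= tolerance),
            0.0,
        )
        trace.append((rt, total))
    return trace


# Three made-up spectra: (retention time, m/z values, intensities).
spectra = [
    (0.0, [500.0, 600.0], [10.0, 5.0]),
    (1.0, [500.1, 700.0], [20.0, 1.0]),
    (2.0, [600.0], [7.0]),
]

tic = total_ion_chromatogram(spectra)
bp = base_peak_chromatogram(spectra)
xic = extracted_ion_chromatogram(spectra, target_mz=500.0, tolerance=0.2)
print(tic, bp, xic)
```

For an MRM experiment the situation is reversed: the instrument output is
already a trace like xic, so wrapping each point in its own 'spectrum' only
adds overhead.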

>>The mzData accessionNumber: has it been dropped? I actually liked it as 
>>it was a quick way of identifying mzData files. Of course the sampleName 
>>could be used as an identifier, but it was nice to have it as a top level 
>>attribute, and accessionNumber sounds more like an ID than sampleName. I 
>>guess no one else wanted to keep that attribute?
>No, we should still have an accession number, but it may have been 
>accidentally dropped while editing - I will check on this.
I just didn't find it in the documentation, but I can see now that it is 
still in the schemas, sorry. Anyway, how about making this identifier 
really unique by specifying a URI-type format or similar? When people start 
to make their mzData files publicly available it would be great if the 
files could be identified unambiguously.
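
One possible reading of "URI-type format" is a UUID-based URN. The sketch
below is only an illustration of that idea, not a proposal for the exact
format; the helper name is hypothetical, and writing the value into an
accessionNumber attribute on the top-level element follows the discussion
above.

```python
import uuid
import xml.etree.ElementTree as ET


def new_accession_number():
    """Return a random UUID as a URN string, e.g. 'urn:uuid:...'.

    Assumption: a random (version 4) UUID is unique enough in practice
    that no central registry is needed.
    """
    return uuid.uuid4().urn


accession = new_accession_number()
# Attach the identifier to a bare top-level element for illustration.
root = ET.Element("mzData", version="1.05", accessionNumber=accession)
print(ET.tostring(root, encoding="unicode"))
```

A URL under a domain controlled by the data producer would be another
option; it is easier to read but depends on the producer keeping the
domain.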

>>A last comment is that the models contain many nodes and one-to-one 
>>relations which could be 'flattened'. For example, the 'software' element 
>>could be put directly into the dataProcessing element or even in the 
>>description container. I think this would make processing of the files 
>>easier and the schemas could be used straight off as database models with 
>>good performance. However, I guess it is more important that the xml 
>>files are easy to read by eye, so I'm not that sure.
>There are some legacy ideas still captured in the schema which should 
>eventually be worked out. In this area the most important idea is to 
>maintain stability unless some change adds exceptional value.  There are 
>some very "relational" things in this schema which are either easy or hard 
>depending on your perspective.  The idea that we use references to allow 
>spectra to point to parent spectra and to experimental descriptions seems 
>hard to deal with in straight XML, but maps well onto relational tables 
>and models.  Originally, the idea was to design a schema which, when fed 
>into an XML binding API (like JAXB, or Castor) would produce reasonable 
>classes.  In some instances, this is true, in others, it's not true - the 
>output from the JAXB schema compiler is a bit hard to read.
>
>Right now, refactoring ideas are centering on how to merge mzData with 
>mzXML to get a single data representation specification.  The main 
>difference appears to be in the method of extension: the use of controlled 
>vocabularies compared to fixed schema elements.  The plan is to have 1.1 
>come out with a formal specification and remain stable through most of 
>2006 while the PSI team works with the ISB and mzXML developers to propose 
>a merger for public review in the summer/fall of 2006, with adoption via 
>vote by the PSI likely targeted for the Spring 2007 PSI meeting.

Fredrik Levander

Dept of Protein Technology
Lund University
Sweden