Re: [Psidev-pi-dev] storing intermediate results in mzML

Hi Richard,

You posted this message on the analysisXML mailing list. :) I'll cc it 
to the ms-dev list along with my response, although I expect the PI guys 
will have valuable comments as well.

Richard Scheltema wrote:
> Dear all,
>
> I'm writing a metabolomics Java library for my processing software 
> targeted towards high resolution LC/MS data (components like: peak 
> picking, noise detection, etc). The basic element for this library is a 
> mass chromatogram (x-axis time, y-axis intensity). In order to deal with 
> the size of the datasets, enable other people to easily write 
> components, etc. an intermediate file format is required. I've already 
> defined one myself to see what would be required for a really dynamic 
> pipeline and created a library on top of it. In order to hop on the 
> international standards bandwagon I would of course like to move to a 
> recognized standard like mzML. However, this _seems_ to fall short of my 
> requirements. I've been browsing through this list and the standards 
> document and tried to implement something, but I can't seem to figure 
> out how to approach this.
>
> What I would like to be able to do is store the following pieces of 
> information:
>
> - Multiple runs in a single file
>  From the specification document: "A run in mzML should correspond to a 
> single, consecutive and coherent set of scans on an instrument". This 
> means that I essentially can only store data from a single raw file? I 
> would like to mix models (see sets).
>   
MzML does not support multiple runs in a single file. It was a design 
decision to make the format simpler. Nothing is stopping you from 
managing a collection of mzML files as a set, though. But it's important 
to understand that mzML is intended to store the raw output of a 
vendor's instrument, as well as processing to the level of a peak list. 
The chromatogram semantics were tacked on as something of an 
afterthought: I lobbied for it because on-demand building of 
chromatograms from tens of thousands of tiny SRM spectra is ridiculously 
slow.
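
As a rough illustration of managing a collection of mzML files as a set 
outside the format itself, here is a minimal Python sketch. The class 
names (RunFile, RunSet) and the file names are hypothetical and are not 
part of mzML or of any existing library; the sketch only assumes that 
each mzML file holds exactly one run and that sets may contain other sets.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Union


@dataclass
class RunFile:
    """One mzML file, which by design holds a single run (hypothetical wrapper)."""
    path: str


@dataclass
class RunSet:
    """A named group of run files and/or nested sets (hypothetical, not an mzML concept)."""
    name: str
    members: List[Union["RunSet", RunFile]] = field(default_factory=list)

    def files(self) -> Iterator[str]:
        """Yield every mzML path in this set, descending into nested sets."""
        for member in self.members:
            if isinstance(member, RunSet):
                yield from member.files()
            else:
                yield member.path


# Example file names are placeholders, not real data.
replicates = RunSet("technical replicates",
                    [RunFile("rep1.mzML"), RunFile("rep2.mzML")])
study = RunSet("study", [replicates, RunFile("blank.mzML")])
print(list(study.files()))
```

The grouping lives entirely in the surrounding software (or in a small 
manifest file), so each mzML file stays a plain single-run document that 
other tools can still read.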

> - MassChromatograms (single mass)
> I would like to ubiquitously store mass chromatograms picked from the 
> raw data. This means that both mass chromatograms made from centroid as 
> well as profile data need to be stored. There is the option to store 
> chromatogram data, but this seems limited to '2D' data where mass 
> chromatogram data built with profile data needs '3D' data. In order to 
> solve this the accession="MS:1000627" name="selected ion current 
> chromatogram" needs sub-children? Or can it be set globally in the 
> header or run, but then I would like to be able to mix models (see sets).
>   
I'm not sure what this would look like. We certainly didn't have 3D 
chromatograms in mind, but perhaps they can be accommodated. Would that 
be a chromatogram with three axes (data arrays): time, m/z, and 
intensity? Is this akin to the "pseudo-2D-gel" view?
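
To make the question concrete, here is a minimal Python sketch of a 
chromatogram held as three parallel arrays. It is only one assumed 
reading of the request: the ProfileChromatogram class and its method are 
hypothetical, and nothing here reflects how mzML itself encodes data.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ProfileChromatogram:
    """Three parallel arrays; point i is (time[i], mz[i], intensity[i])."""
    time: List[float]
    mz: List[float]
    intensity: List[float]

    def __post_init__(self) -> None:
        # The arrays only make sense if they have one entry per data point.
        if not (len(self.time) == len(self.mz) == len(self.intensity)):
            raise ValueError("time, mz and intensity must have equal length")

    def collapse_to_2d(self) -> Dict[float, float]:
        """Sum intensity over m/z at each time point (time -> total intensity)."""
        totals: Dict[float, float] = {}
        for t, i in zip(self.time, self.intensity):
            totals[t] = totals.get(t, 0.0) + i
        return totals


# Made-up values: two m/z samples at t=1.0 and one at t=2.0.
chrom = ProfileChromatogram(
    time=[1.0, 1.0, 2.0],
    mz=[100.01, 100.02, 100.01],
    intensity=[10.0, 20.0, 5.0],
)
print(chrom.collapse_to_2d())
```

Collapsing over m/z gives back the ordinary two-array (time, intensity) 
chromatogram, which is why the third array is the only real addition 
being asked about.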

> - BackgroundIons
> The use of background ions is part of the pipeline. To store this, an 
> addition to the CV needs to be made.
>  name: backgroundion chromatogram
>  is_a: chromatogram type (MS:1000626)
>  definition: chromatogram created by creating an array of a ubiquitously 
> present mass.
>   
Yes, I think a background ion chromatogram is reasonable. But there may 
be some semantics to work out: is there a distinction between 
"background ions" and "noise"? I know there are different kinds of noise 
to think about...

> - sets
> The goal of the pipeline is to combine information from lots of 
> measurements (biological, technical replicates, different machines, etc) 
> and perform various analyses. This means, for example, that 
> background ions or mass chromatograms from various measurements need to 
> be combined into sets. Different operations and visualizations are then 
> possible on the data stored in the files. The same as the different runs 
> applies here. Another option would be to make sets of sets, which means 
> that the relation needs to be recursive. I can probably solve some of 
> the issues with id-fields, but that would make it hard to parse for 
> other people and sort of rule out the recursive relation.
>
>
> I can of course solve a lot with the use of userParam tags, but then 
> other people will have a hard time reading the data. Another thing, I 
> once heard somebody mention AnalysisML to be something along these 
> lines, however this project seems to have suffered a fatal end as I 
> cannot find anything? Am I trying to use mzML outside of its boundaries? 
> If so, is a viable alternative available which I have so far been unable 
> to find?
>   
AnalysisXML is alive and well and about to submit to the PSI 
documentation process AFAIK. As I understand it, it is mostly a 
proteomics identification standard at the moment, but with a flexible 
enough framework to support other kinds of data in later releases. I 
don't recall if it supports multiple runs per file. You can check out 
examples and schema at: http://code.google.com/p/psi-pi/

-Matt