From: Matthew C. <mat...@va...> - 2008-12-15 16:12:11
Hi Richard,

You posted this message on the analysisXML mailing list. :) I'll cc it to the ms-dev list along with my response, although I expect the PI guys will have valuable comments as well.

Richard Scheltema wrote:
> Dear all,
>
> I'm writing a metabolomics Java library for my processing software
> targeted towards high resolution LC/MS data (components like: peak
> picking, noise detection, etc). The basic element for this library is a
> mass chromatogram (x-axis time, y-axis intensity). In order to deal with
> the size of the datasets, enable other people to easily write
> components, etc. an intermediate file format is required. I've already
> defined one myself to see what would be required for a really dynamic
> pipeline and created a library on top of it. In order to hop on the
> international standards bandwagon I would of course like to move to a
> recognized standard like mzML. However, this _seems_ to fall short of my
> requirements. I've been browsing through this list and the standards
> document and tried to implement something, but I can't seem to figure
> out how to approach this.
>
> What I would like to be able to do is store the following pieces of
> information:
>
> - Multiple runs in a single file
> From the specification document: "A run in mzML should correspond to a
> single, consecutive and coherent set of scans on an instrument". This
> means that I essentially can only store data from a single raw file? I
> would like to do mix-models (see sets).
>

mzML does not support multiple runs in a single file. That was a design decision to make the format simpler. Nothing is stopping you from managing a collection of mzML files as a set, though. But it's important to understand that mzML is intended to store the raw output of a vendor's instrument, as well as processing to the level of a peak list.
The chromatogram semantics were tacked on as something of an afterthought: I lobbied for them because on-demand building of chromatograms from tens of thousands of tiny SRM spectra is ridiculously slow.

> - MassChromatograms (single mass)
> I would like to ubiquitously store mass chromatograms picked from the
> raw data. This means that both mass chromatograms made from centroid as
> well as profile data need to be stored. There is the option to store
> chromatogram data, but this seems limited to '2D' data where mass
> chromatogram data built with profile data needs '3D' data. In order to
> solve this the accession="MS:1000627" name="selected ion current
> chromatogram" needs sub-children? Or can it be set globally in the
> header or run, but then I would like to be able to mix models (see sets).
>

I'm not sure what this would look like. We certainly didn't have 3D chromatograms in mind, but perhaps they can be accommodated. Would that be a chromatogram with three axes (data arrays): time, m/z, and intensity? Is this akin to the "pseudo-2D-gel" view?

> - BackgroundIons
> The use of background ions is part of the pipeline. To store this an
> addition to the CV needs to be made.
> name: backgroundion chromatogram
> is_a: chromatogram type (MS:1000626)
> definition: chromatogram created by creating an array of a ubiquitously
> present mass.
>

Yes, I think a background ion chromatogram is reasonable. But there may be some semantics to work out: is there a distinction between "background ions" and "noise"? I know there are different kinds of noise to think about...

> - sets
> The goal of the pipeline is to combine information from lots of
> measurements (biological, technical replicates, different machines, etc)
> and perform various analysis methods. This means for example that
> background ions or mass chromatograms from various measurements need to
> be combined into sets. Different operations and visualizations are then
> possible on the data stored in the files.
> The same as the different runs
> applies here. Another option would be to make sets of sets, which means
> that the relation needs to be recursive. I can probably solve some of
> the issues with id-fields, but that would make it hard to parse for
> other people and sort of rule out the recursive relation.
>
>
> I can of course solve a lot with the use of userParam tags, but then
> other people will have a hard time reading the data. Another thing, I
> once heard somebody mention AnalysisML to be something along these
> lines, however this project seems to have suffered a fatal end as I
> cannot find anything? Am I trying to use mzML outside of its boundaries?
> If so, is a viable alternative available which I have so far been unable
> to find?
>

AnalysisXML is alive and well and, AFAIK, about to be submitted to the PSI documentation process. As I understand it, it is mostly a proteomics identification standard at the moment, but with a flexible enough framework to support other kinds of data in later releases. I don't recall if it supports multiple runs per file. You can check out examples and schema at:

http://code.google.com/p/psi-pi/

-Matt
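
To make the "collection of mzML files as a set" idea a bit more concrete, here is a minimal sketch of how a recursive set-of-sets could be kept outside the mzML files themselves. It is written in Python for brevity, but the same structure maps directly onto a Java class. Nothing in it is part of mzML or any PSI standard: the `MeasurementSet` class, its fields, and all file names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Iterator, List


@dataclass
class MeasurementSet:
    """A named group of mzML files and/or nested sets (hypothetical structure)."""
    name: str
    files: List[Path] = field(default_factory=list)
    subsets: List["MeasurementSet"] = field(default_factory=list)

    def iter_files(self) -> Iterator[Path]:
        """Yield this set's own files first, then each nested set's files, depth-first."""
        yield from self.files
        for subset in self.subsets:
            yield from subset.iter_files()


# Example: two replicate groups nested under one experiment.
# All file names are made up; each would be an ordinary single-run mzML file.
rep_a = MeasurementSet("replicate_a", [Path("a1.mzML"), Path("a2.mzML")])
rep_b = MeasurementSet("replicate_b", [Path("b1.mzML")])
experiment = MeasurementSet("experiment", [Path("blank.mzML")], [rep_a, rep_b])

all_files = [p.name for p in experiment.iter_files()]
print(all_files)
```

Keeping the grouping in a separate structure like this leaves every mzML file a plain single-run file that other tools can read, while the recursion lives entirely in the set description.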