I would like to challenge the assumptions behind the need to put a great
deal of work into developing a "common" XML format - or, indeed, any
"standard" XML format at all - for the storage and interchange of mass
spec data.
An XML format is a syntax into which we embed semantics; the
semantics are obtained from an ontology. It is best to have one
ontology shared by all who create XML syntaxes. It does not matter
whether one syntax or many exist, as long as each embeds semantics
from a common ontology.
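To make this concrete, consider two deliberately different (and
entirely hypothetical) syntaxes for the same peak; the element and
attribute names and the accession numbers below are invented for
illustration only:

  <!-- Syntax A: one element per value, ontology term as an attribute -->
  <peak>
    <value term="MS:0001234">445.12</value>
    <value term="MS:0004321">680.0</value>
  </peak>

  <!-- Syntax B: values and term references packed into attributes -->
  <datapoint mz="445.12" mzTerm="MS:0001234"
             int="680.0" intTerm="MS:0004321"/>

The two documents look nothing alike, yet they are semantically
identical: both state that 445.12 is an instance of ontology term
MS:0001234 and 680.0 an instance of MS:0004321.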
The reasons why it does not matter include:
- any syntax will be unable to satisfy the data and workflow
requirements of all potential users, ergo many syntaxes will exist
anyway
- the cost of converting any XML syntax to any other XML syntax is
negligible, given embedded semantics from a common ontology
The element and attribute names used in an XML format carry some
amount of semantic content. That semantic content, however, usually
depends on a processing or workflow context shared by a project team
but rarely understood beyond it. Developing an ontology, together
with a method for embedding ontology links in the XML format, is a
powerful way of extending the semantic content to all potential
consumers of the data, independent of processing and workflow
contexts.
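The ontology itself is what makes those links meaningful outside the
originating project. In a hypothetical XML serialisation of the
ontology, the term referenced in the example above might be defined
as:

  <term accession="MS:0001234">
    <name>mass-to-charge ratio</name>
    <definition>The observed mass of an ion divided by its
    charge.</definition>
  </term>

Any consumer, in any processing context, can dereference MS:0001234
and recover the same meaning, which is precisely the
context-independence argued for above.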
An XML format is often developed not only to contain data but also to
make processing and workflows more efficient. It follows that any XML
format designed to be "common" across many or all projects will
usually be suboptimal for the processing and workflow of any one
project.
When data is encoded in a binary (non-ASCII/non-Unicode) format, that
format is usually tied to a particular processor architecture, e.g.
big-endian vs little-endian, 32-bit vs 64-bit, etc. The cost of
writing a data converter to change data from binary format A to
binary format B is high. A programming language that provides
bit-level manipulation, e.g. C/C++ or Java, is required, as is
precise documentation of the 'from' and 'to' formats. Any conversion
application will likely be fragile, unable to handle the slightest
change in the definition of the 'from' or 'to' format. Given that
cost and fragility, the need for strong data format standards is
correspondingly high.
I would like to suggest that with today's tools for manipulating XML
documents, e.g. Xalan and Saxon (both open source), the cost of
developing XML format converters is almost negligible and is still
decreasing. I also suggest that when the next generation of schema
definition standards is released, support for schema-embedded
ontology links will enable fully automatable XML format conversion:
all that will be required is the 'from' schema, the 'to' schema, and
the common ontology.
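To show how small such a converter already is, here is a sketch of a
complete XSLT stylesheet that converts the hypothetical Syntax B from
the earlier example into Syntax A (the file name and all element and
attribute names are, again, invented):

  <?xml version="1.0"?>
  <!-- b-to-a.xsl: convert Syntax B datapoints into Syntax A peaks.
       The ontology accessions travel with the data, so the
       conversion loses no semantic content. -->
  <xsl:stylesheet version="1.0"
                  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" indent="yes"/>

    <xsl:template match="datapoint">
      <peak>
        <value term="{@mzTerm}"><xsl:value-of select="@mz"/></value>
        <value term="{@intTerm}"><xsl:value-of select="@int"/></value>
      </peak>
    </xsl:template>
  </xsl:stylesheet>

It can be run through Xalan with something like the following (the
exact invocation varies by version):

  java org.apache.xalan.xslt.Process -IN data-b.xml -XSL b-to-a.xsl

That is the entire converter; the schema-aware, ontology-driven tools
I anticipate would generate even this much automatically.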
If there is zero cost to convert your XML format to my XML format and
my XML format is optimised (and extended) for my processing and workflow
requirements, then I don't care what your XML format is except that it
includes links to a common ontology. You don't care (and don't know)
what my XML format is. What both of us depend upon is the ontology that
imbues our data with a common semantics that allows the *data* rather
than the *format* to be exchanged and shared.
Rather than embarking on a project to merge mzXML and mzData into a
single standard XML format, I suggest it is much more important and
cost-effective to merge the ontologies into a single standard. I
further suggest that the development and maintenance of this standard
ontology become one of PSI's highest priorities.
Philip Doggett
(The above comments are mine alone and do not necessarily reflect the
views of Proteome Systems Limited.)