From: Randy J. <rkj...@in...> - 2007-06-23 22:09:49
The goal of suggesting a ZIP file was to keep multiple files together - your suggestion of using XOP would both keep 'files' together and allow seeking directly. I think compression will be a non-issue with a binary representation of the raw data, and XOP is a standard which is likely to be supported with multiple language bindings. Does anyone have experience using XOP? Could we get an example file to begin thinking about what this would look like?

Randy

-----Original Message-----
From: Christopher Mason [mailto:Mas...@ma...]
Sent: Thursday, June 21, 2007 3:00 PM
To: psi...@li...; Randy Julian
Subject: Re: Indexes, binary files, multiple files and compression in mzML

Hello.

From: "Randy Julian" <rkj...@in...>
> Many of us have now had experience with very large data sets
> represented in XML and I think it would be useful to
> seriously consider an alternative representation to the
> embedded base64 data vectors.

I agree that it would be really nice to address the binary/XML issue. Note that combining XML and binary is not a new problem [1].

It's really painful to parse mzXML efficiently (in other words, to retrieve a single scan from a ~gigabyte-sized document with ~10,000 scans). You can't use a standard XML parser, because most don't support arbitrary seeking in the file to take advantage of the index (the exception being Xerces, which has its own issues), and you have to write additional code to deal with the base64/compressed data anyway. Anyone who regularly deals with Orbi/FT data knows how impossible it is to just use plain XML; it's a nice idea, but it just isn't feasible/efficient with today's instruments.

We spent a bunch of time trying to write a simple parser for mzXML and ended up giving up and using RAMP. This is fine, but it makes it difficult for others to use languages/tools that can't use RAMP/JRAMP. There may also be licensing issues.
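[Editor's note: the base64/compressed payload handling mentioned above can be sketched in a few lines of Python. This is a minimal sketch, assuming an mzXML-style encoding (base64 text wrapping big-endian IEEE floats interleaved as m/z, intensity pairs, optionally zlib-compressed); the function name `decode_peaks` is hypothetical, not part of any existing parser.]

```python
import base64
import struct
import zlib

def decode_peaks(b64_text, compressed=False, precision=32):
    """Decode an mzXML-style peak payload: base64 text wrapping
    big-endian (network-order) IEEE floats, interleaved as
    (m/z, intensity) pairs.  Hypothetical helper for illustration."""
    raw = base64.b64decode(b64_text)
    if compressed:
        raw = zlib.decompress(raw)
    width = precision // 8                      # bytes per value
    fmt = ">%d%s" % (len(raw) // width, "f" if precision == 32 else "d")
    values = struct.unpack(fmt, raw)
    # Pair up alternating m/z and intensity values
    return list(zip(values[0::2], values[1::2]))

# Round-trip a couple of made-up peaks (values chosen to be
# exactly representable as 32-bit floats)
peaks = [(400.25, 15000.0), (401.25, 3200.0)]
flat = [v for p in peaks for v in p]
payload = base64.b64encode(struct.pack(">4f", *flat))
print(decode_peaks(payload))  # → [(400.25, 15000.0), (401.25, 3200.0)]
```

Even this small sketch shows why a generic XML toolchain is not enough on its own: the interesting bytes are opaque to the parser, and every consumer must reimplement this decoding step.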
> My recommendation is that we consider creating a purely binary
> representation of the data vectors which can be referenced by a purely
> XML document describing the experiment.

Zip files would work, but you would have to write the individual scans as separate entries in the zip file, as there's no way to seek within a file inside a zip file. I'm not sure what the efficiency of this would be, but it can't be worse than the existing gzipped base64 scans in mzXML. However, you couldn't use standard XML parsing utilities to read the metadata without first extracting the XML file from the zip file.

Why not use an existing standard like XOP [2] or something similar, which combines, in a single file, an initial text-only XML metadata chunk followed by a binary blob for the actual data? The metadata contains references or addresses that index into the binary chunk, relative to the start of the binary chunk. Normal XML tools can parse the XML bits and ignore the binary bits. This would remove the need for a separate index, because the metadata would be small enough to parse completely into memory, and would contain references to the much larger binary data.

You could easily create such a file by writing two files and then combining them, or by padding. One key point would be to use UTF-8 or a similar encoding with a BOM [3] so that legacy FTP clients would transfer the files correctly.

As another idea: you could formulate the references in such a way that it was possible to support both a two-file/side-by-side scheme and a single-file, one-after-the-other scheme.

I'd be happy to help with this; maybe doing some proof-of-concept work...

Thanks for bringing this up,

-c

PS: I've done some benchmarking work on storing compressed spectra in sqlite databases that might be of interest to others.

[1] http://www.xml.com/pub/a/2003/02/26/binaryxml.html
[2] http://www.w3.org/TR/xop10/
[3] http://unicode.org/faq/utf_bom.html#22
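[Editor's note: the "XML metadata chunk followed by a binary blob" layout described above can be made concrete with a short Python sketch. This is not actual XOP packaging (XOP proper wraps the parts in a MIME Multipart/Related envelope); it is a minimal proof-of-concept of the layout the mail describes, where scan offsets in the XML are relative to the start of the binary chunk. The sentinel, element names, and attribute names are all made up for illustration.]

```python
import struct
import xml.etree.ElementTree as ET

# Hypothetical marker separating the XML chunk from the binary chunk.
SENTINEL = b"\n<!--END-XML-->\n"

def write_combined(path, scans):
    """scans: {scan_number: payload_bytes}.  Writes UTF-8 XML metadata,
    the sentinel, then all scan payloads concatenated into one blob;
    each <scan> records its offset relative to the blob's start."""
    root = ET.Element("run")
    blob = b""
    for num, payload in scans.items():
        ET.SubElement(root, "scan", num=str(num),
                      offset=str(len(blob)), length=str(len(payload)))
        blob += payload
    with open(path, "wb") as f:
        f.write(ET.tostring(root, encoding="utf-8"))
        f.write(SENTINEL)
        f.write(blob)

def read_scan(path, num):
    """Parse only the XML header with a normal XML tool, then seek
    directly to the requested scan's bytes."""
    with open(path, "rb") as f:
        head = f.read()  # read-all for brevity; a real reader would stream
        xml_part, _, _ = head.partition(SENTINEL)
        base = len(xml_part) + len(SENTINEL)   # start of the binary chunk
        root = ET.fromstring(xml_part)
        for scan in root.iter("scan"):
            if scan.get("num") == str(num):
                f.seek(base + int(scan.get("offset")))
                return f.read(int(scan.get("length")))
    return None

# Write two dummy scans, then fetch one by number without decoding the rest
write_combined("combined.demo", {1: b"\x00\x01\x02", 2: b"scan-two-payload"})
print(read_scan("combined.demo", 2))  # → b'scan-two-payload'
```

Because the offsets are relative to the start of the binary chunk rather than the file, the same metadata would also work in the two-file/side-by-side scheme mentioned above: the reader only needs to know where the binary chunk begins.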