From: Coleman, M. <MK...@St...> - 2006-09-19 22:58:23
> From: Angel Pizarro
> > 1. Loss of readability.
> ...
> There actually is a space for "human readable spectra" in the
> mzData format,

I'm glad to hear that.  I looked for this, but I did not see it in the
spec here:

	http://psidev.sourceforge.net/ms/xml/mzdata/mzdata.html#element_mzData

I was looking for something like 'mzArray' and 'intenArray' tags, which
would be the textual alternatives to 'mzArrayBinary' and
'intenArrayBinary'.  Can you point me to an example?

> but really who reads individual mz and intensity values?

Well--I do.  As a programmer, I don't think it's an exaggeration to say
that I look at the peak lists in our ms2 files every day.  Being able to
see at a glance that the peaks are basically sane, and to check their
gross attributes (precision, count, etc.), is very useful to me.

Of course, as a programmer I can easily whip up a script to decode this
file format.  I suspect most users would be stymied, though, and I think
that would be unfortunate.  Since these files are part of a chain of
scientific argument, they ought to be as transparent as possible, open
to verification by eyeball (mine and our scientists') and by alternative
pieces of software.

I'm not saying that this transparency is an absolute good.  Perhaps it
is worth impairing so that we can have X, Y, and Z, which are considered
more valuable.  I'm not seeing what X, Y, and Z are, though.

> > 2. Increased file size.
> ...
> Not a fair comparison. Most of the space in an mzData file is
> actually taken up by the human-readable parameters and parameter
> values of the spectra.

Sorry, I should have been clearer.  The numbers I gave were just for the
peak lists (base64 vs text) and nothing else--no tags, no other
metadata.  The rest of the mzData fields would add more overhead, but I
have no objection to that part.
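For what it's worth, the comparison is easy to reproduce.  Here's a
minimal Python sketch (the peak values are invented for illustration,
and I'm assuming big-endian IEEE doubles for the packed form):

```python
import base64
import struct

# Hypothetical peak list (m/z values) -- invented for illustration.
peaks = [345.1, 678.92, 1024.5, 1532.08, 1998.7]

# Textual encoding, roughly as in an ms2 file: one number per value.
text = " ".join(str(p) for p in peaks)

# mzData-style binary encoding: pack as big-endian IEEE doubles,
# then base64-encode the resulting byte string.
packed = struct.pack(">%dd" % len(peaks), *peaks)
b64 = base64.b64encode(packed).decode("ascii")

print(len(text), len(b64))  # -> 34 56
```

Five doubles pack to 40 bytes, which base64 inflates to 56 characters,
while the plain text is 34 characters--the binary form only wins when
the textual numbers are long.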
If we implemented mzData here today, our files would be bigger if we
used the base64 encoding than if we used the textual numbers (as they
are in our ms2 files).

> > 3. Potential loss of precision information.
> ...
> Actually the situation may be reversed. Thermofinnigan, for
> example, stores measured values coming off of the instrument
> as double precision floats, later formatting the numbers as
> needed with respect to the specific instrument's limit of
> detection.  12345.1 may have originally been 12345.099923123 in
> the vendor's proprietary format.

Okay, but isn't '12345.1' what I really want to see in this case
(assuming that the vendor is correct about the instrument's accuracy)?
For this particular instance, the string '12345.1' tells me what I need
to know, and a double-precision floating point value (e.g.,
12345.10000000000036379) would at least let me guess it, since double
precision carries significantly more significant figures.  But a
single-precision value would leave me in a gray area: does
'12345.099923123' mean '12345.1', '12345.10', or '12345.100', for
example?

> I wrote an email a few days ago showing how to translate in ruby
> the base64 arrays

I saw it, and it was quite useful to me.  Part of the reason I'm asking
these questions is that I noticed in your examples that the
base64-encoded values actually took more space than the original data.

Just to reiterate my main question: it looks like using base64 will make
mzData less usable and more complex than straight text.  What benefits
come with it that offset these drawbacks?

Mike
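P.S.  To make the precision point concrete, here's a small Python sketch
(standard library only) of what 12345.1 looks like after a round trip
through single and double precision:

```python
import struct

# Round-trip 12345.1 through IEEE single and double precision.
single = struct.unpack(">f", struct.pack(">f", 12345.1))[0]
double = struct.unpack(">d", struct.pack(">d", 12345.1))[0]

print("%.9f" % single)   # -> 12345.099609375
print("%.17f" % double)  # close to 12345.1, with garbage trailing digits
```

The single-precision value is the kind of thing I'd be left squinting
at, while the double at least makes the intended '12345.1' guessable.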