Re: [Psidev-ms-dev] Why base64?


On Tuesday 19 September 2006 18:58, Coleman, Michael wrote:
>
> I'm glad to hear that.  I looked for this, but I did not see it in the
> spec here
>
> http://psidev.sourceforge.net/ms/xml/mzdata/mzdata.html#element_mzData
>
I am cringing as I write this, since I really think you should not go this 
route, but look at the supplementary data tags.

>
> Well--I do.  As a programmer I don't think it's an exaggeration to say
> that I'm looking at the peak lists in our ms2 files every day.  I find
> being able to see at a glance that the peaks are basically sane, and
> their gross attributes (precision, count, etc.) very useful.

Ah, yes, but most probably you have to either zcat or unzip the file in 
order to read the floats, then zip the whole file back up once finished, a 
situation not unlike decoding byte arrays and base64 strings.
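
To make the comparison concrete, here is a minimal Python sketch of the kind 
of decode script being discussed. It assumes the peak list is a base64 string 
of packed 32- or 64-bit floats with a known byte order; the function name 
decode_peaks and the sample values are made up for illustration and are not 
part of any mzData tool.

```python
import base64
import struct


def decode_peaks(b64_text, precision=32, endian="little"):
    """Decode a base64 string of packed floats into a list of Python floats.

    precision and endian are assumed to be known from the surrounding
    metadata; they are plain parameters here.
    """
    raw = base64.b64decode(b64_text)
    code = "f" if precision == 32 else "d"      # 4-byte or 8-byte IEEE float
    order = "<" if endian == "little" else ">"  # struct byte-order prefix
    count = len(raw) // struct.calcsize(code)
    return list(struct.unpack(f"{order}{count}{code}", raw))


# Made-up m/z values, chosen to be exactly representable as 32-bit floats.
sample = base64.b64encode(struct.pack("<3f", 100.5, 200.25, 300.125)).decode("ascii")

# Print one value per line so the peak list can be checked by eye.
for mz in decode_peaks(sample):
    print(f"{mz:.4f}")
```

Piping a peak list through something like this is roughly the same amount of 
ceremony as zcat on a compressed text file.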

>
> Of course, as a programmer I can easily whip up a script to decode this
> file format.  I suspect most users would be stymied, though, and I think
> that that would be unfortunate.  Since these files are part of a chain
> of scientific argument, I think that as much as possible they ought to
> be transparent and as open as possible to verification by eyeball (mine
> and those of our scientists) and alternative pieces of software.
>

This is really where mzData has failed the end user: in the set of tools 
that support it. Even basic marshal/unmarshal scripts are lacking. The 
"Specify it and they will come" approach to development hasn't panned out 
for us, sadly, so I am starting a development cycle here at UPenn to address 
these needs. Specifically, a reasonably fast Ruby framework for dealing with 
mzData (akin to some aspects of the TPP), based initially on some code 
written by John Prince @ UTexas, called mspire.

> Sorry, I should have been clearer.  The numbers I gave were just for the
> peak lists (base64 vs text) and nothing else--no tags, no other
> metadata.  The rest of the mzData fields would add more overhead, but I
> have no objection about that part.
>
> If we implemented mzData here today, our files would be bigger if we
> used the base64 encoding than if we used the textual numbers (as they
> are in our ms2 files).

Point taken. See Brian Pratt's responses as to why base64 is the way both 
mzData and mzXML are going (irrespective of the planned merge of the 
formats). I'll add to those arguments that we should look at the 
computational costs of un/zipping whole files as opposed to stream 
en/decoding individual mzData spectra.
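
A rough Python sketch of the two access patterns, for anyone who wants to 
time them on their own data. Everything in it is an assumption made for 
illustration: the synthetic spectra, the one-spectrum-per-line text layout, 
and the helper names. A real reader would also need some way to locate a 
given spectrum inside the XML, which this sketch replaces with a list lookup.

```python
import base64
import gzip
import struct

# Synthetic data: 1000 tiny "spectra" of three made-up values each.
spectra = [[100.0 + i, 200.0 + i, 300.0 + i] for i in range(1000)]

# Pattern A: each spectrum is base64-encoded on its own.
encoded = [base64.b64encode(struct.pack(f"<{len(s)}d", *s)) for s in spectra]


def read_spectrum_b64(index):
    """Decode only the requested spectrum; the others are never touched."""
    raw = base64.b64decode(encoded[index])
    return list(struct.unpack(f"<{len(raw) // 8}d", raw))


# Pattern B: all spectra as text, one per line, gzipped as a single file.
text = "\n".join(" ".join(repr(v) for v in s) for s in spectra)
zipped = gzip.compress(text.encode("ascii"))


def read_spectrum_gzip(index):
    """Decompress the whole file, then pick out the requested line."""
    lines = gzip.decompress(zipped).decode("ascii").split("\n")
    return [float(v) for v in lines[index].split()]


print(read_spectrum_b64(500))
print(read_spectrum_gzip(500))
```

Wrapping each function in timeit on realistic files would give the actual 
costs; the sketch only shows where the work goes in each case.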

>
> > > 3.  Potential loss of precision information.  ...

Brian Pratt addressed these issues much more eloquently than I could in his reply.

>
> Just to reiterate my main question, it looks like using base64 will make
> mzData less usable and more complex, as compared to straight text.  What
> benefits come with it that offset these drawbacks?

1) It can encode arrays of integers and of single- or double-precision 
floats without loss of information.
2) Compression is comparable to zipped plain text of the same precision.
3) Performance is better when accessing individual spectra, compared with 
compressed plain text.
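
As a small illustration of point 1, the following Python sketch round-trips 
arrays through base64 and checks that the values come back bit for bit. The 
encode/decode helpers and the sample values are hypothetical, and 
little-endian byte order is assumed throughout.

```python
import base64
import struct


def encode(values, code="d"):
    """Pack values with the given struct type code and base64-encode them."""
    packed = struct.pack(f"<{len(values)}{code}", *values)
    return base64.b64encode(packed).decode("ascii")


def decode(text, code="d"):
    """Reverse of encode(): base64-decode and unpack into a list."""
    raw = base64.b64decode(text)
    count = len(raw) // struct.calcsize(code)
    return list(struct.unpack(f"<{count}{code}", raw))


mz = [445.12003, 0.1, 1234.56789012345]  # made-up double-precision values
counts = [1, 2, 70000]                   # made-up 32-bit integers

# Doubles and integers survive the round trip exactly.
print(decode(encode(mz)) == mz)
print(decode(encode(counts, "i"), "i") == counts)

# Three doubles are 24 bytes, and base64 emits 4 characters per 3 bytes.
print(len(encode(mz)))  # 32
```

Note that packing with code "f" rounds each value to the nearest 32-bit 
float first, so single precision is lossless only with respect to values 
that were already single precision.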

-angel