Thread: [Psidev-ms-dev] Why base64?

Hi,

Does anyone know why base64 encoding is being used for peak mz and
intensity values in the mzData format?  It appears to me that there are
three significant disadvantages to doing so:

1.  Loss of readability.  One of the primary reasons to use XML in the
first place is that it is human-readable--one can in principle inspect
and understand its contents with any text editor.  Base64-encoding peak
data destroys this transparency.  (It also makes it more difficult to
write scripts to process the data.)

2.  Increased file size.  At least for our spectra, it appears that a
compressed (gzip/etc) ms2 file is about 15% smaller than the equivalent
mzData file with the single-precision (32-bit) encoding, and 22% smaller
than the double-precision version.  The *uncompressed* single-precision
mzData file is about 15% smaller than the uncompressed ms2 file;
the double-precision version is almost twice as large.  (These figures
are for 'gzip' default compression.)

(Currently our ms2 files have mz values rounded to one decimal place and
intensity values with about 4-5 significant places.)

3.  Potential loss of precision information.  For example, with
single-precision encoding, a value originally given as 12345.1 might be
encoded as 12345.0996.  It's not easy to see from that encoding that the
original value was given with one decimal place.  Worse still, if the
original value is significant to more than 7 or so digits and it gets
32-bit encoded, precision will be lost, probably in a way not
immediately apparent to the user.  (32-bit encoding will probably be a
temptation, given the size of the 64-bit encoding.)
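
To make the precision point concrete, here is a minimal sketch of the
round trip through a 32-bit float, using only Python's standard library.
The helper name roundtrip_float32 is made up for this illustration, and
little-endian byte order is an assumption; nothing here is taken from an
actual mzData reader or writer.

```python
import struct

def roundtrip_float32(value: float) -> float:
    """Store a value as an IEEE 754 single-precision float and read it back."""
    # "<f" = little-endian 32-bit float (byte order is assumed here).
    return struct.unpack("<f", struct.pack("<f", value))[0]

# A value given with one decimal place comes back with spurious digits.
original = 12345.1
stored = roundtrip_float32(original)   # 12345.099609375

# A value with more than about 7 significant digits is silently altered.
precise = 12345.6789
degraded = roundtrip_float32(precise)
error = abs(precise - degraded)        # on the order of 2e-4

if __name__ == "__main__":
    print(original, "->", stored)
    print(precise, "->", degraded, "(error %.2e)" % error)
```

The stored value no longer says how many decimal places the source
had, and the second case loses digits without any warning.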

Even if base64-encoding cannot be dropped at this point, it seems like
it would be useful to add a "no encode" option, which would present peak
data as the obvious whitespace-separated list of numeric values.
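
As a rough sketch of the two representations being compared, the
following Python packs a peak list as base64-encoded floats and also
writes it as a plain whitespace-separated list.  The function names
(encode_peaks, decode_peaks, plain_peaks) are hypothetical, and the
little-endian packing is an assumption for illustration, not a
statement of what any mzData tool does.

```python
import base64
import struct

def encode_peaks(values, fmt="f"):
    """Pack floats ('f' = 32-bit, 'd' = 64-bit) and base64-encode them."""
    raw = struct.pack("<%d%s" % (len(values), fmt), *values)
    return base64.b64encode(raw).decode("ascii")

def decode_peaks(text, fmt="f"):
    """Reverse encode_peaks: base64-decode, then unpack the floats."""
    raw = base64.b64decode(text)
    count = len(raw) // struct.calcsize("<" + fmt)
    return list(struct.unpack("<%d%s" % (count, fmt), raw))

def plain_peaks(values):
    """The proposed "no encode" form: whitespace-separated numbers."""
    return " ".join(repr(v) for v in values)

if __name__ == "__main__":
    mz = [100.5, 200.25, 300.0]
    print(encode_peaks(mz))        # opaque in a text editor
    print(encode_peaks(mz, "d"))   # twice as long for 64-bit values
    print(plain_peaks(mz))         # readable, and easy to split in a script
```

Base64 emits 4 characters for every 3 bytes, so each 32-bit value costs
about 5.3 characters and each 64-bit value about 10.7, regardless of how
many digits the original number had.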

Am I missing something here?  I could not find any discussion of this
issue on the list.

--Mike

Mike Coleman, Scientific Programmer, +1 816 926 4419
Stowers Institute for Biomedical Research
1000 E. 50th St., Kansas City, MO  64110,  USA
