From: Randy J. <rkj...@in...> - 2006-09-20 14:13:06
This is a very interesting question which has come up several times before. As we work to develop dataXML (mzData 2.0) we should take all of these concerns into consideration.

Originally, mzData had both a binary and a regular XML notation for both data vectors. The XML-schema data types were tested by most of the vendors, who did not see the file-size benefits you mention because they did not feel they had the ability to round either of the vectors in the way you suggest. Since the use case "user opens mzData file with Notepad and sees peaks" was not viewed as a major request, the vendors unanimously voted the non-binary arrays out for size and performance reasons (see the meeting notes from the PSI meeting in Nice). The loss of readability may now have larger consequences than we considered back then. Steve Stein's comments are good ones. If we now have broad enough adoption that we want to be able to open the file and see the numbers written out in XML, then we should reconsider the validity of that use case.

To do this with mzData 1.05 you would have to use the supplemental data vector (the alternative Angel suggested). The supplemental data vectors hold any XSD data type, including normal XML. However, in mzData 1.05 the binary vectors are not optional, so you have to populate them to comply with the spec - even if you repeat the information in the supplemental vector. The suggested "whitespace-separated list" is not a valid XML data type, so if we want to keep with the XSD standard for validation, the peak lists have to be in markup like:

    <peak>
      <mz><float>0.1</float></mz>
      <inten><float>100.1</float></inten>
    </peak>

or something similar. Other semantics could reduce the verbosity, but the basic idea is that we can only use valid XSD data types.

As we move to dataXML, we will need to store other data objects besides mass spectra (MRM chromatograms, for example), so we will have to come up with a more general data section regardless of the data types allowed. During this design phase we should decide what data types we want.

As a historical note, the previous (current) LC-MS standard format uses netCDF as the data representation, which is fully binary and utterly unreadable without an API. So this situation has existed in mass spectrometry for quite some time. The readability of these files has never been viewed as a serious weakness, although the 1.5-2x increase in file size over the original vendor file was a source of constant complaint.

Just as a note on your comment #3: this is not so straightforward. If the instrument collects data using an Intel chip, the floating-point raw data will most likely have an IEEE-754 representation. So any time you have a number like 0.1 in a file, the internal representation was originally different (0.1 cannot be exactly represented in IEEE-754), and when you read it back into an IEEE format it will not be 0.1 in any of the math you do.

Let the PSI-MS team know what requirements you would like the HUPO standards to meet. If there is strong user support for missing features, the team will include them in the development roadmap. Let's keep the discussion of improvements going!

Randy
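[A minimal sketch in Python of the representability point above; the 32-bit packing shown here is just one assumed way to reproduce the single-precision round trip being discussed, not anything prescribed by mzData.]

    # Sketch of the representability issue: 0.1 has no exact binary
    # representation, and a single-precision (32-bit) round trip of a
    # value such as 12345.1 does not give back the original decimal string.
    import struct

    # The nearest IEEE-754 double to 0.1, printed to 20 decimal places:
    print(f"{0.1:.20f}")        # 0.10000000000000000555...

    # Round-trip 12345.1 through a little-endian 32-bit float, roughly
    # what a single-precision binary array encode/decode would do:
    packed = struct.pack('<f', 12345.1)
    value32 = struct.unpack('<f', packed)[0]
    print(value32)              # 12345.099609375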
-----Original Message-----
From: psi...@li... [mailto:psi...@li...] On Behalf Of Coleman, Michael
Sent: Tuesday, September 19, 2006 4:39 PM
To: psi...@li...
Subject: [Psidev-ms-dev] Why base64?

Hi,

Does anyone know why base64 encoding is being used for peak mz and intensity values in the mzData format? It appears to me that there are three significant disadvantages to doing so:

1. Loss of readability. One of the primary reasons to use XML in the first place is that it is human-readable--one can in principle inspect and understand its contents with any text editor. Base64-encoding peak data destroys this transparency. (It also makes it more difficult to write scripts to process the data.)

2. Increased file size. At least for our spectra, a compressed (gzip/etc.) ms2 file is about 15% smaller than the equivalent mzData file with the single-precision (32-bit) encoding, and 22% smaller than the double-precision version. The *uncompressed* single-precision mzData file is about 15% smaller than the uncompressed ms2 file; the double-precision version is almost twice as large. (These figures are for gzip default compression.) (Currently our ms2 files have mz values rounded to one decimal place and intensity values with about 4-5 significant places.)

3. Potential loss of precision information. For example, with single-precision encoding, a value originally given as 12345.1 might be encoded as 12345.0996. It is not easy to see from that encoding that the original value was given with one decimal place. Worse still, if the original value is significant to more than 7 or so digits and it gets 32-bit encoded, precision will be lost, probably in a way not immediately apparent to the user. (32-bit encoding will probably be a temptation, given the size of the 64-bit encoding.)

Even if base64 encoding cannot be dropped at this point, it seems like it would be useful to add a "no encode" option, which would present peak data as the obvious whitespace-separated list of numeric values.

Am I missing something here? I could not find any discussion of this issue on the list.

--Mike

Mike Coleman, Scientific Programmer, +1 816 926 4419
Stowers Institute for Biomedical Research
1000 E. 50th St., Kansas City, MO 64110, USA
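[A rough Python illustration of the trade-off discussed above; the values and the choice of 32-bit little-endian packing are assumptions for the sake of the example, not taken from an actual mzData file.]

    # Contrast the two peak-list representations: a base64-encoded
    # little-endian float array versus a plain whitespace-separated list.
    import base64
    import struct

    mz_values = [445.12, 445.35, 446.10]

    # Base64 route: pack as 32-bit floats, then encode. The result is
    # opaque in a text editor, and decoding requires knowing the
    # precision and byte order up front.
    packed = struct.pack(f'<{len(mz_values)}f', *mz_values)
    b64_text = base64.b64encode(packed).decode('ascii')

    raw = base64.b64decode(b64_text)
    decoded = struct.unpack(f'<{len(raw) // 4}f', raw)
    # decoded values carry float32 rounding, e.g. 445.12 comes back as
    # 445.1199951171875 rather than 445.12.

    # "No encode" route: readable with any text editor, trivially parsed,
    # and keeps the decimal places exactly as written.
    plain_text = ' '.join(str(v) for v in mz_values)   # '445.12 445.35 446.1'
    parsed = [float(tok) for tok in plain_text.split()]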