From: Randy J. <rkj...@in...> - 2006-09-20 14:13:06
This is a very interesting question which has come up several times before. As we work to develop dataXML (mzData 2.0) we should take all of these concerns into consideration.

Originally, mzData had both a binary and a regular XML notation for both data vectors. The XML-schema data types were tested by most of the vendors, who did not see the file-size benefits you mention because they did not feel they had the ability to round either of the vectors in the way you suggest. Since the use case "user opens mzData file with Notepad and sees peaks" was not viewed as a major request, the vendors unanimously voted the non-binary arrays out for size and performance reasons (see the meeting notes from the PSI meeting in Nice). The loss of readability may now have larger consequences than we considered back then. Steve Stein's comments are good ones. If we now have broad enough adoption that we want to be able to open the file and see the numbers written out in XML, then we should reconsider the validity of that use case.

To do this with mzData 1.05 you would have to use the supplemental data vector (the alternative Angel suggested). The supplemental data vectors hold any XSD data type, including normal XML. However, in mzData 1.05 the binary vectors are not optional, so you have to populate them to comply with the spec - even if you repeat the information in the supplemental vector. The suggested "whitespace-separated list" is not a valid XML data type, so if we want to keep with the XSD standard for validation, the peak lists have to be in markup like:

    <peak>
      <mz><float>0.1</float></mz>
      <inten><float>100.1</float></inten>
    </peak>

or something similar. Other semantics could reduce the verbosity, but the basic idea is that we can only use valid XSD data types.

As we move to dataXML, we will need to store other data objects besides mass spectra (MRM chromatograms, for example), so we will have to come up with a more general data section regardless of the data types allowed. During this design phase we should decide what data types we want.

As a historical note, the previous (current) LC-MS standard format uses netCDF as the data representation, which is fully binary and utterly unreadable without an API. So this situation has existed in mass spectrometry for quite some time. The readability of these files has never been viewed as a serious weakness, although the 1.5-2x increase in file size over the original vendor file was a source of constant complaint.

Just as a note on your comment #3: this is not so straightforward. If the instrument collects data using an Intel chip, the floating-point raw data will most likely have an IEEE-754 representation. So any time you have a number like 0.1 in a file, the internal representation was originally different (0.1 cannot be exactly represented in IEEE-754), and when you read it back into an IEEE format it will not be 0.1 in any of the math you do.

Let the PSI-MS team know what requirements you would like the HUPO standards to meet. If there is strong user support for missing features, the team will include them in the development roadmap. Let's keep the discussion of improvements going!

Randy
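[A minimal sketch in Python of the representability point above; the 32-bit packing shown here is just one assumed way to reproduce the single-precision round trip being discussed, not anything prescribed by mzData.]

    # Sketch of the representability issue: 0.1 has no exact binary
    # representation, and a single-precision (32-bit) round trip of a
    # value such as 12345.1 does not give back the original decimal string.
    import struct

    # The nearest IEEE-754 double to 0.1, printed to 20 decimal places:
    print(f"{0.1:.20f}")        # 0.10000000000000000555...

    # Round-trip 12345.1 through a little-endian 32-bit float, roughly
    # what a single-precision binary array encode/decode would do:
    packed = struct.pack('<f', 12345.1)
    value32 = struct.unpack('<f', packed)[0]
    print(value32)              # 12345.099609375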
-----Original Message-----
From: psi...@li... [mailto:psi...@li...] On Behalf Of Coleman, Michael
Sent: Tuesday, September 19, 2006 4:39 PM
To: psi...@li...
Subject: [Psidev-ms-dev] Why base64?

Hi,

Does anyone know why base64 encoding is being used for peak mz and intensity values in the mzData format? It appears to me that there are three significant disadvantages to doing so:

1. Loss of readability. One of the primary reasons to use XML in the first place is that it is human-readable--one can in principle inspect and understand its contents with any text editor. Base64-encoding peak data destroys this transparency. (It also makes it more difficult to write scripts to process the data.)

2. Increased file size. At least for our spectra, a compressed (gzip/etc.) ms2 file is about 15% smaller than the equivalent mzData file with the single-precision (32-bit) encoding, and 22% smaller than the double-precision version. The *uncompressed* single-precision mzData file is about 15% smaller than the uncompressed ms2 file; the double-precision version is almost twice as large. (These figures are for gzip default compression.) (Currently our ms2 files have mz values rounded to one decimal place and intensity values with about 4-5 significant places.)

3. Potential loss of precision information. For example, with single-precision encoding, a value originally given as 12345.1 might be encoded as 12345.0996. It is not easy to see from that encoding that the original value was given with one decimal place. Worse still, if the original value is significant to more than 7 or so digits and it gets 32-bit encoded, precision will be lost, probably in a way not immediately apparent to the user. (32-bit encoding will probably be a temptation, given the size of the 64-bit encoding.)

Even if base64 encoding cannot be dropped at this point, it seems like it would be useful to add a "no encode" option, which would present peak data as the obvious whitespace-separated list of numeric values.

Am I missing something here? I could not find any discussion of this issue on the list.

--Mike

Mike Coleman, Scientific Programmer, +1 816 926 4419
Stowers Institute for Biomedical Research
1000 E. 50th St., Kansas City, MO 64110, USA
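[A rough Python illustration of the trade-off discussed above; the values and the choice of 32-bit little-endian packing are assumptions for the sake of the example, not taken from an actual mzData file.]

    # Contrast the two peak-list representations: a base64-encoded
    # little-endian float array versus a plain whitespace-separated list.
    import base64
    import struct

    mz_values = [445.12, 445.35, 446.10]

    # Base64 route: pack as 32-bit floats, then encode. The result is
    # opaque in a text editor, and decoding requires knowing the
    # precision and byte order up front.
    packed = struct.pack(f'<{len(mz_values)}f', *mz_values)
    b64_text = base64.b64encode(packed).decode('ascii')

    raw = base64.b64decode(b64_text)
    decoded = struct.unpack(f'<{len(raw) // 4}f', raw)
    # decoded values carry float32 rounding, e.g. 445.12 comes back as
    # 445.1199951171875 rather than 445.12.

    # "No encode" route: readable with any text editor, trivially parsed,
    # and keeps the decimal places exactly as written.
    plain_text = ' '.join(str(v) for v in mz_values)   # '445.12 445.35 446.1'
    parsed = [float(tok) for tok in plain_text.split()]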