From: Angel P. <an...@ma...> - 2006-09-19 21:38:56
Hi Mike,

I have some answers that may or may not address all of your concerns.

On Tuesday 19 September 2006 16:39, Coleman, Michael wrote:
> Hi,
>
> Does anyone know why base64 encoding is being used for peak mz and
> intensity values in the mzData format? It appears to me that there are
> three significant disadvantages to doing so:
>
> 1. Loss of readability. One of the primary reasons to use XML in the
> first place is that it is human-readable--one can in principle inspect
> and understand its contents with any text editor. Base64-encoding peak
> data destroys this transparency. (It also makes it more difficult to
> write scripts to process the data.)

There actually is a space for "human readable spectra" in the mzData format, but really, who reads individual mz and intensity values? The situation is akin to microarray data: does anyone really need to see each individual probe value? The normal usage of this data is to load the entire result set into a processing or search algorithm, or to turn it into a nice spectrum graph. All of that is handled by software, which has no problem decoding the strings.

>
> 2. Increased file size. At least for our spectra, it appears that a
> compressed (gzip/etc) ms2 file is about 15% smaller than the equivalent
> mzData file with the single-precision (32-bit) encoding, and 22% smaller
> than the double-precision version. The *uncompressed* single-precision
> mzData file is about 15% smaller than the uncompressed ms2 file;
> the double-precision version is almost twice as large. (These figures
> are for 'gzip' default compression.)
>

Not a fair comparison. Most of the space in an mzData file is actually taken up by the human-readable parameters and parameter values of the spectra. I'll have to do some tests to see the actual space taken by the spectra themselves, but my "feeling" is that the binary packing plus base64 encoding is actually a more compact representation of the data than gzipped XML with space-delimited floats.

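The per-value size question can be checked on a small scale before running tests on real files. The following is only a minimal sketch, not a benchmark: the m/z values are made up for illustration, little-endian packing is an assumption, and it measures uncompressed character counts only, so it says nothing about how either form behaves under gzip.

```python
import base64
import struct

# Hypothetical m/z values, made up for illustration (not real spectra).
mz_values = [445.1, 446.1, 512.3, 678.9, 1024.5]

def encode_base64(values, code):
    """Pack floats as little-endian binary ('f' = 32-bit, 'd' = 64-bit),
    then base64-encode the bytes. Little-endian is an assumption here."""
    raw = struct.pack("<%d%s" % (len(values), code), *values)
    return base64.b64encode(raw).decode("ascii")

def encode_text(values, places=1):
    """Whitespace-separated decimal text, rounded to `places` decimals."""
    return " ".join("%.*f" % (places, v) for v in values)

single = encode_base64(mz_values, "f")   # 32-bit precision
double = encode_base64(mz_values, "d")   # 64-bit precision
text = encode_text(mz_values)            # one decimal place, as in ms2

print(len(single), len(double), len(text))   # 28 56 30
```

For values rounded to one decimal place the two forms come out close (28 characters for 32-bit base64 versus 30 for text), while the 64-bit form is roughly double. Text grows with every extra digit kept, whereas the base64 size is fixed per value.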
> (Currently our ms2 files have mz values rounded to one decimal place and
> intensity values with about 4-5 significant places.)
>
> 3. Potential loss of precision information. For example, with
> single-precision encoding, a value originally given as 12345.1 might be
> encoded as 12345.0996. It's not easy to see from that encoding that the
> original value was given with one decimal place. Worse-still, if the
> original value is significant to more than 7-or-so digits and it gets
> 32-bit encoded, precision will be lost, probably in a way not
> immediately apparent to the user. (32-bit encoding will probably be a
> temptation, given the size of the 64-bit encoding.)

Actually, the situation may be reversed. Thermofinnigan, for example, stores measured values coming off of the instrument as double-precision floats, and only later formats the numbers as needed with respect to the specific instrument's limit of detection. 12345.1 may have originally been 12345.099923123 in the vendor's proprietary format.

>
> Even if base64-encoding cannot be dropped at this point, it seems like
> it would be useful to add a "no encode" option, which would present peak
> data as the obvious whitespace-separated list of numeric values.
>

See my remark above about who really needs to see the raw numbers. I wrote an email a few days ago showing how to decode the base64 arrays in Ruby, and there is also a Java example posted with the mzData specification.

> Am I missing something here? I could not find any discussion of this
> issue on the list.
>
> --Mike
>
>
> Mike Coleman, Scientific Programmer, +1 816 926 4419
> Stowers Institute for Biomedical Research
> 1000 E. 50th St., Kansas City, MO 64110, USA
>
> _______________________________________________
> Psidev-ms-dev mailing list
> Psi...@li...
> https://lists.sourceforge.net/lists/listinfo/psidev-ms-dev

--
Angel Pizarro
Director, Bioinformatics Facility
Institute for Translational Medicine and Therapeutics
University of Pennsylvania
806 BRB II/III
421 Curie Blvd.
Philadelphia, PA 19104-6160

P: 215-573-3736
F: 215-573-9004
E: an...@ma...
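
For readers who want to see the decoding step and the precision effect from point 3 concretely, here is a minimal sketch in Python. It is not the Ruby or Java example mentioned above. The function name `decode_peaks` and its `precision` and `endian` parameters are hypothetical, chosen for illustration, and the input is a single value packed locally rather than data taken from a real mzData file.

```python
import base64
import struct

def decode_peaks(b64_text, precision=32, endian="little"):
    """Decode a base64 string of packed floats into a list of Python floats.
    `precision` (32 or 64) and `endian` are hypothetical parameters that the
    caller is assumed to know from the file's own metadata."""
    raw = base64.b64decode(b64_text)
    width = 4 if precision == 32 else 8      # bytes per value
    code = "f" if precision == 32 else "d"   # struct type code
    order = "<" if endian == "little" else ">"
    count = len(raw) // width
    return list(struct.unpack("%s%d%s" % (order, count, code), raw))

# Round-trip the value from point 3 through 32-bit and 64-bit encodings.
as_single = base64.b64encode(struct.pack("<f", 12345.1)).decode("ascii")
as_double = base64.b64encode(struct.pack("<d", 12345.1)).decode("ascii")

single_value = decode_peaks(as_single, precision=32)[0]
double_value = decode_peaks(as_double, precision=64)[0]

print("%.4f" % single_value)      # 12345.0996
print(double_value == 12345.1)    # True
```

The 32-bit round trip gives 12345.0996, the figure quoted in point 3, while the 64-bit round trip returns the original double unchanged. Neither encoding records how many decimal places the source value was written with.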