From: Geer, L. (NIH/NLM/NCBI) [E] <le...@nc...> - 2006-09-20 14:27:04
|
Hi,

XML-schema does allow space delimited lists:

<xsd:simpleType name="listOfMyIntType">
  <xsd:list itemType="integer"/>
</xsd:simpleType>

<listOfMyInt>20003 15037 95977 95945</listOfMyInt>

Lewis

> -----Original Message-----
> From: Randy Julian [mailto:rkj...@in...]
> Sent: Wednesday, September 20, 2006 10:12 AM
> To: psi...@li...
> Subject: Re: [Psidev-ms-dev] Why base64?
>
> This is a very interesting question which has come up several times
> before. As we work to develop dataXML (mzData 2.0) we should take all
> of these concerns into consideration.
>
> Originally, mzData had both a binary and regular XML notation for both
> data vectors. The XML-schema data types were tested by most of the
> vendors who did not see the file size compression benefits you mention
> because they did not feel they had the ability to round either of the
> vectors in the way you suggest. Since the use case: 'user opens mzData
> file with notepad and sees peaks' was not viewed as a major request,
> the vendors unanimously voted the non-binary arrays out for size and
> performance reasons (see the meeting notes from the PSI meeting in
> Nice).
>
> The loss of readability may now have larger consequences than we
> considered back then. Steve Stein's comments are good ones. If we now
> have broad enough adoption that we want to be able to open the file
> and see the numbers written out in XML, then we should reconsider the
> validity of the use case. To do this with mzData 1.05 you would have
> to use the supplemental data vector (the alternative Angel suggested).
>
> The supplemental data vectors hold any type of XSD data type including
> normal XML. However in mzData 1.05, the binary vectors are not
> optional, so you have to populate them to comply with the spec - even
> if you repeat the information in the supplemental vector.
>
> The suggested 'white space separated list' is not a valid XML data
> type, so if we want to keep with the XSD standard for validation, the
> peak lists have to be in markup like:
>
> <peak>
>   <mz>
>     <float>0.1</float>
>   </mz>
>   <inten>
>     <float>100.1</float>
>   </inten>
> </peak>
>
> or something similar. Other semantics could reduce the verbosity, but
> the basic idea is that we can only use valid XSD data types.
>
> As we move to dataXML, we will need to store other data objects
> besides mass spectra (MRM chromatograms for example), so we will have
> to come up with a more general data section regardless of the data
> types allowed. During this design phase we should decide what data
> types we want.
>
> As a historical note, the previous (current) LC-MS standard format
> uses netCDF as the data representation which is fully binary and
> utterly unreadable in any respect without an API. Thus this situation
> has existed in mass spectrometry for quite some time. The readability
> of these files has never been viewed as a serious weakness, although
> the 1.5-2x increase in file size over the original vendor file was the
> source of constant complaint.
>
> Just as a note for your comment #3, this is not so straightforward. If
> the instrument collects data using an Intel chip, floating-point raw
> data will most likely have an IEEE-754 representation. So any time you
> have a number in a file like 0.1, the internal representation was
> originally different (0.1 cannot be exactly represented in IEEE-754).
> When you read from the file into an IEEE standard format, it will not
> be 0.1 in any of the math you do.
>
> Let the PSI-MS team know what requirements you would like to see the
> HUPO standards meet. If there is strong user support for missing
> features, the team will include them in the development roadmap.
>
> Let's keep the discussion of improvements going!
>
> Randy
>
>
> -----Original Message-----
> From: psi...@li...
> [mailto:psi...@li...] On Behalf Of Coleman, Michael
> Sent: Tuesday, September 19, 2006 4:39 PM
> To: psi...@li...
> Subject: [Psidev-ms-dev] Why base64?
>
> Hi,
>
> Does anyone know why base64 encoding is being used for peak mz and
> intensity values in the mzData format? It appears to me that there are
> three significant disadvantages to doing so:
>
> 1. Loss of readability. One of the primary reasons to use XML in the
> first place is that it is human-readable--one can in principle inspect
> and understand its contents with any text editor. Base64-encoding peak
> data destroys this transparency. (It also makes it more difficult to
> write scripts to process the data.)
>
> 2. Increased file size. At least for our spectra, it appears that a
> compressed (gzip/etc) ms2 file is about 15% smaller than the
> equivalent mzData file with the single-precision (32-bit) encoding,
> and 22% smaller than the double-precision version. The *uncompressed*
> single-precision mzData file is about 15% smaller than the
> uncompressed ms2 file; the double-precision version is almost twice as
> large. (These figures are for 'gzip' default compression.)
>
> (Currently our ms2 files have mz values rounded to one decimal place
> and intensity values with about 4-5 significant places.)
>
> 3. Potential loss of precision information. For example, with
> single-precision encoding, a value originally given as 12345.1 might
> be encoded as 12345.0996. It's not easy to see from that encoding that
> the original value was given with one decimal place. Worse still, if
> the original value is significant to more than 7-or-so digits and it
> gets 32-bit encoded, precision will be lost, probably in a way not
> immediately apparent to the user.
> (32-bit encoding will probably be a temptation, given the size of the
> 64-bit encoding.)
>
> Even if base64-encoding cannot be dropped at this point, it seems like
> it would be useful to add a "no encode" option, which would present
> peak data as the obvious whitespace-separated list of numeric values.
>
> Am I missing something here? I could not find any discussion of this
> issue on the list.
>
> --Mike
>
>
> Mike Coleman, Scientific Programmer, +1 816 926 4419
> Stowers Institute for Biomedical Research
> 1000 E. 50th St., Kansas City, MO 64110, USA
>
> -------------------------------------------------------------------------
> Take Surveys. Earn Cash. Influence the Future of IT
> Join SourceForge.net's Techsay panel and you'll get the chance to share your
> opinions on IT & business topics through brief surveys -- and earn cash
> http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
> _______________________________________________
> Psidev-ms-dev mailing list
> Psi...@li...
> https://lists.sourceforge.net/lists/listinfo/psidev-ms-dev
|
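For anyone who wants to check the numbers discussed in this thread, here is a minimal Python sketch. It is not taken from any mzData tool: the helper names encode_peaks_base64 and decode_peaks_base64 are hypothetical, and little-endian byte order is an assumption made only for illustration. It round-trips a small peak list through base64-encoded IEEE-754 floats, reproduces the 12345.1 example from comment #3, and parses the same data from a whitespace-separated list.

```python
import base64
import struct
from decimal import Decimal


def encode_peaks_base64(values, fmt="f"):
    """Pack values as IEEE-754 floats ("f" = 32-bit, "d" = 64-bit) and base64 them.

    Little-endian ("<") is an assumption made for this sketch.
    """
    raw = struct.pack("<%d%s" % (len(values), fmt), *values)
    return base64.b64encode(raw).decode("ascii")


def decode_peaks_base64(text, fmt="f"):
    """Inverse of encode_peaks_base64 (hypothetical helper, same assumptions)."""
    raw = base64.b64decode(text)
    count = len(raw) // struct.calcsize("<" + fmt)
    return list(struct.unpack("<%d%s" % (count, fmt), raw))


mz = [12345.1, 0.1]

# 32-bit round trip: the value comes back as the nearest single-precision float.
single = decode_peaks_base64(encode_peaks_base64(mz, "f"), "f")
print(single[0])  # 12345.099609375

# 64-bit round trip: Python floats are already doubles, so nothing changes.
double = decode_peaks_base64(encode_peaks_base64(mz, "d"), "d")
print(double == mz)  # True

# Even as a double, 0.1 is not stored exactly.
print(Decimal(0.1) == Decimal("0.1"))  # False

# The whitespace-separated alternative keeps the digits as they were written.
as_text = " ".join(repr(v) for v in mz)
parsed = [float(token) for token in as_text.split()]
print(as_text, parsed == mz)  # 12345.1 0.1 True
```

The text form preserves how many decimal places were written, which is the precision information comment #3 is concerned about, while the binary form preserves the exact bit pattern the instrument software produced.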