From: Jimmy E. <jk...@gm...> - 2006-09-20 17:12:23
I believe base64 encoding makes more sense for a large class of
applications that will hopefully be digesting these files, but I'm sure
everyone can see the obvious benefits of plain-text encoding of peak
lists.

The question I have is about the representation of space-delimited
lists as Lewis and Randy have drawn it up. Does this address the needs
of Michael, Steve, Akhilesh, and others? Hopefully they'll all chime in.

My concern is that a horizontal, space-separated list of numbers, where
m/z and intensity will possibly be written in separate lists of floats
and ints, doesn't really serve the notion of readability. Lots of folks
are used to looking at lists of peaks as ordered in .mgf or .dta files,
and I'm not sure a horizontal list of numbers (especially if it's two
lists, one for m/z and one for intensity) gives you that same sense of
readability. I don't really see any regular use case where people would
scroll over to the 68th m/z in the list and then somehow count over to
the 68th intensity to get its value.

So _if_ this really doesn't address the needs of the folks who have
concerns about the base64 encoding and would like to see plain text,
speak up. The last thing the format needs is more complexity in the
form of another optional way of representing the data that only a
handful of people will ever end up using.

- Jimmy

On 9/20/06, Randy Julian <rkj...@in...> wrote:
> Hi,
>
> This works quite nicely!
>
> <?xml version="1.0" encoding="UTF-8"?>
> <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
>     elementFormDefault="qualified" attributeFormDefault="unqualified">
>   <xs:element name="root">
>     <xs:complexType>
>       <xs:sequence>
>         <xs:element name="MyList">
>           <xs:simpleType>
>             <xs:list itemType="xs:float"/>
>           </xs:simpleType>
>         </xs:element>
>       </xs:sequence>
>     </xs:complexType>
>   </xs:element>
> </xs:schema>
>
> Validates:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>     xsi:noNamespaceSchemaLocation="list.xsd">
>   <MyList>1.1 1.2 1.3</MyList>
> </root>
>
> Any thoughts about the use of this in the schema?
>
> Randy
>
> -----Original Message-----
> From: Geer, Lewis (NIH/NLM/NCBI) [E] [mailto:le...@nc...]
> Sent: Wednesday, September 20, 2006 10:27 AM
> To: Randy Julian; psi...@li...
> Subject: RE: [Psidev-ms-dev] Why base64?
>
> Hi,
>
> XML-schema does allow space delimited lists:
>
> <xsd:simpleType name="listOfMyIntType">
>   <xsd:list itemType="integer"/>
> </xsd:simpleType>
>
> <listOfMyInt>20003 15037 95977 95945</listOfMyInt>
>
> Lewis
>
> > -----Original Message-----
> > From: Randy Julian [mailto:rkj...@in...]
> > Sent: Wednesday, September 20, 2006 10:12 AM
> > To: psi...@li...
> > Subject: Re: [Psidev-ms-dev] Why base64?
> >
> > This is a very interesting question which has come up several times
> > before. As we work to develop dataXML (mzData 2.0) we should take
> > all of these concerns into consideration.
> >
> > Originally, mzData had both a binary and regular XML notation for
> > both data vectors. The XML-schema data types were tested by most of
> > the vendors, who did not see the file size compression benefits you
> > mention because they did not feel they had the ability to round
> > either of the vectors in the way you suggest.
> > Since the use case 'user opens mzData file with notepad and sees
> > peaks' was not viewed as a major request, the vendors unanimously
> > voted the non-binary arrays out for size and performance reasons
> > (see the meeting notes from the PSI meeting in Nice).
> >
> > The loss of readability may now have larger consequences than we
> > considered back then. Steve Stein's comments are good ones. If we
> > now have broad enough adoption that we want to be able to open the
> > file and see the numbers written out in XML, then we should
> > reconsider the validity of the use case. To do this with mzData
> > 1.05 you would have to use the supplemental data vector (the
> > alternative Angel suggested).
> >
> > The supplemental data vectors hold any type of XSD data type
> > including normal XML. However in mzData 1.05, the binary vectors
> > are not optional, so you have to populate them to comply with the
> > spec - even if you repeat the information in the supplemental
> > vector.
> >
> > The suggested 'white space separated list' is not a valid XML data
> > type, so if we want to keep with the XSD standard for validation,
> > the peak lists have to be in markup like:
> >
> > <peak>
> >   <mz>
> >     <float>0.1</float>
> >   </mz>
> >   <inten>
> >     <float>100.1</float>
> >   </inten>
> > </peak>
> >
> > or something similar. Other semantics could reduce the verbosity,
> > but the basic idea is that we can only use valid XSD data types.
> >
> > As we move to dataXML, we will need to store other data objects
> > besides mass spectra (MRM chromatograms for example), so we will
> > have to come up with a more general data section regardless of the
> > data types allowed. During this design phase we should decide what
> > data types we want.
> >
> > As a historical note, the previous (current) LC-MS standard format
> > uses netCDF as the data representation which is fully binary and
> > utterly unreadable in any respect without an API. Thus this
> > situation has existed in mass spectrometry for quite some time. The
> > readability of these files has never been viewed as a serious
> > weakness, although the 1.5-2x increase in file size over the
> > original vendor file was the source of constant complaint.
> >
> > Just as a note for your comment #3, this is not so straightforward.
> > If the instrument collects data using an Intel chip, floating-point
> > raw data will most likely have an IEEE-754 representation. So any
> > time you have a number in a file like 0.1, the internal
> > representation was originally different (0.1 cannot be exactly
> > represented in IEEE-754). When you read from the file into an IEEE
> > standard format, it will not be 0.1 in any of the math you do.
> >
> > Let the PSI-MS team know what requirements you would like to see
> > the HUPO standards meet. If there is strong user support for
> > missing features, the team will include them in the development
> > roadmap.
> >
> > Let's keep the discussion of improvements going!
> >
> > Randy
> >
> >
> > -----Original Message-----
> > From: psi...@li... [mailto:psi...@li...]
> > On Behalf Of Coleman, Michael
> > Sent: Tuesday, September 19, 2006 4:39 PM
> > To: psi...@li...
> > Subject: [Psidev-ms-dev] Why base64?
> >
> > Hi,
> >
> > Does anyone know why base64 encoding is being used for peak mz and
> > intensity values in the mzData format? It appears to me that there
> > are three significant disadvantages to doing so:
> >
> > 1. Loss of readability. One of the primary reasons to use XML in
> > the first place is that it is human-readable--one can in principle
> > inspect and understand its contents with any text editor.
> > Base64-encoding peak data destroys this transparency. (It also
> > makes it more difficult to write scripts to process the data.)
> >
> > 2. Increased file size. At least for our spectra, it appears that a
> > compressed (gzip/etc) ms2 file is about 15% smaller than the
> > equivalent mzData file with the single-precision (32-bit) encoding,
> > and 22% smaller than the double-precision version. The
> > *uncompressed* single-precision mzData file is about 15% smaller
> > than the uncompressed ms2 file; the double-precision version is
> > almost twice as large. (These figures are for 'gzip' default
> > compression.)
> >
> > (Currently our ms2 files have mz values rounded to one decimal
> > place and intensity values with about 4-5 significant places.)
> >
> > 3. Potential loss of precision information. For example, with
> > single-precision encoding, a value originally given as 12345.1
> > might be encoded as 12345.0996. It's not easy to see from that
> > encoding that the original value was given with one decimal place.
> > Worse still, if the original value is significant to more than
> > 7-or-so digits and it gets 32-bit encoded, precision will be lost,
> > probably in a way not immediately apparent to the user. (32-bit
> > encoding will probably be a temptation, given the size of the
> > 64-bit encoding.)
> >
> > Even if base64-encoding cannot be dropped at this point, it seems
> > like it would be useful to add a "no encode" option, which would
> > present peak data as the obvious whitespace-separated list of
> > numeric values.
> >
> > Am I missing something here? I could not find any discussion of
> > this issue on the list.
> >
> > --Mike
> >
> >
> > Mike Coleman, Scientific Programmer, +1 816 926 4419
> > Stowers Institute for Biomedical Research
> > 1000 E. 50th St., Kansas City, MO 64110, USA
> >
> > _______________________________________________
> > Psidev-ms-dev mailing list
> > Psi...@li...
> > https://lists.sourceforge.net/lists/listinfo/psidev-ms-dev
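
For anyone who wants to check Mike's point 3 and Randy's IEEE-754 remark for themselves, here is a minimal Python sketch. It is not taken from the thread or from any mzData library: the helper names (encode_peaks, decode_peaks, parse_text_list) are made up for illustration, and little-endian byte order is an assumption. It packs a few values as 32-bit or 64-bit IEEE-754 floats, base64-encodes them, and compares the round trip with a plain whitespace-separated list.

```python
import base64
import struct
from decimal import Decimal


def encode_peaks(values, precision=32):
    """Pack floats as IEEE-754 (little-endian assumed) and base64-encode them."""
    code = "f" if precision == 32 else "d"
    raw = struct.pack("<%d%s" % (len(values), code), *values)
    return base64.b64encode(raw).decode("ascii")


def decode_peaks(text, precision=32):
    """Reverse of encode_peaks: base64 text back to a list of Python floats."""
    raw = base64.b64decode(text)
    size, code = (4, "f") if precision == 32 else (8, "d")
    return list(struct.unpack("<%d%s" % (len(raw) // size, code), raw))


def parse_text_list(text):
    """Parse a whitespace-separated list such as an xs:list of xs:float."""
    return [float(token) for token in text.split()]


# Hypothetical m/z values; 12345.1 is the example from the thread.
mz = [12345.1, 0.1, 445.12]

single = decode_peaks(encode_peaks(mz, 32), 32)
double = decode_peaks(encode_peaks(mz, 64), 64)

# 32-bit round trip: the one-decimal-place value comes back changed.
print(single[0])      # 12345.099609375
# 64-bit round trip: the Python floats survive unchanged.
print(double == mz)   # True

# Even the 64-bit float written as 0.1 is not the decimal value 0.1.
print(Decimal(0.1) == Decimal("0.1"))  # False

# The plain-text alternative needs no decoding step beyond split().
print(parse_text_list("1.1 1.2 1.3"))
```

The 32-bit result matches the "12345.0996" figure quoted above. The 64-bit round trip preserves the in-memory value, but as Randy notes, that value was never exactly the decimal number written in the file.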