From: Geer, L. (NIH/NLM/NCBI) [E] <le...@nc...> - 2006-09-20 18:25:57
Hi, Jimmy,

Sorry, I should have said "whitespace delimited" instead of "space
delimited", where XML considers whitespace to be a carriage return, a
linefeed, a tab, and/or a space. As Michael implies, this means the
numbers can sit on different lines, and there is no reason the numbers
couldn't be grouped so that the first number is m/z, the second is
intensity, etc.

Lewis

> -----Original Message-----
> From: Jimmy Eng [mailto:jk...@gm...]
> Sent: Wednesday, September 20, 2006 1:12 PM
> To: psi...@li...
> Subject: Re: [Psidev-ms-dev] Why base64?
>
> I believe base64 encoding makes more sense for some large class of
> applications that will hopefully be digesting these files but I'm sure
> everyone can see the obvious benefits of plain text encoding of peak
> lists.
>
> The question I have is regarding the representation of space delimited
> lists as Lewis and Randy have drawn up. Does this address the needs
> of Michael, Steve, and Akhilesh and others? Hopefully they'll all
> chime in. My concern would be that having a horizontal, space
> separated list of numbers, where m/z and intensity will possibly be
> written in separate lists of floats and ints, doesn't really serve the
> notion of readability. Lots of folks are used to looking at lists of
> peaks as ordered in .mgf or .dta files and I'm not sure if a
> horizontal list of numbers (especially if it's 2 lists, one for m/z
> and one for intensity) gives you that same sense of readability. I
> don't really see any regular use case scenarios where people would be
> scrolling over to the 68th m/z in the list and then somehow counting
> over to the location of the 68th intensity to get its value.
>
> So _if_ this really doesn't address the needs of the folks who have
> concerns about the base64 encoding and would like to see plain
> text, speak up.
> The last thing the format needs is more complexity
> in the form of another optional way of representing the data that only
> a handful of people will ever end up using.
>
> - Jimmy
>
>
> On 9/20/06, Randy Julian <rkj...@in...> wrote:
> > Hi,
> >
> > This works quite nicely!
> >
> > <?xml version="1.0" encoding="UTF-8"?>
> > <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
> >            elementFormDefault="qualified" attributeFormDefault="unqualified">
> >   <xs:element name="root">
> >     <xs:complexType>
> >       <xs:sequence>
> >         <xs:element name="MyList">
> >           <xs:simpleType>
> >             <xs:list itemType="xs:float"/>
> >           </xs:simpleType>
> >         </xs:element>
> >       </xs:sequence>
> >     </xs:complexType>
> >   </xs:element>
> > </xs:schema>
> >
> > Validates:
> >
> > <?xml version="1.0" encoding="UTF-8"?>
> > <root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
> >       xsi:noNamespaceSchemaLocation="list.xsd">
> >   <MyList>1.1 1.2 1.3</MyList>
> > </root>
> >
> > Any thoughts about the use of this in the schema?
> >
> > Randy
> >
> > -----Original Message-----
> > From: Geer, Lewis (NIH/NLM/NCBI) [E] [mailto:le...@nc...]
> > Sent: Wednesday, September 20, 2006 10:27 AM
> > To: Randy Julian; psi...@li...
> > Subject: RE: [Psidev-ms-dev] Why base64?
> >
> > Hi,
> >
> > XML-schema does allow space delimited lists:
> >
> > <xsd:simpleType name="listOfMyIntType">
> >   <xsd:list itemType="xsd:integer"/>
> > </xsd:simpleType>
> >
> > <listOfMyInt>20003 15037 95977 95945</listOfMyInt>
> >
> > Lewis
> >
> >
> > > -----Original Message-----
> > > From: Randy Julian [mailto:rkj...@in...]
> > > Sent: Wednesday, September 20, 2006 10:12 AM
> > > To: psi...@li...
> > > Subject: Re: [Psidev-ms-dev] Why base64?
> > >
> > > This is a very interesting question which has come up several times
> > > before. As we work to develop dataXML (mzData 2.0) we should take all
> > > of these concerns into consideration.
> > >
> > > Originally, mzData had both a binary and regular XML notation for both
> > > data vectors. The XML-schema data types were tested by most of the
> > > vendors, who did not see the file size compression benefits you mention
> > > because they did not feel they had the ability to round either of the
> > > vectors in the way you suggest. Since the use case 'user opens mzData
> > > file with notepad and sees peaks' was not viewed as a major request,
> > > the vendors unanimously voted the non-binary arrays out for size and
> > > performance reasons (see the meeting notes from the PSI meeting in
> > > Nice).
> > >
> > > The loss of readability may now have larger consequences than we
> > > considered back then. Steve Stein's comments are good ones. If we now
> > > have broad enough adoption that we want to be able to open the file and
> > > see the numbers written out in XML, then we should reconsider the
> > > validity of the use case. To do this with mzData 1.05 you would have to
> > > use the supplemental data vector (the alternative Angel suggested).
> > >
> > > The supplemental data vectors hold any type of XSD data type including
> > > normal XML. However in mzData 1.05, the binary vectors are not
> > > optional, so you have to populate them to comply with the spec - even
> > > if you repeat the information in the supplemental vector.
> > >
> > > The suggested 'white space separated list' is not a valid XML data
> > > type, so if we want to keep with the XSD standard for validation, the
> > > peak lists have to be in markup like:
> > >
> > > <peak>
> > >   <mz>
> > >     <float>0.1</float>
> > >   </mz>
> > >   <inten>
> > >     <float>100.1</float>
> > >   </inten>
> > > </peak>
> > >
> > > or something similar. Other semantics could reduce the verbosity, but
> > > the basic idea is that we can only use valid XSD data types.
> > >
> > > As we move to dataXML, we will need to store other data objects besides
> > > mass spectra (MRM chromatograms for example), so we will have to come
> > > up with a more general data section regardless of the data types
> > > allowed. During this design phase we should decide what data types we
> > > want.
> > >
> > > As a historical note, the previous (current) LC-MS standard format uses
> > > netCDF as the data representation which is fully binary and utterly
> > > unreadable in any respect without an API. Thus this situation has
> > > existed in mass spectrometry for quite some time. The readability of
> > > these files has never been viewed as a serious weakness, although the
> > > 1.5-2x increase in file size over the original vendor file was the
> > > source of constant complaint.
> > >
> > > Just as a note for your comment #3, this is not so straightforward. If
> > > the instrument collects data using an Intel chip, floating-point raw
> > > data will most likely have an IEEE-754 representation. So any time you
> > > have a number in a file like 0.1, the internal representation was
> > > originally different (0.1 cannot be exactly represented in IEEE-754).
> > > When you read from the file into an IEEE standard format, it will not
> > > be 0.1 in any of the math you do.
> > >
> > > Let the PSI-MS team know what requirements you would like to see the
> > > HUPO standards meet. If there is strong user support for missing
> > > features, the team will include them in the development roadmap.
> > >
> > > Let's keep the discussion of improvements going!
> > >
> > > Randy
> > >
> > >
> > > -----Original Message-----
> > > From: psi...@li...
> > > [mailto:psi...@li...] On Behalf Of Coleman, Michael
> > > Sent: Tuesday, September 19, 2006 4:39 PM
> > > To: psi...@li...
> > > Subject: [Psidev-ms-dev] Why base64?
> > >
> > > Hi,
> > >
> > > Does anyone know why base64 encoding is being used for peak mz and
> > > intensity values in the mzData format? It appears to me that there are
> > > three significant disadvantages to doing so:
> > >
> > > 1. Loss of readability. One of the primary reasons to use XML in the
> > > first place is that it is human-readable--one can in principle inspect
> > > and understand its contents with any text editor. Base64-encoding peak
> > > data destroys this transparency. (It also makes it more difficult to
> > > write scripts to process the data.)
> > >
> > > 2. Increased file size. At least for our spectra, it appears that a
> > > compressed (gzip/etc) ms2 file is about 15% smaller than the equivalent
> > > mzData file with the single-precision (32-bit) encoding, and 22% smaller
> > > than the double-precision version. The *uncompressed* single-precision
> > > mzData file is about 15% smaller than the uncompressed ms2 file;
> > > the double-precision version is almost twice as large. (These figures
> > > are for 'gzip' default compression.)
> > >
> > > (Currently our ms2 files have mz values rounded to one decimal place and
> > > intensity values with about 4-5 significant places.)
> > >
> > > 3. Potential loss of precision information. For example, with
> > > single-precision encoding, a value originally given as 12345.1 might be
> > > encoded as 12345.0996. It's not easy to see from that encoding that the
> > > original value was given with one decimal place. Worse still, if the
> > > original value is significant to more than 7-or-so digits and it gets
> > > 32-bit encoded, precision will be lost, probably in a way not
> > > immediately apparent to the user. (32-bit encoding will probably be a
> > > temptation, given the size of the 64-bit encoding.)
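
Mike's point 3 can be reproduced with a few lines of Python. This is only a minimal sketch using the standard library: the little-endian byte order and the helper names encode_peaks and decode_peaks are assumptions made for illustration, not something taken from the mzData specification.

```python
import base64
import struct


def encode_peaks(values, fmt="f"):
    """Pack floats as little-endian IEEE-754 and base64-encode them.

    fmt is a struct code: "f" for 32-bit, "d" for 64-bit.
    The byte order is an assumption for this sketch.
    """
    raw = struct.pack(f"<{len(values)}{fmt}", *values)
    return base64.b64encode(raw).decode("ascii")


def decode_peaks(text, fmt="f"):
    """Inverse of encode_peaks: base64 text back to a list of floats."""
    raw = base64.b64decode(text)
    count = len(raw) // struct.calcsize(f"<{fmt}")
    return list(struct.unpack(f"<{count}{fmt}", raw))


original = 12345.1
as32 = decode_peaks(encode_peaks([original], "f"), "f")[0]
as64 = decode_peaks(encode_peaks([original], "d"), "d")[0]

print(repr(as32))  # 12345.099609375
print(repr(as64))  # 12345.1
```

The 32-bit round trip returns the nearest single-precision value, which is where the 12345.0996 in the example comes from, while the 64-bit round trip returns the same double that Python parsed from the text "12345.1".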
> > >
> > > Even if base64-encoding cannot be dropped at this point, it seems like
> > > it would be useful to add a "no encode" option, which would present peak
> > > data as the obvious whitespace-separated list of numeric values.
> > >
> > > Am I missing something here? I could not find any discussion of this
> > > issue on the list.
> > >
> > > --Mike
> > >
> > >
> > > Mike Coleman, Scientific Programmer, +1 816 926 4419
> > > Stowers Institute for Biomedical Research
> > > 1000 E. 50th St., Kansas City, MO 64110, USA
> > >
> > > -------------------------------------------------------------------------
> > > Take Surveys. Earn Cash. Influence the Future of IT
> > > Join SourceForge.net's Techsay panel and you'll get the chance to share your
> > > opinions on IT & business topics through brief surveys -- and earn cash
> > > http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
> > > _______________________________________________
> > > Psidev-ms-dev mailing list
> > > Psi...@li...
> > > https://lists.sourceforge.net/lists/listinfo/psidev-ms-dev
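
Lewis's point at the top of this message, that a whitespace-delimited list may run over several lines and may interleave m/z and intensity, can be sketched in Python as follows. The element name follows Randy's MyList example; the interleaved layout, the sample values, and the helper name read_peaks are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the element name follows the MyList example in the
# thread, but the interleaved m/z, intensity layout and the values are made up.
DOC = """<root>
  <MyList>
    100.1  2500
    200.2  1300
    300.3   870
  </MyList>
</root>"""


def read_peaks(xml_text, tag="MyList"):
    """Return (m/z, intensity) pairs from a whitespace-delimited list."""
    element = ET.fromstring(xml_text).find(tag)
    # str.split() with no argument splits on runs of spaces, tabs,
    # carriage returns and linefeeds, which covers XML whitespace.
    numbers = [float(token) for token in element.text.split()]
    if len(numbers) % 2:
        raise ValueError("expected an even count of numbers")
    # Even positions are m/z, odd positions are intensity.
    return list(zip(numbers[0::2], numbers[1::2]))


for mz, intensity in read_peaks(DOC):
    print(f"{mz:10.1f} {intensity:10.0f}")
```

Written one pair per line like this, the list reads much like the .mgf or .dta layout Jimmy mentions while still being a single list as far as the parser is concerned.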