From: Randy J. <rkj...@in...> - 2006-09-20 15:27:55
Hi,

This works quite nicely!

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified" attributeFormDefault="unqualified">
  <xs:element name="root">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="MyList">
          <xs:simpleType>
            <xs:list itemType="xs:float"/>
          </xs:simpleType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

Validates:

<?xml version="1.0" encoding="UTF-8"?>
<root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="list.xsd">
  <MyList>1.1 1.2 1.3</MyList>
</root>

Any thoughts about the use of this in the schema?

Randy

-----Original Message-----
From: Geer, Lewis (NIH/NLM/NCBI) [E] [mailto:le...@nc...]
Sent: Wednesday, September 20, 2006 10:27 AM
To: Randy Julian; psi...@li...
Subject: RE: [Psidev-ms-dev] Why base64?

Hi,

XML-schema does allow space delimited lists:

<xsd:simpleType name="listOfMyIntType">
  <xsd:list itemType="integer"/>
</xsd:simpleType>

<listOfMyInt>20003 15037 95977 95945</listOfMyInt>

Lewis

> -----Original Message-----
> From: Randy Julian [mailto:rkj...@in...]
> Sent: Wednesday, September 20, 2006 10:12 AM
> To: psi...@li...
> Subject: Re: [Psidev-ms-dev] Why base64?
>
> This is a very interesting question which has come up several times
> before. As we work to develop dataXML (mzData 2.0) we should take all
> of these concerns into consideration.
>
> Originally, mzData had both a binary and regular XML notation for both
> data vectors. The XML-schema data types were tested by most of the
> vendors who did not see the file size compression benefits you mention
> because they did not feel they had the ability to round either of the
> vectors in the way you suggest.
> Since the use case 'user opens mzData file with notepad and sees
> peaks' was not viewed as a major request, the vendors unanimously voted
> the non-binary arrays out for size and performance reasons (see the
> meeting notes from the PSI meeting in Nice).
>
> The loss of readability may now have larger consequences than we
> considered back then. Steve Stein's comments are good ones. If we now
> have broad enough adoption that we want to be able to open the file and
> see the numbers written out in XML, then we should reconsider the
> validity of the use case. To do this with mzData 1.05 you would have to
> use the supplemental data vector (the alternative Angel suggested).
>
> The supplemental data vectors hold any type of XSD data type including
> normal XML. However in mzData 1.05, the binary vectors are not
> optional, so you have to populate them to comply with the spec - even
> if you repeat the information in the supplemental vector.
>
> The suggested 'white space separated list' is not a valid XML data
> type, so if we want to keep with the XSD standard for validation, the
> peak lists have to be in markup like:
>
> <peak>
>   <mz>
>     <float>0.1</float>
>   </mz>
>   <inten>
>     <float>100.1</float>
>   </inten>
> </peak>
>
> or something similar. Other semantics could reduce the verbosity, but
> the basic idea is that we can only use valid XSD data types.
>
> As we move to dataXML, we will need to store other data objects besides
> mass spectra (MRM chromatograms for example), so we will have to come
> up with a more general data section regardless of the data types
> allowed. During this design phase we should decide what data types we
> want.
>
> As a historical note, the previous (current) LC-MS standard format uses
> netCDF as the data representation which is fully binary and utterly
> unreadable in any respect without an API. Thus this situation has
> existed in mass spectrometry for quite some time.
> The readability of these files has never been viewed as a serious
> weakness, although the 1.5-2x increase in file size over the original
> vendor file was the source of constant complaint.
>
> Just as a note for your comment #3, this is not so straightforward. If
> the instrument collects data using an Intel chip, floating-point raw
> data will most likely have an IEEE-754 representation. So any time you
> have a number in a file like 0.1, the internal representation was
> originally different (0.1 cannot be exactly represented in IEEE-754).
> When you read from the file into an IEEE standard format, it will not
> be 0.1 in any of the math you do.
>
> Let the PSI-MS team know what requirements you would like to see the
> HUPO standards meet. If there is strong user support for missing
> features, the team will include them in the development roadmap.
>
> Let's keep the discussion of improvements going!
>
> Randy
>
>
> -----Original Message-----
> From: psi...@li...
> [mailto:psi...@li...] On Behalf Of Coleman, Michael
> Sent: Tuesday, September 19, 2006 4:39 PM
> To: psi...@li...
> Subject: [Psidev-ms-dev] Why base64?
>
> Hi,
>
> Does anyone know why base64 encoding is being used for peak mz and
> intensity values in the mzData format? It appears to me that there are
> three significant disadvantages to doing so:
>
> 1. Loss of readability. One of the primary reasons to use XML in the
> first place is that it is human-readable--one can in principle inspect
> and understand its contents with any text editor. Base64-encoding peak
> data destroys this transparency. (It also makes it more difficult to
> write scripts to process the data.)
>
> 2. Increased file size. At least for our spectra, it appears that a
> compressed (gzip/etc) ms2 file is about 15% smaller than the equivalent
> mzData file with the single-precision (32-bit) encoding, and 22%
> smaller than the double-precision version.
> The *uncompressed* single-precision mzData file is about 15% smaller
> than the uncompressed ms2 file; the double-precision version is almost
> twice as large. (These figures are for 'gzip' default compression.)
>
> (Currently our ms2 files have mz values rounded to one decimal place
> and intensity values with about 4-5 significant places.)
>
> 3. Potential loss of precision information. For example, with
> single-precision encoding, a value originally given as 12345.1 might be
> encoded as 12345.0996. It's not easy to see from that encoding that
> the original value was given with one decimal place. Worse still, if
> the original value is significant to more than 7-or-so digits and it
> gets 32-bit encoded, precision will be lost, probably in a way not
> immediately apparent to the user. (32-bit encoding will probably be a
> temptation, given the size of the 64-bit encoding.)
>
> Even if base64-encoding cannot be dropped at this point, it seems like
> it would be useful to add a "no encode" option, which would present
> peak data as the obvious whitespace-separated list of numeric values.
>
> Am I missing something here? I could not find any discussion of this
> issue on the list.
>
> --Mike
>
>
> Mike Coleman, Scientific Programmer, +1 816 926 4419
> Stowers Institute for Biomedical Research
> 1000 E. 50th St., Kansas City, MO 64110, USA
>
> _______________________________________________
> Psidev-ms-dev mailing list
> Psi...@li...
> https://lists.sourceforge.net/lists/listinfo/psidev-ms-dev
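
For readers following comment #3 in the thread, here is a minimal Python sketch (not part of the original messages) of the single-precision round trip and the base64 packing being discussed. The helper names to_float32, encode_base64 and decode_base64 are hypothetical, and the little-endian byte order is an assumption made for illustration; the actual mzData encoding rules are not modeled here.

```python
import base64
import struct


def to_float32(x):
    """Round-trip a Python float (64-bit) through IEEE-754 single precision."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def encode_base64(values, fmt="f"):
    """Pack values as little-endian floats and base64-encode the bytes.

    fmt is a struct format character: 'f' for 32-bit, 'd' for 64-bit.
    Little-endian is an assumption for this sketch.
    """
    raw = struct.pack("<%d%s" % (len(values), fmt), *values)
    return base64.b64encode(raw).decode("ascii")


def decode_base64(text, fmt="f"):
    """Inverse of encode_base64: base64 text back to a list of floats."""
    raw = base64.b64decode(text)
    count = len(raw) // struct.calcsize("<" + fmt)
    return list(struct.unpack("<%d%s" % (count, fmt), raw))


# Example values taken from the thread (12345.1, 0.1, 100.1).
mz = [12345.1, 0.1, 100.1]

as_text = " ".join(repr(v) for v in mz)   # whitespace-separated list
as_b64_32 = encode_base64(mz, "f")        # single precision
as_b64_64 = encode_base64(mz, "d")        # double precision

print("text   :", as_text)
print("b64/32 :", as_b64_32, decode_base64(as_b64_32, "f"))
print("b64/64 :", as_b64_64, decode_base64(as_b64_64, "d"))
print("12345.1 as float32:", to_float32(12345.1))
```

With these inputs, 12345.1 comes back as 12345.099609375 after the 32-bit round trip, which matches the 12345.0996 quoted above. The three values take 16 base64 characters at 32 bits and 32 characters at 64 bits, and the 64-bit round trip returns the original Python floats unchanged. The text form keeps the decimal places as written.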