From: Coleman, M. <MK...@St...> - 2006-09-19 20:39:31
Hi,

Does anyone know why base64 encoding is being used for peak mz and intensity values in the mzData format? It appears to me that there are three significant disadvantages to doing so:

1. Loss of readability. One of the primary reasons to use XML in the first place is that it is human-readable--one can in principle inspect and understand its contents with any text editor. Base64-encoding peak data destroys this transparency. (It also makes it more difficult to write scripts to process the data.)

2. Increased file size. At least for our spectra, it appears that a compressed (gzip/etc) ms2 file is about 15% smaller than the equivalent mzData file with the single-precision (32-bit) encoding, and 22% smaller than the double-precision version. The *uncompressed* single-precision mzData file is about 15% smaller than the uncompressed ms2 file; the double-precision version is almost twice as large. (These figures are for 'gzip' default compression.)

(Currently our ms2 files have mz values rounded to one decimal place and intensity values with about 4-5 significant digits.)

3. Potential loss of precision information. For example, with single-precision encoding, a value originally given as 12345.1 might be encoded as 12345.0996. It's not easy to see from that encoding that the original value was given with one decimal place. Worse still, if the original value is significant to more than 7-or-so digits and it gets 32-bit encoded, precision will be lost, probably in a way not immediately apparent to the user. (32-bit encoding will probably be a temptation, given the size of the 64-bit encoding.)

Even if base64-encoding cannot be dropped at this point, it seems like it would be useful to add a "no encode" option, which would present peak data as the obvious whitespace-separated list of numeric values.

Am I missing something here? I could not find any discussion of this issue on the list.

--Mike

Mike Coleman, Scientific Programmer, +1 816 926 4419
Stowers Institute for Biomedical Research
1000 E. 50th St., Kansas City, MO 64110, USA
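
The precision concern in point 3 can be reproduced with a few lines of Python. This is only a minimal sketch: it assumes little-endian 32-bit floats, and the helper names encode_float32 and decode_float32 are made up for illustration rather than taken from the mzData specification or any reader library.

```python
import base64
import struct


def encode_float32(values):
    """Pack values as little-endian 32-bit floats, then base64-encode them."""
    raw = struct.pack("<%df" % len(values), *values)
    return base64.b64encode(raw).decode("ascii")


def decode_float32(text):
    """Reverse encode_float32: base64 text back to a list of Python floats."""
    raw = base64.b64decode(text)
    return list(struct.unpack("<%df" % (len(raw) // 4), raw))


original = 12345.1
encoded = encode_float32([original])
decoded = decode_float32(encoded)[0]

print(encoded)           # opaque base64 text, as it would sit in the file
print("%.4f" % decoded)  # 12345.0996
```

The round trip returns the nearest single-precision value, so the fact that the source value had exactly one decimal place is no longer visible in the stored number.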
From: Angel P. <an...@ma...> - 2006-09-19 21:38:56
Hi Mike,

I have some answers that may or may not address all of your concerns.

On Tuesday 19 September 2006 16:39, Coleman, Michael wrote:
> Hi,
>
> Does anyone know why base64 encoding is being used for peak mz and
> intensity values in the mzData format? It appears to me that there are
> three significant disadvantages to doing so:
>
> 1. Loss of readability. One of the primary reasons to use XML in the
> first place is that it is human-readable--one can in principle inspect
> and understand its contents with any text editor. Base64-encoding peak
> data destroys this transparency. (It also makes it more difficult to
> write scripts to process the data.)

There actually is a place for "human readable spectra" in the mzData format, but really, who reads individual mz and intensity values? The situation is akin to microarray data: does anyone really need to see each individual probe value? The normal usage of this data is to load the entire result set into a processing or search algorithm, or to turn it into a nice spectrum graph, all of which is handled by software that has no problem decoding the strings.

> 2. Increased file size. At least for our spectra, it appears that a
> compressed (gzip/etc) ms2 file is about 15% smaller than the equivalent
> mzData file with the single-precision (32-bit) encoding, and 22% smaller
> than the double-precision version. The *uncompressed* single-precision
> mzData file is about 15% smaller than the uncompressed ms2 file;
> the double-precision version is almost twice as large. (These figures
> are for 'gzip' default compression.)

Not a fair comparison. Most of the space in an mzData file is actually taken up by the human-readable parameters and parameter values of the spectra. I'll have to do some tests to see the actual space taken by spectra, but my "feeling" is that the byte and base64 encoding is actually a better compression of the data than gzipped XML with space-delimited floats.

> (Currently our ms2 files have mz values rounded to one decimal place and
> intensity values with about 4-5 significant places.)
>
> 3. Potential loss of precision information. For example, with
> single-precision encoding, a value originally given as 12345.1 might be
> encoded as 12345.0996. It's not easy to see from that encoding that the
> original value was given with one decimal place. Worse-still, if the
> original value is significant to more than 7-or-so digits and it gets
> 32-bit encoded, precision will be lost, probably in a way not
> immediately apparent to the user. (32-bit encoding will probably be a
> temptation, given the size of the 64-bit encoding.)

Actually the situation may be reversed. Thermofinnigan, for example, stores measured values coming off of the instrument as double-precision floats, later formatting the numbers as needed with respect to the specific instrument's limit of detection. 12345.1 may have originally been 12345.099923123 in the vendor's proprietary format.

> Even if base64-encoding cannot be dropped at this point, it seems like
> it would be useful to add a "no encode" option, which would present peak
> data as the obvious whitespace-separated list of numeric values.

See my remark about who really needs to see the raw numbers. I wrote an email a few days ago showing how to decode the base64 arrays in Ruby, and there is also a Java example posted with the mzData specification.

--
Angel Pizarro
Director, Bioinformatics Facility
Institute for Translational Medicine and Therapeutics
University of Pennsylvania
806 BRB II/III
421 Curie Blvd.
Philadelphia, PA 19104-6160
P: 215-573-3736
F: 215-573-9004
E: an...@ma...
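
The size question above is easy to test on one's own data. The following Python sketch is an assumption-laden illustration, not a measurement: the peaks are synthetic random numbers, the text layout ("mz intensity" per line, mz to one decimal place) and the helper names text_form and base64_form are invented here, and real spectra may well behave differently. It prints the raw and gzipped size of each representation so the two claims can be compared directly.

```python
import base64
import gzip
import random
import struct


def text_form(mzs, intensities):
    """One 'mz intensity' pair per line, mz rounded to one decimal place."""
    lines = ["%.1f %.5g" % (mz, inten) for mz, inten in zip(mzs, intensities)]
    return "\n".join(lines).encode("ascii")


def base64_form(values, code):
    """code is 'f' for 32-bit or 'd' for 64-bit little-endian floats."""
    raw = struct.pack("<%d%s" % (len(values), code), *values)
    return base64.b64encode(raw)


random.seed(0)  # synthetic peaks, not real spectra
mzs = sorted(random.uniform(100.0, 2000.0) for _ in range(1000))
intensities = [random.uniform(1.0, 1e5) for _ in range(1000)]

candidates = {
    "text": text_form(mzs, intensities),
    "base64/32": base64_form(mzs, "f") + base64_form(intensities, "f"),
    "base64/64": base64_form(mzs, "d") + base64_form(intensities, "d"),
}

# Columns: representation, raw bytes, bytes after default gzip compression.
for name, blob in candidates.items():
    print(name, len(blob), len(gzip.compress(blob)))
```

Only the peak data is compared here; the surrounding XML parameters, which may dominate a real mzData file, are deliberately left out.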
From: Randy J. <rkj...@in...> - 2006-09-20 14:13:06
This is a very interesting question which has come up several times before. As we work to develop dataXML (mzData 2.0) we should take all of these concerns into consideration.

Originally, mzData had both a binary and a regular XML notation for both data vectors. The XML-schema data types were tested by most of the vendors, who did not see the file size compression benefits you mention because they did not feel they had the ability to round either of the vectors in the way you suggest. Since the use case 'user opens mzData file with notepad and sees peaks' was not viewed as a major request, the vendors unanimously voted the non-binary arrays out for size and performance reasons (see the meeting notes from the PSI meeting in Nice).

The loss of readability may now have larger consequences than we considered back then. Steve Stein's comments are good ones. If we now have broad enough adoption that we want to be able to open the file and see the numbers written out in XML, then we should reconsider the validity of the use case. To do this with mzData 1.05 you would have to use the supplemental data vector (the alternative Angel suggested).

The supplemental data vectors hold any type of XSD data type, including normal XML. However, in mzData 1.05 the binary vectors are not optional, so you have to populate them to comply with the spec, even if you repeat the information in the supplemental vector.

The suggested 'white space separated list' is not a valid XML data type, so if we want to keep with the XSD standard for validation, the peak lists have to be in markup like:

  <peak>
    <mz>
      <float>0.1</float>
    </mz>
    <inten>
      <float>100.1</float>
    </inten>
  </peak>

or something similar. Other semantics could reduce the verbosity, but the basic idea is that we can only use valid XSD data types.

As we move to dataXML, we will need to store other data objects besides mass spectra (MRM chromatograms, for example), so we will have to come up with a more general data section regardless of the data types allowed. During this design phase we should decide what data types we want.

As a historical note, the previous (current) LC-MS standard format uses netCDF as the data representation, which is fully binary and utterly unreadable in any respect without an API. Thus this situation has existed in mass spectrometry for quite some time. The readability of these files has never been viewed as a serious weakness, although the 1.5-2x increase in file size over the original vendor file was the source of constant complaint.

Just as a note for your comment #3, this is not so straightforward. If the instrument collects data using an Intel chip, floating-point raw data will most likely have an IEEE-754 representation. So any time you have a number in a file like 0.1, the internal representation was originally different (0.1 cannot be exactly represented in IEEE-754). When you read from the file into an IEEE standard format, it will not be 0.1 in any of the math you do.

Let the PSI-MS team know what requirements you would like to see the HUPO standards meet. If there is strong user support for missing features, the team will include them in the development roadmap.

Let's keep the discussion of improvements going!

Randy

-----Original Message-----
From: psi...@li... [mailto:psi...@li...] On Behalf Of Coleman, Michael
Sent: Tuesday, September 19, 2006 4:39 PM
To: psi...@li...
Subject: [Psidev-ms-dev] Why base64?
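
The IEEE-754 remark can be checked directly. As a small Python sketch (the variable names are illustrative only), the standard decimal module prints the exact value that a binary float actually holds, for both the 64-bit and the 32-bit representation of 0.1:

```python
from decimal import Decimal
import struct

# 0.1 as a 64-bit double (Python's native float).
as_double = 0.1

# 0.1 forced through a 32-bit single and read back.
as_single = struct.unpack("<f", struct.pack("<f", 0.1))[0]

print(Decimal(as_double))      # exact value stored in the double
print(Decimal(as_single))      # exact value stored in the single
print(as_double == as_single)  # False
```

Neither stored value equals the decimal 0.1, and the two differ from each other, which is why a text "0.1" and its binary encodings are not interchangeable in later arithmetic.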
From: Geer, L. \(NIH/NLM/NCBI\) [E] <le...@nc...> - 2006-09-20 14:27:04
Hi,

XML-schema does allow space delimited lists:

  <xsd:simpleType name="listOfMyIntType">
    <xsd:list itemType="integer"/>
  </xsd:simpleType>

  <listOfMyInt>20003 15037 95977 95945</listOfMyInt>

Lewis
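
Reading such a list back takes very little code. A minimal Python sketch using only the standard library (it parses the element text; it does not validate against the schema, which the standard library cannot do):

```python
import xml.etree.ElementTree as ET

document = "<listOfMyInt>20003 15037 95977 95945</listOfMyInt>"

element = ET.fromstring(document)
# An xsd:list value is whitespace-separated, so str.split() is enough.
values = [int(token) for token in element.text.split()]

print(values)  # [20003, 15037, 95977, 95945]
```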
From: Randy J. <rkj...@in...> - 2006-09-20 15:27:55
Hi,

This works quite nicely!

  <?xml version="1.0" encoding="UTF-8"?>
  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
             elementFormDefault="qualified"
             attributeFormDefault="unqualified">
    <xs:element name="root">
      <xs:complexType>
        <xs:sequence>
          <xs:element name="MyList">
            <xs:simpleType>
              <xs:list itemType="xs:float"/>
            </xs:simpleType>
          </xs:element>
        </xs:sequence>
      </xs:complexType>
    </xs:element>
  </xs:schema>

Validates:

  <?xml version="1.0" encoding="UTF-8"?>
  <root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:noNamespaceSchemaLocation="list.xsd">
    <MyList>1.1 1.2 1.3</MyList>
  </root>

Any thoughts about the use of this in the schema?

Randy
From: Jimmy E. <jk...@gm...> - 2006-09-20 17:12:23
I believe base64 encoding makes more sense for some large class of applications that will hopefully be digesting these files, but I'm sure everyone can see the obvious benefits of plain-text encoding of peak lists.

The question I have is regarding the representation of space-delimited lists as Lewis and Randy have drawn up. Does this address the needs of Michael, Steve, Akhilesh and others? Hopefully they'll all chime in.

My concern would be that having a horizontal, space-separated list of numbers, where m/z and intensity will possibly be written in separate lists of floats and ints, doesn't really serve the notion of readability. Lots of folks are used to looking at lists of peaks as ordered in .mgf or .dta files, and I'm not sure a horizontal list of numbers (especially if it's 2 lists, one for m/z and one for intensity) gives you that same sense of readability. I don't really see any regular use case where people would be scrolling over to the 68th m/z in the list and then somehow counting over to the location of the 68th intensity to get its value.

So _if_ this really doesn't address the needs of the folks who have concerns about the base64 encoding and would like to see plain text, speak up. The last thing the format needs is more complexity in the form of another optional way of representing the data that only a handful of people will ever end up using.

- Jimmy
Influence the Future of IT > > Join SourceForge.net's Techsay panel and you'll get the > > chance to share your > > opinions on IT & business topics through brief surveys -- and > > earn cash > > http://www.techsay.com/default.php?page=join.php&p=sourceforge > > &CID=DEVDEV > > _______________________________________________ > > Psidev-ms-dev mailing list > > Psi...@li... > > https://lists.sourceforge.net/lists/listinfo/psidev-ms-dev > > > > > > -------------------------------------------------------------- > > ----------- > > Take Surveys. Earn Cash. Influence the Future of IT > > Join SourceForge.net's Techsay panel and you'll get the > > chance to share your > > opinions on IT & business topics through brief surveys -- and > > earn cash > > http://www.techsay.com/default.php?page=join.php&p=sourceforge > > &CID=DEVDEV > > _______________________________________________ > > Psidev-ms-dev mailing list > > Psi...@li... > > https://lists.sourceforge.net/lists/listinfo/psidev-ms-dev > > > > > ------------------------------------------------------------------------- > Take Surveys. Earn Cash. Influence the Future of IT > Join SourceForge.net's Techsay panel and you'll get the chance to share your > opinions on IT & business topics through brief surveys -- and earn cash > http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV > _______________________________________________ > Psidev-ms-dev mailing list > Psi...@li... > https://lists.sourceforge.net/lists/listinfo/psidev-ms-dev > |
From: Geer, L. \(NIH/NLM/NCBI\) [E] <le...@nc...> - 2006-09-20 18:25:57
|
Hi, Jimmy,

Sorry, I should have said "whitespace delimited" instead of "space delimited", where XML considers whitespace to be a carriage return, a linefeed, a tab, and/or a space. As Michael implies, this means the numbers can sit on different lines, and there is no reason the numbers could not be grouped so that the first number is m/z, the second is intensity, etc.

Lewis

> -----Original Message-----
> From: Jimmy Eng [mailto:jk...@gm...]
> Sent: Wednesday, September 20, 2006 1:12 PM
> To: psi...@li...
> Subject: Re: [Psidev-ms-dev] Why base64?
>
> I believe base64 encoding makes more sense for some large class of
> applications that will hopefully be digesting these files but I'm sure
> everyone can see the obvious benefits of plain text encoding of peak
> lists.
>
> The question I have is regarding the representation of space delimited
> lists as Lewis and Randy have drawn up. Does this address the needs
> of Michael, Steve, and Akhilesh and others? Hopefully they'll all
> chime in. My concern would be that having a horizontal, space
> separated list of numbers, where m/z and intensity will possibly be
> written in separate lists of floats and ints, doesn't really serve the
> notion of readability. Lots of folks are used to looking at lists of
> peaks as ordered in .mgf or .dta files and I'm not sure if a
> horizontal list of numbers (especially if it's 2 lists, one for m/z
> and one for intensity) gives you that same sense of readability. I
> don't really see any regular use case scenarios where people would be
> scrolling over to the 68th m/z in the list and then somehow counting
> over to the location of the 68th intensity to get its value.
>
> So _if_ this really doesn't address the needs of the folks who have
> concerns about the base64 encoding and would like to see plain
> text, speak up. The last thing the format needs is more complexity
> in the form of another optional way of representing the data that only
> a handful of people will ever end up using.
>
> - Jimmy
|
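To make the grouping idea above concrete, here is a minimal Python sketch of how a reader might consume a whitespace-delimited list in which m/z and intensity values alternate. The function name `parse_interleaved_peaks` and the sample values are hypothetical; nothing here is part of mzData or of any PSI tool, and the pairing convention (m/z first, then intensity) is only the one suggested in the message above.

```python
def parse_interleaved_peaks(text):
    """Parse a whitespace-delimited list of alternating m/z and intensity values.

    Hypothetical helper for illustration only; not part of any PSI API.
    """
    # str.split() with no argument splits on any run of whitespace,
    # so spaces, tabs, carriage returns and linefeeds are all accepted.
    values = [float(token) for token in text.split()]
    if len(values) % 2 != 0:
        raise ValueError("expected an even number of values (m/z, intensity pairs)")
    # Even positions are m/z, odd positions are intensities.
    return list(zip(values[0::2], values[1::2]))


# Hypothetical element content: one peak per line, m/z first, then intensity.
example = """
  100.1 2500
  200.2 1200
  300.3 800
"""

peaks = parse_interleaved_peaks(example)
for mz, intensity in peaks:
    print(mz, intensity)
```

Written one pair per line, such a list reads much like the peak lines people already know from .dta files, while still being a single list of numbers as far as the schema is concerned.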
From: Angel P. <an...@ma...> - 2006-09-22 20:37:18
|
Jimmy Eng wrote:
> I believe base64 encoding makes more sense for some large class of
> applications that will hopefully be digesting these files but I'm sure
> everyone can see the obvious benefits of plain text encoding of peak
> lists.
>
> The question I have is regarding the representation of space delimited
> lists as Lewis and Randy have drawn up. Does this address the needs
> of Michael, Steve, and Akhilesh and others? Hopefully they'll all
> chime in. My concern would be that having a horizontal, space
> separated list of numbers, where m/z and intensity will possibly be
> written in separate lists of floats and ints, doesn't really serve the
> notion of readability. Lots of folks are used to looking at lists of
> peaks as ordered in .mgf or .dta files and I'm not sure if a
> horizontal list of numbers (especially if it's 2 lists, one for m/z
> and one for intensity) gives you that same sense of readability. I
> don't really see any regular use case scenarios where people would be
> scrolling over to the 68th m/z in the list and then somehow counting
> over to the location of the 68th intensity to get its value.
>
> So _if_ this really doesn't address the needs of the folks who have
> concerns about the base64 encoding and would like to see plain
> text, speak up. The last thing the format needs is more complexity
> in the form of another optional way of representing the data that only
> a handful of people will ever end up using.
>
> - Jimmy

All excellent points. Let me see if I can recap the set of arguments:

1) For high-throughput and computational tasks, base64 encoding is fast, robust and reasonable with respect to size
2) Text formats are not useful unless they are formatted in an easily digestible fashion
3) Point #2 often conflicts with point #1
4) Ambiguity in a format is universally seen as a "bad thing"

The best suggestion I could think of would be to just go ahead and officially endorse our current standard operating procedures. By this I mean, first and foremost, that the official format be restricted to binary encoded data arrays. This is the format officially supported by hardware and software vendors. Second, that we endorse one of the /de facto/ plain text formats (MS2 or MGF) as the best way to encode plain text data, *and* (this is the important bit) that the official PSI APIs provide export to the endorsed plain text format. Notice that I didn't say import, since this operation is a lossy one, as covered in other posts. Or if we do provide import routines, they come with the large caveat that the transformation may have been lossy.

The problem I see with this is that I do not know if MS2 or MGF handle data other than MS2 or from multiple analyzers and detectors. They also generally have a much more restricted set of annotations.

Good idea? Bad idea? Something to discuss in DC at least...

-angel
|
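As a rough illustration of the export direction proposed above (binary encoded arrays in, plain text peak lines out), here is a small Python sketch. It assumes the arrays are base64-encoded IEEE-754 floats with a known precision and byte order; the function names, the default of little-endian 32-bit values, and the output layout of one "m/z intensity" pair per line are all assumptions made for the example, not the behaviour of any official PSI API or a complete MS2 or MGF writer.

```python
import base64
import struct


def _format(count, precision, little_endian):
    # Build a struct format such as "<3f" (three little-endian 32-bit floats).
    code = "f" if precision == 32 else "d"
    return ("<" if little_endian else ">") + str(count) + code


def encode_array(values, precision=32, little_endian=True):
    """Pack floats into a base64 string (used here only to build test input)."""
    packed = struct.pack(_format(len(values), precision, little_endian), *values)
    return base64.b64encode(packed).decode("ascii")


def decode_array(b64_text, precision=32, little_endian=True):
    """Decode a base64 string into a list of Python floats."""
    raw = base64.b64decode(b64_text)
    size = 4 if precision == 32 else 8
    count = len(raw) // size
    # struct.unpack raises struct.error if len(raw) is not a multiple of size.
    return list(struct.unpack(_format(count, precision, little_endian), raw))


def export_peak_lines(mz_b64, inten_b64, precision=32):
    """Return one 'm/z intensity' text line per peak (hypothetical layout)."""
    mzs = decode_array(mz_b64, precision)
    intens = decode_array(inten_b64, precision)
    if len(mzs) != len(intens):
        raise ValueError("m/z and intensity arrays differ in length")
    return ["%.4f %.1f" % (mz, inten) for mz, inten in zip(mzs, intens)]


mz_text = encode_array([100.5, 200.25])
inten_text = encode_array([10.0, 20.5])
for line in export_peak_lines(mz_text, inten_text):
    print(line)
```

The number of decimal places written in `export_peak_lines` is itself a choice that discards information, which is one reason the reverse (import) direction has to be treated as lossy.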
From: Randy J. <rkj...@in...> - 2006-09-23 18:08:25
|
My thought on this is that we should generalize the "data" section to be a 'class' with a 'type'. Kent will go into this in much more detail in DC, but the basic idea, as started in the teleconferences, is that the instrument could be described as a process (protocols) with inputs and outputs (parameterSets). An output should be defined as a class which could take on any number of 'types', which can include all of the XSD data types if we so choose.

Generalization like this means that we can define a protocol which takes the base64 data as an input parameter and produces a lossy text representation as an output. The parameters of this protocol and its description would tell us how the conversion was done and what the remaining precision is (significant figures, etc.).

In the version of dataXML we will be discussing in DC, the input parameterSet for the above protocol could be another dataXML document with the base64 encoded spectra inside, located at a specified URL, and the protocol producing the "peaklist" could simply 'refer' to the source input rather than duplicating it. By this method, the use case where we want a human readable peaklist available in text format, derived from an original 'raw' instrument acquisition file located in some repository somewhere, can be achieved with the 'peaklist' document containing the least amount of redundant information possible. If we get this right, an XQuery compatible link to the 'original' data can be made, allowing almost RDF-like traversal of documents across repositories.

The cost of this flexibility is a more abstract schema (which we will review in DC), and much heavier reliance on the ontology. The result does not look like XML, but like RDF implemented in XML. This is all hard to digest without specific examples, so those coming to DC should be prepared to work through their favorite use case to make sure it's all working the way we want. If it is, the good news is that the language bindings and therefore the utilities will come very fast, since the APIs can be generated directly off the base UML (with the mandatory prestidigitation).

Randy
|
From: Brian P. <bri...@in...> - 2006-09-20 17:14:17
|
Hello All, > This works quite nicely! <-snip-> > <MyList>1.1 1.2 1.3</MyList> Sure, but in practice it's not really all that readable: make that list some realistic length and you're going to need to snork it up into a table so that you can find the n'th item in the list to match it with the n'th item in some other list. At that point, you have once again passed the file through a software tool and may as well reap the benefits of base64 encoding. On the topic of software support tools, the TPP (and the IPP) furnish a fairly broad set of tools that read mzData, including the ability to dump it to ASCII. Excellent point by Randy about ASCII representations giving a false sense of computational precision. We can't ever forget that under the hood these boxen are base2. BTW if there really are integer data to be had, then mzData/mzXML ought to be able to hold those data as integer. In the converters I've worked with I don't recall seeing any such scan data, though. (AFAIK "ion counts" are really just inferred from a digitized analog sensor signal, there's not actually anything in there going "I see one ion, two ions, three ions..." - but I'm no MS hardware expert). Brian Pratt www.insilicos.com/IPP > -----Original Message----- > From: psi...@li... > [mailto:psi...@li...] On > Behalf Of Randy Julian > Sent: Wednesday, September 20, 2006 8:23 AM > To: 'Geer, Lewis (NIH/NLM/NCBI) [E]'; > psi...@li... > Subject: Re: [Psidev-ms-dev] Why base64? > > Hi, > > This works quite nicely! 
> > <?xml version="1.0" encoding="UTF-8"?> > <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" > elementFormDefault="qualified" attributeFormDefault="unqualified"> > <xs:element name="root"> > <xs:complexType> > <xs:sequence> > <xs:element name="MyList"> > <xs:simpleType> > <xs:list > itemType="xs:float"/> > </xs:simpleType> > </xs:element> > </xs:sequence> > </xs:complexType> > </xs:element> > </xs:schema> > > Validates: > > <?xml version="1.0" encoding="UTF-8"?> > <root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" > xsi:noNamespaceSchemaLocation="list.xsd"> > <MyList>1.1 1.2 1.3</MyList> > </root> > > Any thoughts about the use of this in the schema? > > Randy > > -----Original Message----- > From: Geer, Lewis (NIH/NLM/NCBI) [E] [mailto:le...@nc...] > Sent: Wednesday, September 20, 2006 10:27 AM > To: Randy Julian; psi...@li... > Subject: RE: [Psidev-ms-dev] Why base64? > > Hi, > > XML-schema does allow space delimited lists: > > <xsd:simpleType name="listOfMyIntType"> > <xsd:list itemType="integer"/> > </xsd:simpleType> > > <listOfMyInt>20003 15037 95977 95945</listOfMyInt> > > Lewis > > > > -----Original Message----- > > From: Randy Julian [mailto:rkj...@in...] > > Sent: Wednesday, September 20, 2006 10:12 AM > > To: psi...@li... > > Subject: Re: [Psidev-ms-dev] Why base64? > > > > This is a very interesting question which has come up several > > times before. > > As we work to develop dataXML (mzData 2.0) we should take > all of these > > concerns into consideration. > > > > Originally, mzData had both a binary and regular XML notation > > for both data > > vectors. The XML-schema data types where tested by most of > > the vendors who > > did not see the file size compression benefits you mention > > because they did > > not feel they had the ability to round either of the vectors > > in the way you > > suggest. 
Since the use case: 'user opens mzData file with > > notepad and see > > peaks' was not viewed as a major request, the vendors > > unanimously voted the > > non-binary arrays out for size and performance reasons (see > > the meeting > > notes from the PSI meeting in Nice). > > > > The loss of readability may now have larger consequences than > > we considered > > back then. Steve Stein's comments are good ones. I we now > have broad > > enough adoption that we want to be able to open the file and > > see the numbers > > written out in XML, then we should reconsider the validity of > > the use case. > > To do this with mzData 1.05 you would have to use the > > supplemental data > > vector (the alternative Angel suggested). > > > > The supplemental data vectors hold any type of XSD data > type including > > normal XML. However in mzData 1.05, the binary vectors are > > not optional, so > > you have to populate them to comply with the spec - even if > > you repeat the > > information in the supplemental vector. > > > > The suggested 'white space separated list' is not a valid XML > > data type, so > > if we want to keep with the XSD standard for validation, the > > peak lists have > > to be in markup like: > > > > <peak> > > <mz> > > <float>0.1</float> > > </mz> > > <inten> > > <float>100.1</float> > > </inten> > > </peak> > > > > or something similar. Other semantics could reduce the > > verbosity, but the > > basic idea is that we can only use valid XSD data types. > > > > As we move to dataXML, we will need to store other data > > objects besides mass > > spectra (MRM chromatograms for example), so we will have to > > come up with a > > more general data section regardless of the data types > > allowed. During this > > design phase we should decide what data types we want. 
> >
> > As a historical note, the previous (current) LC-MS standard format uses netCDF as the data representation, which is fully binary and utterly unreadable in any respect without an API. Thus this situation has existed in mass spectrometry for quite some time. The readability of these files has never been viewed as a serious weakness, although the 1.5-2x increase in file size over the original vendor file was the source of constant complaint.
> >
> > Just as a note for your comment #3, this is not so straightforward. If the instrument collects data using an Intel chip, floating-point raw data will most likely have an IEEE-754 representation. So any time you have a number in a file like 0.1, the internal representation was originally different (0.1 cannot be exactly represented in IEEE-754). When you read from the file into an IEEE standard format, it will not be 0.1 in any of the math you do.
> >
> > Let the PSI-MS team know what requirements you would like to see the HUPO standards meet. If there is strong user support for missing features, the team will include them in the development roadmap.
> >
> > Let's keep the discussion of improvements going!
> >
> > Randy
> >
> >
> > -----Original Message-----
> > From: psi...@li... [mailto:psi...@li...] On Behalf Of Coleman, Michael
> > Sent: Tuesday, September 19, 2006 4:39 PM
> > To: psi...@li...
> > Subject: [Psidev-ms-dev] Why base64?
> >
> > Hi,
> >
> > Does anyone know why base64 encoding is being used for peak mz and intensity values in the mzData format? It appears to me that there are three significant disadvantages to doing so:
> >
> > 1. Loss of readability. One of the primary reasons to use XML in the first place is that it is human-readable--one can in principle inspect and understand its contents with any text editor. Base64-encoding peak data destroys this transparency. (It also makes it more difficult to write scripts to process the data.)
> >
> > 2. Increased file size. At least for our spectra, it appears that a compressed (gzip/etc) ms2 file is about 15% smaller than the equivalent mzData file with the single-precision (32-bit) encoding, and 22% smaller than the double-precision version. The *uncompressed* single-precision mzData file is about 15% smaller than the uncompressed ms2 file; the double-precision version is almost twice as large. (These figures are for 'gzip' default compression.)
> >
> > (Currently our ms2 files have mz values rounded to one decimal place and intensity values with about 4-5 significant places.)
> >
> > 3. Potential loss of precision information. For example, with single-precision encoding, a value originally given as 12345.1 might be encoded as 12345.0996. It's not easy to see from that encoding that the original value was given with one decimal place. Worse still, if the original value is significant to more than 7-or-so digits and it gets 32-bit encoded, precision will be lost, probably in a way not immediately apparent to the user. (32-bit encoding will probably be a temptation, given the size of the 64-bit encoding.)
> >
> > Even if base64-encoding cannot be dropped at this point, it seems like it would be useful to add a "no encode" option, which would present peak data as the obvious whitespace-separated list of numeric values.
> >
> > Am I missing something here? I could not find any discussion of this issue on the list.
> >
> > --Mike
> >
> >
> > Mike Coleman, Scientific Programmer, +1 816 926 4419
> > Stowers Institute for Biomedical Research
> > 1000 E. 50th St., Kansas City, MO 64110, USA
> >
> > _______________________________________________
> > Psidev-ms-dev mailing list
> > Psi...@li...
> > https://lists.sourceforge.net/lists/listinfo/psidev-ms-dev
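The precision question raised in item 3 of the original message, and the IEEE-754 remark in the reply, can both be checked with a few lines of Python. This is only a sketch using the standard library; the helper names are hypothetical and the little-endian packing is an assumption. It shows that 0.1 has no exact binary representation, that 12345.1 stored as a 32-bit float comes back as 12345.0996..., and that a value with more than about 7 significant digits is silently altered by 32-bit encoding.

```python
import struct
from decimal import Decimal


def roundtrip_float32(x):
    """Store x as an IEEE-754 32-bit float and read it back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def roundtrip_float64(x):
    """Store x as an IEEE-754 64-bit float and read it back."""
    return struct.unpack("<d", struct.pack("<d", x))[0]


# The double nearest to 0.1 is not exactly 0.1; Decimal shows its exact value.
print(Decimal(0.1))

# One decimal place in the source text, 32-bit encoded.
single = roundtrip_float32(12345.1)
print(single)            # the nearest float32, near 12345.0996
print("%.1f" % single)   # rounding to one place recovers the text form

# More than ~7 significant digits: the stored value changes.
print(roundtrip_float32(1234567.89))

# 64-bit encoding returns the same double that the text was parsed into.
print(roundtrip_float64(12345.1) == 12345.1)
```

Both observations in the thread hold at once: the decimal text is already an approximation once it is parsed into a binary float, and the 32-bit encoding adds a second, larger rounding step while dropping the information about how many decimal places were originally given.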