Re: [Psidev-ms-dev] Why base64?


Hi,

This works quite nicely!

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
elementFormDefault="qualified" attributeFormDefault="unqualified">
	<xs:element name="root">
		<xs:complexType>
			<xs:sequence>
				<xs:element name="MyList">
					<xs:simpleType>
						<xs:list itemType="xs:float"/>
					</xs:simpleType>
				</xs:element>
			</xs:sequence>
		</xs:complexType>
	</xs:element>
</xs:schema>

Validates:

<?xml version="1.0" encoding="UTF-8"?>
<root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="list.xsd">
	<MyList>1.1 1.2 1.3</MyList>
</root>
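
For anyone who wants to see what reading such a list looks like on the consumer side, here is a minimal Python sketch. It assumes the instance document shown above (with the XML declaration left off) and uses only the standard library, so it parses the list but does not validate against list.xsd. The helper name read_float_list is made up for illustration.

```python
import xml.etree.ElementTree as ET

# The instance document from above, without the XML declaration.
INSTANCE = """<root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="list.xsd">
	<MyList>1.1 1.2 1.3</MyList>
</root>"""


def read_float_list(xml_text, tag="MyList"):
    """Return the whitespace-separated floats stored in the first <tag> child."""
    root = ET.fromstring(xml_text)
    node = root.find(tag)
    if node is None or node.text is None:
        return []
    # An xs:list is whitespace-delimited, so str.split() with no
    # argument handles spaces, tabs and newlines alike.
    return [float(token) for token in node.text.split()]


values = read_float_list(INSTANCE)
print(values)
```

Note that this only splits and converts the tokens; a schema-aware parser would be needed to enforce the xs:float item type.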

Any thoughts about the use of this in the schema?

Randy

-----Original Message-----
From: Geer, Lewis (NIH/NLM/NCBI) [E] [mailto:le...@nc...] 
Sent: Wednesday, September 20, 2006 10:27 AM
To: Randy Julian; psi...@li...
Subject: RE: [Psidev-ms-dev] Why base64?

Hi,

XML-schema does allow space delimited lists:

<xsd:simpleType name="listOfMyIntType">
  <xsd:list itemType="xsd:integer"/>
</xsd:simpleType>

<listOfMyInt>20003 15037 95977 95945</listOfMyInt>

Lewis

> -----Original Message-----
> From: Randy Julian [mailto:rkj...@in...] 
> Sent: Wednesday, September 20, 2006 10:12 AM
> To: psi...@li...
> Subject: Re: [Psidev-ms-dev] Why base64?
> 
> This is a very interesting question which has come up several times
> before.  As we work to develop dataXML (mzData 2.0) we should take all
> of these concerns into consideration.
> 
> Originally, mzData had both a binary and a regular XML notation for both
> data vectors.  The XML-schema data types were tested by most of the
> vendors, who did not see the file size compression benefits you mention
> because they did not feel they had the ability to round either of the
> vectors in the way you suggest.  Since the use case 'user opens mzData
> file with notepad and sees peaks' was not viewed as a major request, the
> vendors unanimously voted the non-binary arrays out for size and
> performance reasons (see the meeting notes from the PSI meeting in
> Nice).
> 
> The loss of readability may now have larger consequences than we
> considered back then.  Steve Stein's comments are good ones.  If we now
> have broad enough adoption that we want to be able to open the file and
> see the numbers written out in XML, then we should reconsider the
> validity of the use case.  To do this with mzData 1.05 you would have to
> use the supplemental data vector (the alternative Angel suggested).
> 
> The supplemental data vectors hold any type of XSD data type including
> normal XML.  However, in mzData 1.05 the binary vectors are not
> optional, so you have to populate them to comply with the spec, even if
> you repeat the information in the supplemental vector.
> 
> The suggested 'white space separated list' is not a valid XML data type,
> so if we want to keep with the XSD standard for validation, the peak
> lists have to be in markup like:
> 
> <peak>
>   <mz>
>     <float>0.1</float>
>   </mz>
>   <inten>
>     <float>100.1</float>
>   </inten>
> </peak>
> 
> or something similar.  Other semantics could reduce the verbosity, but
> the basic idea is that we can only use valid XSD data types.
> 
> As we move to dataXML, we will need to store other data objects besides
> mass spectra (MRM chromatograms for example), so we will have to come up
> with a more general data section regardless of the data types allowed.
> During this design phase we should decide what data types we want.
> 
> As a historical note, the previous (current) LC-MS standard format uses
> netCDF as the data representation, which is fully binary and utterly
> unreadable in any respect without an API.  Thus this situation has
> existed in mass spectrometry for quite some time.  The readability of
> these files has never been viewed as a serious weakness, although the
> 1.5-2x increase in file size over the original vendor file was the
> source of constant complaint.
> 
> Just as a note for your comment #3, this is not so straightforward.  If
> the instrument collects data using an Intel chip, floating-point raw
> data will most likely have an IEEE-754 representation.  So any time you
> have a number in a file like 0.1, the internal representation was
> originally different (0.1 cannot be exactly represented in IEEE-754).
> When you read from the file into an IEEE standard format, it will not be
> 0.1 in any of the math you do.
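
The point about 0.1 can be checked directly. The following is a minimal Python sketch, assuming an IEEE-754 platform (which is what CPython uses for float on common hardware); the helper name as_float32 is made up for illustration. Decimal is used only to display the exact binary value that is actually stored.

```python
import struct
from decimal import Decimal


def as_float32(value):
    """Round a Python float (64-bit) to IEEE-754 single precision."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


# Decimal(float) converts the stored binary value exactly, with no rounding,
# so these show what "0.1" really is once it has been read into a float.
exact64 = Decimal(0.1)
exact32 = Decimal(as_float32(0.1))

print("64-bit:", exact64)
print("32-bit:", exact32)
print("equal to decimal 0.1?", exact64 == Decimal("0.1"))
```

Neither the 64-bit nor the 32-bit value is exactly one tenth, and the two differ from each other.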
> 
> Let the PSI-MS team know what requirements you would like to see the
> HUPO standards meet.  If there is strong user support for missing
> features, the team will include them in the development roadmap.
> 
> Let's keep the discussion of improvements going!
> 
> Randy
> 
> 
> -----Original Message-----
> From: psi...@li... [mailto:psi...@li...] On Behalf Of Coleman, Michael
> Sent: Tuesday, September 19, 2006 4:39 PM
> To: psi...@li...
> Subject: [Psidev-ms-dev] Why base64?
> 
> Hi,
> 
> Does anyone know why base64 encoding is being used for peak mz and
> intensity values in the mzData format?  It appears to me that there are
> three significant disadvantages to doing so:
> 
> 1.  Loss of readability.  One of the primary reasons to use XML in the
> first place is that it is human-readable--one can in principle inspect
> and understand its contents with any text editor.  Base64-encoding peak
> data destroys this transparency.  (It also makes it more difficult to
> write scripts to process the data.)
> 
> 2.  Increased file size.  At least for our spectra, it appears that a
> compressed (gzip/etc) ms2 file is about 15% smaller than the equivalent
> mzData file with the single-precision (32-bit) encoding, and 22% smaller
> than the double-precision version.  The *uncompressed* single-precision
> mzData file is about 15% smaller than the uncompressed ms2 file; the
> double-precision version is almost twice as large.  (These figures are
> for 'gzip' default compression.)
> 
> (Currently our ms2 files have mz values rounded to one decimal place and
> intensity values with about 4-5 significant places.)
> 
> 3.  Potential loss of precision information.  For example, with
> single-precision encoding, a value originally given as 12345.1 might be
> encoded as 12345.0996.  It's not easy to see from that encoding that the
> original value was given with one decimal place.  Worse still, if the
> original value is significant to more than 7-or-so digits and it gets
> 32-bit encoded, precision will be lost, probably in a way not
> immediately apparent to the user.  (32-bit encoding will probably be a
> temptation, given the size of the 64-bit encoding.)
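
The 12345.1 example in point 3 can be reproduced with a short Python sketch. It assumes IEEE-754 single precision for the 32-bit encoding; the helper name roundtrip_float32 and the second test value are made up for illustration.

```python
import struct


def roundtrip_float32(value):
    """Store a Python float (64-bit) as 32-bit IEEE-754 and read it back."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


original = 12345.1
stored = roundtrip_float32(original)

print(repr(stored))        # the value a reader of the 32-bit data gets back
print("%.4f" % stored)     # the same value shown to four decimal places
print(stored == original)  # whether the one-decimal value survived

# A hypothetical value carrying more significant digits than
# single precision can hold.
precise = 123456.789
print(repr(roundtrip_float32(precise)))
```

The stored value is the nearest 32-bit float to 12345.1, and nothing in it records that the input had exactly one decimal place.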
> 
> Even if base64-encoding cannot be dropped at this point, it seems like
> it would be useful to add a "no encode" option, which would present peak
> data as the obvious whitespace-separated list of numeric values.
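
To make the two options concrete, here is a rough Python sketch that writes the same values both ways: as base64 over packed 32-bit floats, and as a whitespace-separated list. The peak values are made up, the function names are hypothetical, and little-endian byte order is an assumption made for the sketch rather than something taken from the mzData spec.

```python
import base64
import struct

# Made-up m/z values, rounded to one decimal place as in the ms2 files.
peaks_mz = [445.1, 446.1, 519.2]


def encode_base64_f32(values):
    """Pack values as little-endian 32-bit floats, then base64-encode."""
    raw = struct.pack("<%df" % len(values), *values)
    return base64.b64encode(raw).decode("ascii")


def decode_base64_f32(text):
    """Inverse of encode_base64_f32."""
    raw = base64.b64decode(text)
    return list(struct.unpack("<%df" % (len(raw) // 4), raw))


def encode_plain(values):
    """The 'no encode' form: a whitespace-separated list of numbers."""
    return " ".join(repr(v) for v in values)


def decode_plain(text):
    """Inverse of encode_plain."""
    return [float(token) for token in text.split()]


b64_text = encode_base64_f32(peaks_mz)
plain_text = encode_plain(peaks_mz)

print(b64_text, len(b64_text))
print(plain_text, len(plain_text))
print(decode_base64_f32(b64_text))
print(decode_plain(plain_text))
```

Base64 spends a fixed 4 characters per 3 bytes regardless of how many digits are meaningful, while the plain list grows with the number of digits written; which one ends up smaller therefore depends on the data and on compression.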
> 
> Am I missing something here?  I could not find any discussion of this
> issue on the list.  
> 
> --Mike
> 
> 
> Mike Coleman, Scientific Programmer, +1 816 926 4419
> Stowers Institute for Biomedical Research
> 1000 E. 50th St., Kansas City, MO  64110,  USA
> 
> --------------------------------------------------------------
> -----------
> Take Surveys. Earn Cash. Influence the Future of IT
> Join SourceForge.net's Techsay panel and you'll get the 
> chance to share your
> opinions on IT & business topics through brief surveys -- and 
> earn cash
> http://www.techsay.com/default.php?page=join.php&p=sourceforge
> &CID=DEVDEV
> _______________________________________________
> Psidev-ms-dev mailing list
> Psi...@li...
> https://lists.sourceforge.net/lists/listinfo/psidev-ms-dev
> 
> 