Re: [Psidev-ms-dev] Separate binary file for very large data sets?

On 4/20/07, Coleman, Michael <MK...@st...> wrote:
>
>  Angel, thanks for the clarification.
>

no prob

> I agree that there'd generally be no reason not to zlib-compress--the base64
> string is already quite opaque.  The only factors I'd see would be (a)
> occasionally zlib will make strings larger (though not by too much), and
>

this does happen occasionally for spectra without any "real" data in them,
but as you say, the increase is small and this is an edge case anyway. As
for (b), I sincerely doubt that any instrument vendor will use dataXML as
their native format, so there will always be some sort of native-format to
dataXML conversion process, which usually lives on a "smart" data analysis
machine ;)

-angel
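
To put a rough number on point (a), here is a minimal sketch in Python using
the standard zlib module. It compares how much zlib grows or shrinks a payload
with no repeated structure against a highly repetitive one. Both payloads and
the helper name zlib_overhead are made up for illustration and are not taken
from any real spectrum or from the mzData/mzXML tooling.

```python
import zlib

def zlib_overhead(raw: bytes) -> int:
    """Return compressed size minus raw size (positive means zlib grew the data)."""
    return len(zlib.compress(raw)) - len(raw)

# Hypothetical payloads: 256 distinct byte values stand in for data with no
# repeated structure; a run of zeros stands in for an empty baseline.
no_structure = bytes(range(256))
zeros = bytes(4096)

print("256 distinct bytes:", zlib_overhead(no_structure))
print("4096 zero bytes:   ", zlib_overhead(zeros))
```

A positive number means the compressed form is larger than the input; for the
structureless payload the growth is only a few bytes of zlib framing.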

> (b) whether it'd be a burden for dumb instruments to have to know zlib.
>
> Mike
>
>
>  -----Original Message-----
> *From:* Brian Pratt [mailto:bri...@in...]
> *Sent:* Friday, April 20, 2007 3:04 PM
> *To:* 'Angel Pizarro'; Coleman, Michael
> *Cc:* psi...@li...; 'Andreas Römpp'
> *Subject:* RE: [Psidev-ms-dev] Separate binary file for very large data
> sets?
>
> Angel is correct - I did mean compression (base64 encoding actually bloats
> the data a bit).
>
> I believe the proposed next version of mzData adopts Patrick Pedrioli's
> mzXML 3.0 technique of optionally first compressing (with zlib) the binary
> data before encoding it (with base64).  I'd argue for making it mandatory,
> since it's not like there's any loss of human-readability, and the files do
> shrink admirably.
>
> - Brian
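
A minimal sketch of the compress-then-encode order described above, in Python
with only the standard library. The function names, the choice of
little-endian 32-bit floats, and the sample peak values are assumptions for
illustration; they are not taken from the mzXML or mzData schemas.

```python
import base64
import struct
import zlib

def encode_peaks(values, compress=True):
    """Pack floats as little-endian float32, optionally zlib-compress, then base64-encode."""
    raw = struct.pack(f"<{len(values)}f", *values)
    if compress:
        raw = zlib.compress(raw)
    return base64.b64encode(raw).decode("ascii")

def decode_peaks(text, compressed=True):
    """Reverse encode_peaks: base64-decode, optionally inflate, then unpack floats."""
    raw = base64.b64decode(text)
    if compressed:
        raw = zlib.decompress(raw)
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))

# Hypothetical intensity array: mostly baseline zeros with a few peaks.
intensities = [0.0] * 1000
intensities[100] = 1523.5
intensities[101] = 987.25
intensities[640] = 42.0

plain = encode_peaks(intensities, compress=False)
packed = encode_peaks(intensities, compress=True)
print("base64 only:", len(plain), "chars; zlib + base64:", len(packed), "chars")
assert decode_peaks(packed) == intensities
```

The uncompressed string is longer than the 4000 raw bytes because base64 emits
four characters for every three input bytes; that is the bloat mentioned
above. How much zlib recovers depends entirely on the data.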
>
>  ------------------------------
> *From:* psi...@li... [mailto:
> psi...@li...] *On Behalf Of *Angel Pizarro
> *Sent:* Friday, April 20, 2007 12:43 PM
> *To:* Coleman, Michael
> *Cc:* psi...@li...; Brian Pratt; Andreas Römpp
> *Subject:* Re: [Psidev-ms-dev] Separate binary file for very large data
> sets?
>
> Brian is referring to the way mzXML 3.0 (and the new dataXML format)
> zlib-compress the spectra before they are "base64-encoded". While not an
> official part of the mzData 1.05 schema, I believe there are several groups
> that use this technique in-house for storage of mzData files.
>
> -angel
>
>
> On 4/20/07, Coleman, Michael <MK...@st...> wrote:
> >
> > If by "compressed" you mean "base64-encoded", I think it's important to
> > use the latter term, to avoid giving the wrong impression.  As far as I
> > know, compression is not a feature--nor a goal--of mzData.
> >
> > For what it's worth, I encountered my first mzData file in a work
> > situation this week.  It's 2.7 times as large as the corresponding ms2
> > file.
> >
> > Mike
> >
> >
> >
> > > -----Original Message-----
> > > From: psi...@li...
> > > [mailto:psi...@li... ] On
> > > Behalf Of Brian Pratt
> > > Sent: Friday, April 20, 2007 2:02 PM
> > > > To: 'Andreas Römpp'; psi...@li...
> > > Subject: Re: [Psidev-ms-dev] Separate binary file for very
> > > large data sets?
> > >
> > >
> > > I wonder if it wouldn't make as much sense to treat the
> > > mzData file as the
> > > "binary file" and come up with a sort of summary schema of
> > > your own that
> > > could point into the mzData file.  You'd get maximum reuse of
> > > community
> > > source code that way.
> > >
> > > But first, I'd say try it with straight-up mzData with
> > > compressed peak lists
> > > and see if you really need to go to the bother of a separate
> > > file.  I'm
> > > guessing you'll be pleasantly surprised.  Plus, I really,
> > > really dislike the
> > > use of interdependent files - one or the other is forever
> > > getting out of
> > > synch, lost, renamed, etc.
> > >
> > > Hope this helps,
> > >
> > > Brian Pratt
> > > www.insilicos.com
> > >
> > > -----Original Message-----
> > > From: psi...@li...
> > > [mailto:psi...@li... ] On
> > > > Behalf Of Andreas Römpp
> > > Sent: Friday, April 20, 2007 8:45 AM
> > > To: psi...@li...;
> > > And...@an...
> > > Subject: [Psidev-ms-dev] Separate binary file for very large
> > > data sets?
> > >
> > > Hello everybody,
> > >
> > > We develop software for imaging mass spectrometry in the
> > > framework of a
> > > project funded by the European Union. We intend to use
> > > dataXML as a standard
> > > format to exchange data between the different partner labs
> > > and also (as far
> > > as possible) as the internal data format for a joint
> > > processing software
> > > suite. However, we run into the problem of very large data
> > > sets which can
> > > > easily exceed 1GB (e.g. 256*256 pixels with one high resolution
> > > > mass spectrum each). Therefore we thought about storing the
> > > > spectrum data ('MassToChargeRatioArray' and 'IntensityArray') in
> > > > a separate binary file. This would make data handling much faster
> > > > and easier (e.g. when parsing the XML file). So instead of writing
> > > > the binary data in the XML file we plan to include a link to a
> > > > separate file (file location, start and end position of spectrum
> > > > in binary file).
> > > This problem is somewhat similar to the already discussed
> > > issue of an index
> > > file.
> > > Would it be possible to include such an option (external
> > > binary file) into
> > > the dataXML standard?
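
The linking scheme described above (file location plus start and end position
per spectrum) could be sketched as follows in Python. This is only an
illustration under stated assumptions: the function names, the little-endian
float64 layout, the file name, and the in-memory index are hypothetical and
are not part of dataXML or any proposed schema.

```python
import os
import struct
import tempfile

def append_array(handle, values):
    """Append values as little-endian float64 and return (start, end) byte offsets."""
    start = handle.tell()
    handle.write(struct.pack(f"<{len(values)}d", *values))
    return start, handle.tell()

def read_array(path, start, end):
    """Read back one array given the byte range recorded at write time."""
    with open(path, "rb") as handle:
        handle.seek(start)
        raw = handle.read(end - start)
    return list(struct.unpack(f"<{len(raw) // 8}d", raw))

# Hypothetical two-pixel "image": each pixel has an m/z array and an intensity array.
spectra = {
    "pixel_0_0": ([100.5, 200.25], [10.0, 20.0]),
    "pixel_0_1": ([150.0, 300.75, 450.5], [5.0, 15.0, 25.0]),
}

path = os.path.join(tempfile.mkdtemp(), "spectra.bin")
index = {}  # what the XML side would store in place of the binary data
with open(path, "wb") as out:
    for name, (mz, intensity) in spectra.items():
        index[name] = {
            "file": path,
            "mz": append_array(out, mz),
            "intensity": append_array(out, intensity),
        }

entry = index["pixel_0_1"]
print(read_array(entry["file"], *entry["intensity"]))
```

Only the small index has to be parsed to locate a spectrum; the arrays
themselves are read on demand with a single seek.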
> > >
> > > Best regards,
> > > Andreas
> > >
> > > --
> > > --------------------------------------------------------------
> > > --------------
> > > -------------
> > > Dr. Andreas Roempp
> > > Institute of Inorganic and Analytical Chemistry
> > > - Analytical Chemistry -
> > > Justus Liebig University Giessen
> > > Schubertstrasse 60, Build. 16
> > > D-35392 Giessen
> > > Germany
> > >
> > > phone:  +49-641-99 34802
> > > fax:    +49-641-99 34809
> > > email: And...@an...
> > > Internet: http://www.uni-giessen.de/analytik/
> > >
> > >
> > >
> > >
> > > --------------------------------------------------------------
> > > -----------
> > > This SF.net email is sponsored by DB2 Express Download DB2
> > > Express C - the
> > > FREE version of DB2 express and take control of your XML. No
> > > limits. Just
> > > data. Click to get it now.
> > > http://sourceforge.net/powerbar/db2/
> > > _______________________________________________
> > > Psidev-ms-dev mailing list
> > > Psi...@li...
> > > https://lists.sourceforge.net/lists/listinfo/psidev-ms-dev
> > >
> > >
> > > --------------------------------------------------------------
> > > -----------
> > > This SF.net email is sponsored by DB2 Express
> > > Download DB2 Express C - the FREE version of DB2 express and take
> > > control of your XML. No limits. Just data. Click to get it now.
> > > http://sourceforge.net/powerbar/db2/
> > > _______________________________________________
> > > Psidev-ms-dev mailing list
> > > Psi...@li...
> > > https://lists.sourceforge.net/lists/listinfo/psidev-ms-dev
> > >
> >
>
>
>
> --
> Angel Pizarro
> Director, Bioinformatics Facility
> Institute for Translational Medicine and Therapeutics
> University of Pennsylvania
> 806 BRB II/III
> 421 Curie Blvd.
> Philadelphia, PA 19104-6160
>
> P: 215-573-3736
> F: 215-573-9004
>
>

--
Angel Pizarro
Director, Bioinformatics Facility
Institute for Translational Medicine and Therapeutics
University of Pennsylvania
806 BRB II/III
421 Curie Blvd.
Philadelphia, PA 19104-6160

P: 215-573-3736
F: 215-573-9004