From: Eric D. <ede...@sy...> - 2013-10-18 22:40:46
|
Hi everyone, let’s have a PSI MS/PI WG teleconference call Tuesday at the usual time.

We will have a discussion about the ongoing numpress work to dramatically improve the compression options for mzML.

08:00 San Francisco
11:00 New York
16:00 London
17:00 Geneva

+ Germany: 08001012079
+ Switzerland: 0800000860
+ UK: 08081095644
+ USA: 1-866-832-8490

Generic international: +44 2083222500 (UK number)

access code: 297427 # |
From: Eric D. <Eri...@sy...> - 2013-10-22 03:05:06
|
Hi everyone, just a reminder that we will have a PSI MS/PI WG teleconference to discuss mzML data compression and related topics in 12 hr.

08:00 San Francisco
11:00 New York
16:00 London
17:00 Geneva

+ Germany: 08001012079
+ Switzerland: 0800000860
+ UK: 08081095644
+ USA: 1-866-832-8490

Generic international: +44 2083222500 (UK number)

access code: 297427 # |
From: Mathias W. <wa...@in...> - 2013-10-22 15:35:48
|
Hi Eric,

we, the OpenMS guys (foremost those working with SWATH files), are eager to make use of such a neat compression. To help us make the right calls when plugging the lib(s) into our mzML infrastructure in OpenMS, I was hoping to get some insights during the call. Does one of the docs you were talking about describe where numpress hooks into the mzML schema, which modes are envisaged, at which point the CV terms are supposed to come into play, and so on? If so, can you send me the respective files?

No OpenMS developer is working on mzML on a regular basis right now, so integration into OpenMS will take some time. It would be great to have a head start before you get this finalized.

Cheers, best regards,
--
Mathias Walzer

University of Tuebingen
Wilhelm Schickard Institute for Computer Science
Division for Applied Bioinformatics
Room C313, Sand 14, D-72076 Tuebingen

phone: +49 (0)7071-29 70437 and +49 (0)7071-29 87649
fax: +49 (0)7071-29 5152

This message is 100% biodegradable!

_______________________________________________
Psidev-ms-dev mailing list
Psi...@li...
https://lists.sourceforge.net/lists/listinfo/psidev-ms-dev |
From: Eric D. <Eri...@sy...> - 2013-10-23 06:47:43
|
Hi Mathias,

I do not have this information at hand. We certainly need to come up with it soon. The code, and what documentation there is so far, are at:

https://github.com/fickludd/ms-numpress

Maybe Johan has some additional documentation?

Thanks,
Eric |
From: Eric D. <ede...@sy...> - 2013-10-22 19:16:22
|
PSI MS/PI telecall on numpress mzML compression

Present: Juan Antonio, Andy, Eric, Brian, Fredrik, Johan, Mathias

- Reviewed the top summary document
  o is missing mz5 zlib
  o also show an mzMLdelta
  o numAllzlib .gz
- Brian added a bit of code to ProteoWizard to limit a potentially too-high loss. If the actual observed loss exceeds a settable threshold, the spectrum is written uncompressed.
- mz5 files > ~2 GB cannot be read by Pwiz due to a bug. Johan will report it.
- Judging from the read-speed figure, mzML with numpress still doesn't compete well with mz5.
- Andy wonders if there are optimizations to be had in the XML parsing. Are many objects being created and discarded during reading?
- Maximum observed delta error across all test files was 0.002 ppm.
- Johan has imzML code for ProteoWizard. If he can develop a good patch for the very latest trunk of ProteoWizard, Brian could probably apply it quickly.
- Andy raises the concern that if we are going to open the door to lossy compression, maybe we can do a whole lot better. Before the PSI approves any specific approach, we should do some more homework to make sure that anyone currently working on this problem is heard and can contribute, and that we officially approve the best approach. We don't want to be in a position where we release one version and then, a year from now, have another, much better compression version.

We will follow up by email and schedule another call.

Thanks,
Eric |
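The loss guard mentioned in the minutes (write the spectrum uncompressed when the observed loss exceeds a settable threshold) can be sketched roughly as below. This is only an illustration and not the ProteoWizard code: the fixed-point codec is a stand-in for numpress, and the names `ppm_error`, `lossy_encode`, `lossy_decode` and `encode_with_guard` are made up for this example.

```python
import struct


def ppm_error(original, decoded):
    """Largest relative error between two equal-length sequences, in ppm."""
    return max(
        (abs(d - o) / abs(o) * 1e6 for o, d in zip(original, decoded) if o != 0),
        default=0.0,
    )


def lossy_encode(values, scale):
    """Stand-in lossy codec (NOT numpress): fixed-point 32-bit integers."""
    return struct.pack(f"<{len(values)}i", *(round(v * scale) for v in values))


def lossy_decode(payload, scale):
    """Inverse of lossy_encode, up to the rounding loss."""
    count = len(payload) // 4
    return [i / scale for i in struct.unpack(f"<{count}i", payload)]


def encode_with_guard(values, scale, max_ppm):
    """Return (mode, payload); fall back to raw 64-bit floats if the loss is too high."""
    payload = lossy_encode(values, scale)
    if ppm_error(values, lossy_decode(payload, scale)) > max_ppm:
        return "uncompressed", struct.pack(f"<{len(values)}d", *values)
    return "lossy", payload
```

The guard measures the loss by actually decoding what was just encoded, so the threshold applies to the observed error of each spectrum rather than to a theoretical bound.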
From: Fredrik L. <Fre...@im...> - 2013-10-23 14:01:26
|
Hi All,

As much of the discussion has been going on off list, I thought that I could give some background, and also a proposal:

We have developed lightweight numerical compression schemes for binary mass spectrometry data, intended for compressing the binary arrays in mzML. They reduce the file sizes quite dramatically, while still maintaining the mzML file in text format. There is already support for these compression methods in several tools, and more will follow soon. We intend to submit a manuscript describing the compression and implementations shortly. Java and C++ libraries can be found here: https://github.com/fickludd/ms-numpress/. If anyone is interested in implementing support for the compression in their tool, please contact me and we can provide some input. There are also some implementations which haven't made it to the public versions yet.

The files produced are schematically and semantically valid mzML 1.1.0, since the compression methods are defined for the binary arrays using ontology terms that are in the current PSI ontology. However, readers that haven't implemented support for this compression method will break. So there is a need to consider this within the mzML standard to prevent a mess. The files which use new compression methods should be flagged in some way beyond the ontology version.

A feasible solution is that files that use any new compression scheme are marked as mzML 1.2. An addendum to the mzML documentation should describe this new feature. The schema and mapping file wouldn't need any updates, except for the version number. However, it would be possible to allow for double compression, i.e. the use of both numpress and zlib on a binary array, and that would require a change in the mapping file. While we wouldn't use double compression in our lab (mzML.gz with numpressed binaries is still more efficient when it comes to size), it could be good to leave the door open for that, as others may want to. So we propose an increased mzML version and possibly an updated mapping file. mzML files that are not compressed should still be indicated as 1.1.0.

There may well be other binary compression methods that are efficient and easy to implement, and if a new mzML version is released, it should include support for these. We do want stability when it comes to standard formats, and no short-term changes. It would therefore be good to have a public document process with a fixed deadline, sometime next year, in which anyone can propose compression methods to be included in this new mzML version. For evaluating compression methods we could supply the test files from different vendors and instrument types used in our manuscript, which can be used for benchmarking. If too many different compression methods are proposed, the PSI may choose the subset of methods that best facilitates different use cases, while striving to minimize the complexity of the standard.

There are also possibilities to represent mzML and other PSI formats in binary format, as is done efficiently in mz5, or in more compact text formats amenable to compression. However, I would propose to leave that to a separate process.

Thanks,
Fredrik |
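To give a feel for what "double compression" of a binary array means, here is a rough sketch that compares the base64 text length of an array encoded four ways. The numerical transform is a simple fixed-point delta encoding used as a stand-in, not the numpress algorithm, and the m/z axis is synthetic and perfectly regular, so the sizes it prints say nothing about real files. The names `fixed_point_deltas` and `encoded_length` are invented for this example.

```python
import base64
import struct
import zlib


def fixed_point_deltas(values, scale):
    """Stand-in transform (NOT numpress): fixed-point integers, then first differences."""
    ints = [round(v * scale) for v in values]
    deltas = [ints[0]] + [b - a for a, b in zip(ints, ints[1:])]
    return struct.pack(f"<{len(deltas)}i", *deltas)


def encoded_length(payload, use_zlib):
    """Length of the base64 text that would end up inside <binary>."""
    if use_zlib:
        payload = zlib.compress(payload)
    return len(base64.b64encode(payload))


# Synthetic, evenly spaced m/z axis; real spectra are far less regular.
mz = [400.0 + 0.01 * i for i in range(1000)]
raw = struct.pack(f"<{len(mz)}d", *mz)
transformed = fixed_point_deltas(mz, 1e5)

sizes = {
    "raw": encoded_length(raw, use_zlib=False),
    "zlib": encoded_length(raw, use_zlib=True),
    "transform": encoded_length(transformed, use_zlib=False),
    "transform+zlib": encoded_length(transformed, use_zlib=True),
}
for name, size in sizes.items():
    print(f"{name:>16}: {size} characters")
```

The order of operations here is transform first, zlib second, base64 last; a reader has to undo them in the reverse order.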
From: Steffen N. <sne...@ip...> - 2013-10-23 19:21:13
|
Hi,

> A feasible solution is that files that use any new compression scheme
> are marked as mzML 1.2.

I am a bit hesitant to support bumping the schema version just for a new encoding of the base64 data:

If you have some raw -> mzML converter that has a command line switch for whether to num(com)press or not, would that writer need to support both XSD versions and decide between uncompressed mzML-1.1 and numpressed mzML-1.2 based on that command line switch? That doesn't make sense, and writer software would likely write mzML-1.2 whether compressed or not.

Even if the XSD is identical, old 1.1 software that checks the mzML version will fail, even if the mzML-1.2 files contain uncompressed data.

All sensible software should look at the cvParams describing the encoding in the <binary>. So why is it insufficient to just use the already existing MS-numpress terms in the CV?

http://www.ebi.ac.uk/ontology-lookup/browse.do?ontName=MS&termId=MS:1000572&termName=binary%20data%20compression%20type

    <cvParam cvRef="MS" accession="MS:1000523" name="64-bit float" />
    <cvParam cvRef="MS" accession="MS:1002312" name="MS-Numpress linear prediction compression" />
    <binary>...</binary>

> However, readers that haven't implemented support
> for this compression method will break.

Yes, and readers that have not implemented mzML-1.2 will break as well. A reader *must* look for compression cvParams before decoding, and if it encounters something unexpected (i.e. neither "no compression" nor "zlib"), it should throw some "unknown compression scheme" exception. Or did I miss something?

> However, it would be possible to allow for double compression, i.e.
> the use of both numpress and zlib on a binary array, and that would
> require a change in the mapping file.

Why does that require a change in the mapping? Even if we'd allow two compression schemes, just having two cvParams won't be specific enough:

    <cvParam cvRef="MS" accession="MS:..." name="zlib" />
    <cvParam cvRef="MS" accession="MS:..." name="MS-numpress" />

Which order do you use? First numpress, then zlib, or the other way around? (OK, that's obvious here...) But in general you would not know the order. Because we (hope we'd) only have a few "sanctioned" compression schemes and combinations, I'd suggest encoding that in the cvParam, e.g. either "MS-numpress-zlib" or "zlib-MS-numpress", just as an example.

Yours,
Steffen

--
IPB Halle                     AG Massenspektrometrie & Bioinformatik
Dr. Steffen Neumann           http://www.IPB-Halle.DE
Weinberg 3                    http://msbi.bic-gh.de
06120 Halle                   Tel. +49 (0) 345 5582 - 1470
                              +49 (0) 345 5582 - 0
sneumann(at)IPB-Halle.DE      Fax. +49 (0) 345 5582 - 1409 |
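The reader behaviour described above (inspect the compression cvParams before decoding, and raise an "unknown compression scheme" error for anything unexpected) might look roughly like this. It is a sketch under several assumptions: the accessions for "no compression" (MS:1000576) and "zlib compression" (MS:1000574) are quoted from memory of the PSI-MS CV and should be checked against the current CV, XML namespaces are ignored, only little-endian 64-bit floats are handled, and `decode_binary_array` and `UnknownCompressionError` are names invented here.

```python
import base64
import struct
import zlib
import xml.etree.ElementTree as ET

FLOAT64 = "MS:1000523"  # "64-bit float", as quoted in the thread

# Compression terms this hypothetical reader understands.
# Accessions assumed from the PSI-MS CV; verify before relying on them.
DECOMPRESSORS = {
    "MS:1000576": lambda data: data,  # no compression
    "MS:1000574": zlib.decompress,    # zlib compression
}


class UnknownCompressionError(ValueError):
    """Raised when no single supported compression term is present."""


def decode_binary_array(element):
    """Decode a <binaryDataArray>-like element into a list of floats."""
    accessions = [cv.get("accession") for cv in element.findall("cvParam")]
    supported = [a for a in accessions if a in DECOMPRESSORS]
    if len(supported) != 1:
        raise UnknownCompressionError(
            f"unknown compression scheme; cvParams seen: {accessions}"
        )
    if FLOAT64 not in accessions:
        raise ValueError("this sketch only handles 64-bit float arrays")
    raw = DECOMPRESSORS[supported[0]](base64.b64decode(element.findtext("binary")))
    return list(struct.unpack(f"<{len(raw) // 8}d", raw))
```

An array tagged only with MS:1002312 carries no term this reader knows, so it fails with a specific error instead of handing garbage to the caller. A reader with a hard-coded CV cannot tell which unfamiliar accessions are compression terms, which is why the check is phrased as "exactly one known compression term must be present".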
From: Matt C. <mat...@gm...> - 2013-10-24 14:38:25
|
Sorry, I haven't been checking this list lately...

I totally agree with Steffen (except MSNumPress is a compression of the binary data which then must be encoded in base64, which always bloats the data by ~33% no matter what compression you've done).

But there are still readers that don't handle zlib compressed mzML or mzXML, and the only way those readers should be properly dealing with that is through the existing CV mechanism. A bigger concern (because I think most readers fall into this category) is for readers that have an offline and/or hard-coded version of the CV (like pwiz) that has to be kept in sync. Old versions of pwiz won't be able to recognize the new CV terms. Looking up new unknown terms online is a long-standing missing feature in pwiz, but unfortunately it's still not likely to get added soon. I'm not sure how many of the other offline-CV readers have an online component yet but I suspect not many.

-Matt |
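The ~33% figure follows from how base64 works: every 3 input bytes become 4 output characters, with the last group padded. A small sketch of that arithmetic (the helper name `base64_length` is made up here):

```python
import base64


def base64_length(n_bytes):
    """Number of characters base64 produces for n_bytes of input (with padding)."""
    return 4 * ((n_bytes + 2) // 3)


for n in (3, 1000, 3000):
    actual = len(base64.b64encode(bytes(n)))
    print(f"{n} bytes -> {actual} characters (ratio {actual / n:.3f})")
```

The ratio tends to 4/3 regardless of what the bytes contain, so the overhead applies equally to uncompressed, zlib-compressed, or numpressed payloads.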
From: Fredrik L. <Fre...@im...> - 2013-10-24 15:02:24
|
Initially we also thought that it would be just fine to add the new Numpress terms to the CV and go. However, a concern that was raised is that reader software that used to be able to read all mzML 1.1 files suddenly cannot read these files anymore. Well, interpretation of other new CV terms could also be a problem for software that is supposed to be able to read mzML, but it is usually not that critical. So that's the argument for a minor schema change: causing schema-aware software to report a more specific error instead of just failing to read the binaries properly (old files should also validate OK with the new schema, so that there is only one schema to handle for reading software).

However, if everyone is fine with keeping the schema intact and just accepting the possibility of adding new compression methods to the CV, that could be just fine. An addendum to the mzML documentation about compression may still be useful, though.

Thanks,
Fredrik |
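The "more specific error" argument can be illustrated with a reader that checks the version attribute on the mzML root element before touching any binary data. This is only a sketch: the "1.2.0" version in the usage note is hypothetical (nothing of the kind has been agreed in this thread), files wrapped in an indexedmzML element are not handled, and `check_mzml_version` and `UnsupportedMzMLVersion` are names invented for the example.

```python
import xml.etree.ElementTree as ET

# Versions this hypothetical reader was written against.
SUPPORTED_VERSIONS = {"1.1.0"}


class UnsupportedMzMLVersion(ValueError):
    """Raised when the file declares an mzML version the reader does not know."""


def check_mzml_version(xml_text):
    """Return the declared version, or raise a specific error before any decoding."""
    root = ET.fromstring(xml_text)
    local_name = root.tag.rsplit("}", 1)[-1]  # drop the "{namespace}" prefix, if any
    if local_name != "mzML":
        raise ValueError(f"expected an <mzML> root element, found <{local_name}>")
    version = root.get("version")
    if version not in SUPPORTED_VERSIONS:
        raise UnsupportedMzMLVersion(
            f"mzML version {version!r} is not supported; "
            f"supported: {sorted(SUPPORTED_VERSIONS)}"
        )
    return version
```

With a check like this, a file declaring a hypothetical version "1.2.0" is rejected up front with a clear message. The flip side, raised earlier in the thread, is that it would be rejected even if its arrays were uncompressed.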
From: Eric D. <ede...@sy...> - 2013-10-25 17:13:57
|
Hi everyone, thank you for the discussion. We should take a poll at some point to gather as much input as possible. It is true that there is no schema change and it would not be technically necessary to call numpressed files version 1.2. However, it seems to me that it might be considered inappropriate by many to introduce and even promote a new compression scheme that older readers cannot know about or handle at all until they receive a significant update, and blithely retain the version number 1.1. At the time when we release this officially, there will be several implementations already, but there will also be a lot of software that does not support it. If we don't bump the version number, we would be in a situation where a new kind of mzML 1.1 would arrive in field and there would be lots of software that could not read *that kind of mzML 1.1*. It seems aesthetically preferable to me to have lots of software that cannot read *mzML 1.2*, and be in a position to encourage them to update to 1.2. It has been pointed out that several extant readers can't even handle 1.1's zlib compression. This is true, but this is an indication that they don't fully support the standard as it has been used for many years, and is a different problem. What we're doing here is explicitly introducing something new that hasn't been seen before and is not backwards compatible. Think about it and we should have a poll soon to see how everyone feels. I clearly lean toward the 1.2 side, but am happy to go the other way if that is the consensus. Regards, Eric -----Original Message----- From: Fredrik Levander [mailto:Fre...@im...] Sent: Thursday, October 24, 2013 8:02 AM To: Mass spectrometry standard development Subject: Re: [Psidev-ms-dev] PSI MS call Tuesday: compression Initially we also thought that it would be just fine to add the new Numpress terms to the CV and go. 
However, a concern that was raised is that a reader software that used to be able to read all mzML 1.1 files, suddenly cannot anymore with these files. Well, interpretation of other new CV terms could be a problem for mzML reading software that are supposed to be able to read mzML - but it is usually not that critical. So that's the argument for a minor schema change: causing schema aware software to report a more specific error instead of just failing to read the binaries properly (old files should also validate OK with the new schema so that there is only one schema to handle for reading software). However, if everyone is fine with keeping schema intact, and just accept the possibility to add new compression methods to the CV, that could be just fine. An addendum to the mzML documentation about compression may still be useful, though. Thanks, Fredrik On 2013-10-24 16:38, Matt Chambers wrote: > Sorry, I haven't been checking this list lately... > > I totally agree with Steffen (except MSNumPress is a compression of the > binary data which then must be encoded in base64, which always bloats > the data by ~33% no matter what compression you've done). > > But there are still readers that don't handle zlib compressed mzML or > mzXML, and the only way those readers should be properly dealing with > that is through the existing CV mechanism. A bigger concern (because I > think most readers fall into this category) is for readers that have an > offline and/or hard-coded version of the CV (like pwiz) that has to be > kept in sync. Old versions of pwiz won't be able to recognize the new CV > terms. Looking up new unknown terms online is a long-standing missing > feature in pwiz, but unfortunately it's still not likely to get added > soon. I'm not sure how many of the other offline-CV readers have an > online component yet but I suspect not many. 
>
> -Matt
>
>
> On 10/23/2013 2:21 PM, Steffen Neumann wrote:
>> Hi,
>>
>>> A feasible solution is that files that use any new compression scheme
>>> are marked as mzML 1.2.
>>>
>> I am a bit hesitant to support bumping the schema version
>> just for a new encoding of the base64 data:
>>
>> If you have some raw -> mzML converter that had a command line switch
>> whether to num(com)press or not, would that writer need to support
>> both XSD versions and decide between uncompressed mzML-1.1
>> and numpressed mzML-1.2 based on that command line switch ?
>> That doesn't make sense, and writer software would likely
>> write mzML-1.2 compressed or not.
>>
>> Even if the XSD is identical, old 1.1 software that checks
>> the mzML version will fail, even if the mzML-1.2 files
>> contain uncompressed data.
>>
>> All sensible software should look at the cvParams describing
>> the encoding in the <binary>. So please why is it insufficient to
>> just use the already existing MS-numpress terms in the CV ?
>>
>> http://www.ebi.ac.uk/ontology-lookup/browse.do?ontName=MS&termId=MS:1000572&termName=binary%20data%20compression%20type
>>
>> <cvParam cvRef="MS" accession="MS:1000523" name="64-bit float" />
>> <cvParam cvRef="MS" accession="MS:1002312" name="MS-Numpress linear prediction compression" />
>> <binary>...</binary>
>>
>>> However, readers that haven't implemented support
>>> for this compression method will break.
>> Yes, and readers that have not implemented mzML-1.2 will do as well.
>> A reader *must* look for compression cvParams before decoding,
>> and if they encounter something unexpected (i.e. not "no compression"
>> nor "zlib"), they should throw some "unknown compression scheme"
>> exception. Or did I miss something ?
>>
>>> However, it would be possible to allow for double compression, i.e.
>>> the use of both numpress and zlib on a binary array, and that would
>>> require a change in the mapping file.
>> Why does that require a change in the mapping ?
>> Even if we'd allow two compression schemes, just having two cvParams
>> won't be specific enough:
>>
>> <cvParam cvRef="MS" accession="MS:..." name="zlib" />
>> <cvParam cvRef="MS" accession="MS:..." name="MS-numpress" />
>>
>> Which order do you use ? First numpress, then zlib or the other way around ?
>> (ok, that's obvious here...) but otherwise you would not know the order.
>> Because we (hope we'd) only have few "sanctioned" compression schemes and combinations,
>> I'd suggest to encode that in the cvParam, e.g. either "MS-numpress-zlib"
>> or "zlib-MS-numpress" just as an example.
>> Yours,
>> Steffen
>>

_______________________________________________
Psidev-ms-dev mailing list
Psi...@li...
https://lists.sourceforge.net/lists/listinfo/psidev-ms-dev
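
The reader behaviour Steffen describes in the thread (inspect the compression cvParams before decoding, and raise an "unknown compression scheme" error for anything unexpected) could look roughly like the Python sketch below. It is only an illustration under stated assumptions, not how any existing reader such as pwiz works. The function name decode_binary_data_array and the exception class are hypothetical. The accessions MS:1000523 and MS:1002312 are quoted in the thread, while MS:1000574 ("zlib compression") and MS:1000576 ("no compression") are assumed here and should be checked against the current PSI-MS CV. The sketch also assumes an un-namespaced <binaryDataArray> fragment and 64-bit little-endian floats.

```python
import base64
import struct
import xml.etree.ElementTree as ET
import zlib

# MS:1000523 and MS:1002312 are quoted in the thread; the other accessions
# are assumptions to check against the current PSI-MS CV.
NO_COMPRESSION = "MS:1000576"   # assumed: "no compression"
ZLIB = "MS:1000574"             # assumed: "zlib compression"
FLOAT64 = "MS:1000523"          # "64-bit float"
SUPPORTED_COMPRESSION = {NO_COMPRESSION, ZLIB}


class UnknownCompressionError(ValueError):
    """Raised when no supported compression term can be identified."""


def decode_binary_data_array(element):
    """Decode one <binaryDataArray> element into a list of floats.

    Hypothetical helper: it refuses to decode unless exactly one
    compression term that it knows about is present.
    """
    params = {p.get("accession"): p.get("name") for p in element.findall("cvParam")}
    compression = SUPPORTED_COMPRESSION.intersection(params)
    if len(compression) != 1:
        # No recognized term (e.g. only an MS-Numpress term) or an ambiguous mix.
        others = sorted(
            a for a in params if a not in SUPPORTED_COMPRESSION and a != FLOAT64
        )
        raise UnknownCompressionError(
            "unknown compression scheme; unrecognized cvParams: %s" % others
        )
    if FLOAT64 not in params:
        raise ValueError("this sketch only handles 64-bit float arrays")

    # The payload is always base64 text; zlib is undone after base64 decoding.
    raw = base64.b64decode(element.findtext("binary") or "")
    if ZLIB in compression:
        raw = zlib.decompress(raw)
    return list(struct.unpack("<%dd" % (len(raw) // 8), raw))
```

Because the check only needs the terms the reader already supports, it also works for a reader with an offline or hard-coded CV: a numpressed array carries no term from the supported set, so it is rejected rather than mis-decoded. It does not catch a supported term combined with an unrecognized one, which is the double-compression case raised above; a single combined term, along the lines of Steffen's "MS-numpress-zlib" example, would be rejected by this check because it replaces the plain zlib term.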