From: Angel P. <an...@ma...> - 2007-10-05 15:01:38
|
Just finished going through the specification, which is great BTW. Just have a few notes/questions on the spec/schema as it stands. I'll also post these to the PSI site.

(1) sourceFileRef in multiple places

Why does this exist
run -> spectrumList -> spectrum [:sourceFileRef => anyURI ]
when there is this?
run -> sourceFileRefList -> sourceFileRef [:ref => anyURI ]

When would you use one over the other, or would you have to specify both? The spec should cover this a bit better.

(2) count attributes in list-like element types

I *really* don't like the count attribute in list types (e.g. instrumentList[@count]). I think they are not too informative and prone to error (just another condition to code and maintain).

-angel |
From: Matthew C. <mat...@va...> - 2007-10-05 16:06:46
|
Angel Pizarro wrote:
> (1) sourceFileRef in multiple places
>
> Why does this exist
> run -> spectrumList -> spectrum [:sourceFileRef => anyURI ]
> when there is this?
> run -> sourceFileRefList -> sourceFileRef [:ref => anyURI ]

I agree, I don't see a reason for the sourceFileRefList. Only a sourceFileList is needed.

> (2) count attributes in list like element types
>
> I *really* don't like the count attribute in list types (e.g.
> instrumentList[@count]). I think they are not too informative and
> prone to error (just another condition to code and maintain)

If you don't want to maintain the count attributes, ignore them. :) They are mainly useful for human consumption, or if you wanted to write a very fast (but bulky) parser with low error checking.

-Matt |
From: Angel P. <an...@ma...> - 2007-10-05 17:00:22
|
On 10/5/07, Matthew Chambers <mat...@va...> wrote:
> If you don't want to maintain the count attributes, ignore them. :)
> They are mainly useful for human consumption, or if you wanted to write
> a very (but bulky) fast parser with low error checking.

The count atts are required, so you can't just ignore them. Plus if you do, then you won't be playing nice with other tools out there that do use them. Meaning that:

(a) all write tools must encode counts properly
(b) all read tools must check the count attribute's correctness since (a) may not actually be true, thus completely defeating the point of having this attribute in the first place.

Simpler on everybody if we just get rid of it in these spots. BTW this is a different issue than the arrayLength attribute on the binary arrays, which I think have enough of a pay-off to justify their existence.

-angel

--
Angel Pizarro
Director, Bioinformatics Facility
Institute for Translational Medicine and Therapeutics
University of Pennsylvania
806 BRB II/III
421 Curie Blvd.
Philadelphia, PA 19104-6160
P: 215-573-3736
F: 215-573-9004 |
From: Matthew C. <mat...@va...> - 2007-10-05 17:06:26
|
Angel Pizarro wrote:
> The count atts are required, so you can't just ignore them. Plus if
> you do, then you won't be playing nice with other tools out there that
> do use them. Meaning that:

I meant ignore them while reading, which is entirely possible. Ignoring them while writing would not meet the spec.

> (a) all write tools must encode counts properly

Yes, and this should be very easy to do.

> (b) all read tools must check the count attribute's correctness since
> (a) may not actually be true, thus completely defeating the point of
> having this attribute in the first place.

I do not see how you come to this conclusion. Read tools do not need to check the correctness; they can choose to parse whatever is there. If a reader reads the count attribute for spectra and pre-allocates memory for <count> spectra objects, that's the reader's choice. Readers don't HAVE to do that, it's just there for convenience.

> Simpler on everybody if we just get rid of it in these spots.

I routinely ignore count attributes in my XML parsers. Like I said, it's mainly convenient for human readability. Otherwise, readers have to do a "find all" on the element type, which is also not very hard, but some might complain.

-Matt |
From: Mike C. <tu...@gm...> - 2007-10-05 18:17:35
|
On 10/5/07, Matthew Chambers <mat...@va...> wrote:
> Angel Pizarro wrote:
> > (b) all read tools must check the count attribute's correctness since
> > (a) may not actually be true, thus completely defeating the point of
> > having this attribute in the first place.
>
> I do not see how you come to this conclusion. Read tools do not need to
> check the correctness, they can choose to parse whatever is there.

I see what you're saying, but I'm very sympathetic to Angel's point as well. If nothing else, a well-written piece of software needs to emit a warning upon seeing an inconsistent count field. This means that it needs to calculate the correct value, which mostly eliminates the value of having it in the file in the first place.

Additionally, some pieces of software might want to process these files in a "dumb" mode. So, for example, if a program reads an mzML file with an invalid count and then just prints what it read, it will be generating invalid output. Similarly, it won't be possible to just drop some elements in a dumb way, as counts will need updating, which necessitates a slightly smarter tool.

An additional problem with counts is that each program has to decide upon a threshold of outlandishness. That is, if the count says "one trillion", do you trust it and allocate the memory, or balk? History suggests that counts that seem unreasonable now may become typical later, which means that you're buying a maintenance problem.

None of this necessarily outweighs the value of providing counts. I will never trust or use them in the programs that I write, though (and I wish I didn't have to generate them).

Mike |
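The consistency check Mike describes can be sketched in a few lines. This is only an illustration, not anything prescribed by the spec: the element names `spectrumList` and `spectrum` come from the thread, while the helper name `check_list_count`, the `id` attributes, and the namespace-free toy document are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def check_list_count(list_elem, child_tag):
    """Return (declared, actual, warning) for one list element.

    `warning` is None when the declared count matches what is present.
    Hypothetical helper; real mzML is namespaced, this toy input is not.
    """
    raw = list_elem.get("count")
    # The actual count has to be computed regardless, which is Mike's point.
    actual = len(list_elem.findall(child_tag))
    try:
        declared = int(raw)
    except (TypeError, ValueError):
        return None, actual, f"{list_elem.tag}: missing or non-integer count {raw!r}"
    if declared != actual:
        return declared, actual, f"{list_elem.tag}: count says {declared}, found {actual}"
    return declared, actual, None


doc = ET.fromstring(
    '<run><spectrumList count="3">'
    '<spectrum id="a"/><spectrum id="b"/>'
    "</spectrumList></run>"
)
declared, actual, warning = check_list_count(doc.find("spectrumList"), "spectrum")
print(warning)
```

Note that the warning can only be produced after the children have been counted, so a reader that validates the attribute gains nothing from it at allocation time.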
From: Matthew C. <mat...@va...> - 2007-10-05 18:32:49
|
Mike Coleman wrote:
> I see what you're saying, but I'm very sympathetic to Angel's point as
> well. If nothing else, a well-written piece of software needs to emit
> a warning upon seeing an inconsistent count field. This means that it
> needs to calculate the correct value, which mostly eliminates the
> value of having it in the file in the first place.

Steps to implement a reader dealing with invalid count attributes:
1) Ignore the count
2) Keep counters of how many list elements (whatever is being counted) were actually read in
3) If you write out mzML, use your counters' values (subtracting the # of elements filtered out and adding the # of elements added)
4) Voila, valid mzML by ignoring count attributes! :)

It's considerably more difficult to implement readers which work with references to previously parsed elements (i.e. ParamGroups and references to them). And that's NOT something that an implementation can ignore!

-Matt |
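Matt's four steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a reference implementation: `filter_spectra` is a hypothetical name, the `id` attribute on `spectrum` is assumed for the example, and the input is a namespace-free toy fragment rather than a real mzML file.

```python
import xml.etree.ElementTree as ET


def filter_spectra(xml_text, keep):
    """Drop spectra for which keep(elem) is false, then rewrite the count."""
    root = ET.fromstring(xml_text)
    spectrum_list = root.find("spectrumList")
    # Step 1: the declared count attribute is never read.
    # Step 2: count what is actually kept while filtering.
    kept = 0
    for spectrum in spectrum_list.findall("spectrum"):  # findall returns a list
        if keep(spectrum):
            kept += 1
        else:
            spectrum_list.remove(spectrum)
    # Step 3: write the counter's value, whatever the input claimed.
    spectrum_list.set("count", str(kept))
    return ET.tostring(root, encoding="unicode")


# The input count (99) is wrong on purpose; the output count is recomputed.
src = (
    '<run><spectrumList count="99">'
    '<spectrum id="s1"/><spectrum id="s2"/><spectrum id="s3"/>'
    "</spectrumList></run>"
)
out = filter_spectra(src, lambda s: s.get("id") != "s1")
print(out)
```

The output is valid with respect to the count even though the input was not, which is step 4.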
From: Eric D. <ede...@sy...> - 2007-10-08 18:00:24
|
Hi everyone, since the flurry is starting to subside a little, I've started trying to summarize the conversation thus far to converge on some consensus.

Regarding this count attribute issue, I tally:

- Angel discourages them
- Matt is neutral
- Mike does not want them
- silence from everyone else

We had included them at the recommendation from someone who was programming a reader in a language that requires explicit memory allocation. He felt it would be very helpful to have these so that such software could preallocate memory.

Obviously, such software would have to be careful about overruns and either generate an error when counts are wrong or gracefully adapt to the reality. If the code would have to gracefully adapt, then perhaps that's no harder than not knowing in the first place.

So let's hear it from anyone who will want to use count attributes when reading mzML. If no one wants them, we may as well drop them!

Thanks,
Eric |
From: Matthew C. <mat...@va...> - 2007-10-08 18:10:59
|
Cons: none (it can be harmlessly ignored regardless of whether it's accurate or not)

Pros:
* shows humans counts without a 'find all' program (especially useful for spectral counts)
* allows for implementing a simple preallocating reader with poor error handling (who are we to say that users shouldn't be able to do it?)

I'm interested to hear what Brian's found with his explorations in CV schema generation. :)

-Matt |
From: Brian P. <bri...@in...> - 2007-10-08 18:26:23
|
I'm strongly "pro" counts, both for human readability and for performance reasons. They can reduce heap thrash since you can preallocate. Yeah, you can't really trust them, but they *tend* to be right so it *tends* to be the right allocation, but one must program defensively all the same.

- Brian

>> I'm interested to hear what Brian's found with his explorations
>> in CV schema generation. :)

Results on the way! Batten down the hatches. |
From: Angel P. <an...@ma...> - 2007-10-08 19:14:52
|
On 10/8/07, Eric Deutsch <ede...@sy...> wrote:
> Regarding this count attribute issue, I tally:
>
> - Angel discourages them

"discourage" is a good word ;) From Matt and Brian's later replies, it seems that they would rather have them, and I really don't mind if the count att stays.

-angel

> - Matt is neutral
> - Mike does not want them
> - silence from everyone else |
From: Marc S. <st...@in...> - 2007-10-09 08:54:31
|
I would like the count attributes to stay, at least for the spectrum list and peak list. Knowing the number of elements can make a huge performance difference in some languages, e.g. C++.

- Marc |
From: David C. <dc...@ma...> - 2007-10-09 13:05:38
|
I'd like them to stay for the same reason. If the count is correct, then performance is improved slightly and there will be less memory fragmentation. If the count isn't correct, we won't report an error and it will just be less efficient.

David |
From: Mike C. <tu...@gm...> - 2007-10-09 16:21:07
|
I can see why having a 'count' might make it easier for novice programmers to *write* a processing program, but I cannot see why having a 'count' would make more than a negligible difference in performance, if even that. As a worst case, one could read the mzML file into memory, scan it once to calculate the count, and then proceed as before. The additional time required to do a sweep through RAM would be trivial.

Mike |
From: Chris A. <ch...@ma...> - 2007-10-09 17:25:36
|
Mike Coleman wrote:
> I can see why having a 'count' might make it easier for novice
> programmers to *write* a processing program, but I cannot see why
> having a 'count' would make more than a negligible difference in
> performance, if even that. As a worst case, one could read the mzML
> file into memory, scan it once to calculate the count, and then
> proceed as before. The additional time required to do a sweep through
> RAM would be trivial.

Isn't one of the features of mzML to store raw scan data? If so I imagine it wouldn't be long before users were generating multi-GB files (even possibly with just peak lists) that:

(i) Won't map into the 32-bit address space limits of the OS;

(ii) Or, if you're either using 64-bit or else mapping chunks, will hit i/o and paging issues, as the file will have to be read twice (once for the scan and again for the parser), unless you have a huge amount of RAM of course.

Not to mention that the source of the data might not support stream positioning anyway (e.g. a compressed stream), or might simply have been passed to your program/library as an open stream handle that you can't reopen, so you only have one shot.

Regards,
Chris |
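Chris's "one shot" case can be illustrated with a forward-only pass over a compressed stream. The sketch below is an assumption-laden toy: `count_spectra_single_pass` is a hypothetical name and the input is a tiny namespace-free fragment. It only shows that the declared count is visible before any child is parsed, while the true count is known only at the end of the pass.

```python
import gzip
import io
import xml.etree.ElementTree as ET


def count_spectra_single_pass(stream):
    """Read a stream once; return (declared count string, spectra seen)."""
    declared = None
    seen = 0
    for event, elem in ET.iterparse(stream, events=("start", "end")):
        if event == "start" and elem.tag == "spectrumList":
            # Attributes are already available on the "start" event.
            declared = elem.get("count")
        elif event == "end" and elem.tag == "spectrum":
            seen += 1
            elem.clear()  # release this spectrum's contents
    return declared, seen


# A gzip-compressed toy document whose declared count (2) is wrong.
payload = gzip.compress(
    b'<run><spectrumList count="2">'
    b"<spectrum/><spectrum/><spectrum/>"
    b"</spectrumList></run>"
)
with gzip.GzipFile(fileobj=io.BytesIO(payload)) as stream:
    print(count_spectra_single_pass(stream))
```

A reader in this position can use the attribute as an up-front hint, or ignore it and grow its containers as it goes; it cannot rewind to count first.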
From: Mike C. <tu...@gm...> - 2007-10-09 19:06:30
|
I knew I was going to regret that (over-)simplification. Okay, so in reality I would never actually read the file twice--that's just easier to describe than something more realistic. Just off the top of my head, I would do roughly what C++ std::vector's (or Python lists, etc.) do in terms of memory allocation. This lets you read in a single pass, and uses memory in proportion to what is actually needed. (There are ways to deal with fragmentation as well, but that's *way* outside the bounds of what the mzML spec should care about.)

Also worth noting, in my not-so-humble opinion: (a) for general computation, 32-bit hardware is dead, and (b) if you don't have enough RAM to comfortably hold single mzML files, you probably should just buy more.

Mike |
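To put rough numbers on the single-pass growth strategy Mike mentions, here is a small cost model. It is a sketch under explicit assumptions: a doubling policy and a starting capacity of 10 are chosen for illustration only (actual growth factors differ between std::vector implementations and Python lists), and `growth_cost` is a hypothetical name.

```python
def growth_cost(n_items, initial_capacity=10):
    """Model appending n_items to an array that doubles when full.

    Returns (reallocations, elements_copied). Assumed policy: capacity
    doubles on overflow; real containers may use other factors.
    """
    capacity = max(1, initial_capacity)
    size = reallocations = copied = 0
    for _ in range(n_items):
        if size == capacity:
            copied += size  # existing elements move to the new block
            capacity *= 2
            reallocations += 1
        size += 1
    return reallocations, copied


for n in (1_000, 1_000_000):
    reallocs, copied = growth_cost(n)
    print(n, reallocs, copied, copied < 2 * n)
```

Under this model the total number of element copies stays below twice the number of items, which is the usual amortized argument for why reading without a count is cheap in time. It says nothing about heap fragmentation.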
From: Brian P. <bri...@in...> - 2007-10-09 19:35:12
|
As to performance implications of heap fragmentation, have a look at http://www.microquill.com/ - they sell a nice heap replacement library that can have an impressive impact on program performance without any code changes just by managing the heap more intelligently (I've used it, it's for real). But if you can't have a clever heap manager then you have to be clever in how you manage the heap.

>> I would do roughly what C++ std::vector's (or Python lists, etc.) do

I expect you are referring to the way std::vector initially allocates room for, say, up to 10 items, then when that turns out to be not enough it reallocates for 20, then 40, 80, 160, ..., 655360, 1310720,... - but consider also std::vector's reserve() method, which is a great illustration of the usefulness of the count. It allows you to declare the *expected* final size of the collection without demanding it be the *actual* final size. It preallocates enough memory to accommodate the addition of up to n elements to the vector before any reallocation takes place, and heap fragmentation is thus avoided along with a great many copy constructor executions (which engender even more heapfrag, probably). If an n+1'th element is added, reallocation takes place and performance isn't what it could be, but the program still runs without error. So it's a risk-free and very simple way to use the count info.

If your collection class of choice doesn't have some means of exploiting a hint about the expected size of the collection, well, no harm done. Anyone who is not using robust collection classes and is thus susceptible to running off the end of an array allocated based on the declared count is working harder than they need to.

But Angel is right, it's fun to trade tips and tricks but we should just vote... I vote keep 'em.

- Brian |
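The "count as a hint, not a contract" idea Brian describes with reserve() can be sketched in Python as well. Python lists have no reserve(), so this analogue preallocates slots instead. The function name `collect_with_hint` and the `max_prealloc` cap are assumptions for the example; the cap stands in for the "threshold of outlandishness" raised earlier in the thread.

```python
def collect_with_hint(items, declared_count, max_prealloc=100_000):
    """Collect items, using declared_count only as a preallocation hint.

    A wrong, missing, or absurd count never causes an error; it only
    changes how much is allocated up front.
    """
    valid = isinstance(declared_count, int) and declared_count > 0
    hint = min(declared_count, max_prealloc) if valid else 0
    buffer = [None] * hint  # preallocate, loosely like vector::reserve
    used = 0
    for item in items:
        if used < len(buffer):
            buffer[used] = item  # fits in the preallocated slots
        else:
            buffer.append(item)  # count was too low: grow as usual
        used += 1
    del buffer[used:]  # count was too high: drop the unused slots
    return buffer


print(collect_with_hint("abc", 3))        # count is right
print(collect_with_hint("abc", 1))        # count is too low
print(collect_with_hint("abc", 10**12))   # count is absurd, capped
```

All three calls yield the same list; only the allocation pattern differs, which matches the point that a defensive reader loses nothing when the count is wrong.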
From: Eric D. <ede...@sy...> - 2007-10-09 20:01:56
|
Splendid, we appear to be reaching a conclusion, I tally: - Brian votes to keep - Angel votes to keep - Marc votes to keep - David votes to keep - Eric votes to keep - Matt is neutral - ChrisA is neutral - Mike does not want them - everyone else abstains The ayes have it. The schema stays as is wrt count attributes. Thank you! Eric > -----Original Message----- > From: psi...@li... [mailto:psidev-ms-dev- > bo...@li...] On Behalf Of Brian Pratt > Sent: Tuesday, October 09, 2007 12:34 PM > To: 'Mass spectrometry standard development' > Subject: Re: [Psidev-ms-dev] mzML 0.99 remarks >=20 > As to performance implications of heap fragmentation, have a look at > http://www.microquill.com/ - they sell a nice heap replacement library > that > can have an impressive impact on program performance without any code > changes just by managing the heap more intelligently (I've used it, its > for > real). But if you can't have a clever heap manager then you have to be > clever in how you manage the heap. >=20 > >> I would do roughly what C++ std::vector's (or Python lists, etc.) do >=20 > I expect you are referring to the way std::vector initially allocates room > for, say, up to 10 items, then when that turns out to be not enough they > reallocate for 20, then 40, 80, 160, ..., 655360, 1310720,... - but > consider > also std::vector's reserve() method, which is a great illustration of the > usefulness of the count. It allows you to declare the *expected* final > size > of the collection without demanding it be the *actual* final size. It > preallocates enough memory to accommodate the addition of up to n elements > to the vector before any reallocation takes place, and heap fragmentation > is > thus avoided along with a great many copy constructor executions (which > engender even more heapfrag, probably). If an n+1'th element is added, > reallocation takes place and performance isn't what it could be, but the > program still runs without error. 
So it's a risk-free and very simple way > to use the count info. >=20 > If your collection class of choice doesn't have some means of exploiting a > hint about the expected size of the collection, well, no harm done. > Anyone > who is not using robust collection classes and is thus susceptible to > running off the end of an array allocated based on the declared count is > working harder than they need to. >=20 > But Angel is right, it's fun to trade tips and tricks but we should just > vote... I vote keep 'em. >=20 > - Brian >=20 > -----Original Message----- > From: psi...@li... > [mailto:psi...@li...] On Behalf Of Mike > Coleman > Sent: Tuesday, October 09, 2007 12:07 PM > To: Mass spectrometry standard development > Subject: Re: [Psidev-ms-dev] mzML 0.99 remarks >=20 > I knew I was going to regret that (over-)simplification. Okay, so in > reality I would never actually read the file twice--that's just easier > to describe than something more realistic. Just off the top of my > head, I would do roughly what C++ std::vector's (or Python lists, > etc.) do in terms of memory allocation. This lets you read in a > single pass, and uses memory in proportion to what is actually needed. > (There are ways to deal with fragmentation as well, but that's *way* > outside the bounds of what the mzML spec should care about.) >=20 > Also worth noting, in my not-so-humble opinion: (a) for general > computation, 32-bit hardware is dead, and (b) if you don't have enough > RAM to comfortably hold single mzML files, you probably should just > buy more. >=20 > Mike >=20 >=20 > On 10/9/07, Chris Allen <ch...@ma...> wrote: > > > > Mike Coleman wrote: > > > I can see why having a 'count' might make it easier for novice > > > programmers to *write* a processing program, but I cannot see why > > > having a 'count' would make more than a negligible difference in > > > performance, if even that. 
As a worst case, one could read the mzML > > > file into memory, scan it once to calculate the count, and then > > > proceed as before. The additional time required to do a sweep through > > > RAM would be trivial. > > > > Isn't one of the features of mzML to store raw scan data? If so I > > imagine it wouldn't be long before users were generating multi-GB files > > (even possibly with just peak lists) that: > > > > (i) Won't map into the 32bit address space limits of the OS; > > > > (ii) Or if you're either using 64bit or else mapping chunks, you'll hit > > i/o and paging issues as the file will have to be read twice (once for > > the scan and again for the parser) unless you have a huge amount of RAM > > of course. > > > > Not to mention that the source of the data might not support stream > > positioning anyway (eg. compressed stream) or which was simply passed as > > an open stream handle to your program/library and you can't reopen it so > > you only have one shot. > > > > Regards, > > Chris |
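The "count as a hint" idea discussed above could be sketched roughly as follows. This is only an illustration, not code from any mzML library: SpectrumHeader, SpectrumCollector, kMaxTrustedReserve and demoCountMismatch are hypothetical names, and the cap on the reserve size is an assumption added here so that a wrong count cannot trigger a huge allocation. The declared count is passed to reserve() and never used as an array bound, so a wrong value costs at most some wasted or extra allocation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical in-memory record for one <spectrum>; not part of any mzML library.
struct SpectrumHeader {
    std::string id;
};

// Assumed upper bound on how much is pre-allocated on the strength of an
// untrusted count attribute.
constexpr std::size_t kMaxTrustedReserve = 1u << 20;

// Collects spectra in a single pass. declaredCount is the value read from the
// list's count attribute; it is an allocation hint only.
class SpectrumCollector {
public:
    explicit SpectrumCollector(std::size_t declaredCount)
        : declared_(declaredCount) {
        headers_.reserve(std::min(declaredCount, kMaxTrustedReserve));
    }

    // push_back grows the vector as needed, so an undercount is harmless.
    void add(SpectrumHeader h) { headers_.push_back(std::move(h)); }

    // Lets a validating reader report a mismatch after parsing.
    bool countMatches() const { return headers_.size() == declared_; }

    std::size_t size() const { return headers_.size(); }
    std::size_t capacity() const { return headers_.capacity(); }

private:
    std::size_t declared_;
    std::vector<SpectrumHeader> headers_;
};

// Usage sketch: a file that declares 3 spectra but actually contains 4.
void demoCountMismatch() {
    SpectrumCollector c(3);
    for (int i = 0; i < 4; ++i) {
        c.add(SpectrumHeader{"scan" + std::to_string(i)});
    }
    assert(c.size() == 4);      // nothing ran off the end of anything
    assert(!c.countMatches());  // the wrong count is detectable afterwards
}
```

A reader that does not care about the count can construct the collector with 0 and get the usual std::vector growth behaviour described in the thread.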
From: Brian P. <bri...@in...> - 2007-10-09 16:41:39
|
Heap fragmentation has a performance cost that persists past the initial allocation(s), since it affects further allocations as well. If it can be avoided with a relatively simple mechanism like this, that's a good thing. I started coding in 1977, FWIW. Long enough to learn to prefer the simple solution over the one that requires a gestalt... To be fair, having done this stuff for a long time isn't really a predictor of me being any good at it, but I get by OK. - Brian -----Original Message----- From: psi...@li... [mailto:psi...@li...] On Behalf Of Mike Coleman Sent: Tuesday, October 09, 2007 9:21 AM To: Mass spectrometry standard development Subject: Re: [Psidev-ms-dev] mzML 0.99 remarks I can see why having a 'count' might make it easier for novice programmers to *write* a processing program, but I cannot see why having a 'count' would make more than a negligible difference in performance, if even that. As a worst case, one could read the mzML file into memory, scan it once to calculate the count, and then proceed as before. The additional time required to do a sweep through RAM would be trivial. Mike On 10/9/07, Marc Sturm <st...@in...> wrote: > I would like the count attributes to stay, at least for the spectrum > list and peak list. > Knowing the number of elements can make a huge performance difference in > some languages e.g. C++. |
From: Mike C. <tu...@gm...> - 2007-10-09 18:53:23
|
On 10/9/07, Brian Pratt <bri...@in...> wrote: > Heap fragmentation has a performance cost that persists past the initial > allocation(s), since it affects further allocations as well. If it can be > avoided with a relatively simple mechanism like this, that's a good thing. > > I started coding in 1977, FWIW. Long enough to learn to prefer the simple > solution over the one that requires a gestalt... I agree that there would be an effect, but my guess is that it would be minimal in real-world situations. I also appreciate the value of simplicity. The question here, in my mind, is figuring out what kinds of simplicity are best and figuring out how to trade them off. If you accept my premise that the 'count' value in the input cannot be fully trusted, then working out the cases and producing the value in the output seems more complex than just counting them as they come in. (This is a pretty minor consideration in the greater scheme of things, though.) > To be fair, having done this stuff for a long time isn't really a predictor > of me being any good at it, but I get by OK. If you had asked me at any point in my career when I had achieved basic competence as a programmer, I would have replied "about four or five years ago". So, in retrospect, my total years of "less-than-competence" are increasing as time goes by... :-) Mike |
From: Matthew C. <mat...@va...> - 2007-10-09 18:45:40
|
We are a bit off topic but this is interesting. :) To really assess the performance issues here you have to dig deeper than just heap fragmentation though. Assuming a list to store the SpectrumHeaders and vectors to store m/z values and intensities, and without preallocation based on counts, because of the tree-like nature of mzML, you'd end up with a memory footprint like: Spectrum1Header Spectrum1Mz1...P Spectrum1Inten1...P Spectrum2Header Spectrum2Mz1...P Spectrum2Inten1...P ... SpectrumNHeader SpectrumNMz1...P SpectrumNInten1...P If you preallocated the SpectrumHeaders in the list based on the count attribute, you'd instead get a footprint like: Spectrum1Header Spectrum2Header ... SpectrumNHeader Spectrum1Mz1...P Spectrum1Inten1...P ... SpectrumNMz1...P SpectrumNInten1...P So you're going to have a tradeoff of fragmentation either way. The fragmentation in the first case would be worse for quick sequential access to each SpectrumHeader, but better for accessing the peaks of a particular spectrum. The fragmentation in the second case would be better for quick sequential access to each SpectrumHeader, but worse for accessing the peaks of a particular spectrum. Access to the peaks could be further improved by storing the Mz and Inten values together (i.e. in a struct { float mz, inten; } ). This is all incredibly superfluous though and I doubt this fragmentation has an appreciable performance impact on data with any kind of density to it. So if you needed extremely responsive performance on very sparse spectra, you might think about this stuff, but most of us are far more limited by the sheer number of peaks. And if extreme responsiveness is your goal, no conceivable XML format is going to help you! -Matt Brian Pratt wrote: > Heap fragmentation has a performance cost that persists past the initial > allocation(s), since it affects further allocations as well. If it can be > avoided with a relatively simple mechanism like this, that's a good thing.
> > I started coding in 1977, FWIW. Long enough to learn to prefer the simple > solution over the one that requires a gestalt... > > To be fair, having done this stuff for a long time isn't really a predictor > of me being any good at it, but I get by OK. > > - Brian > > > > -----Original Message----- > From: psi...@li... > [mailto:psi...@li...] On Behalf Of Mike > Coleman > Sent: Tuesday, October 09, 2007 9:21 AM > To: Mass spectrometry standard development > Subject: Re: [Psidev-ms-dev] mzML 0.99 remarks > > I can see why having a 'count' might make it easier for novice > programmers to *write* a processing program, but I cannot see why > having a 'count' would make more than a negligible difference in > performance, if even that. As a worst case, one could read the mzML > file into memory, scan it once to calculate the count, and then > proceed as before. The additional time required to do a sweep through > RAM would be trivial. > > Mike > > > |
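The suggestion above of keeping each m/z value next to its intensity (a struct { float mz, inten; }) might look something like the sketch below. It assumes the two binary arrays have already been decoded into parallel std::vector<float> objects; the names Peak and interleave are hypothetical and only illustrate the layout, they are not taken from any existing parser.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One peak with m/z and intensity adjacent in memory, following the
// struct { float mz, inten; } idea from the message above.
struct Peak {
    float mz;
    float inten;
};

// Hypothetical helper: zip two decoded parallel arrays into one contiguous
// peak list, so that reading a peak touches a single memory location.
std::vector<Peak> interleave(const std::vector<float>& mz,
                             const std::vector<float>& inten) {
    // Precondition: both arrays describe the same peaks.
    assert(mz.size() == inten.size());

    std::vector<Peak> peaks;
    peaks.reserve(mz.size());  // exact size is known here, so one allocation
    for (std::size_t i = 0; i < mz.size(); ++i) {
        peaks.push_back(Peak{mz[i], inten[i]});
    }
    return peaks;
}
```

Whether this layout pays off depends on the access pattern; as the message notes, for dense data the sheer number of peaks is likely to dominate.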
From: Angel P. <an...@ma...> - 2007-10-09 19:00:58
|
Hi all, I was never arguing against counts for the spectra, only *maybe* against annotations, and it seems that more people than not want them in, so I say keep 'em. In the interest of not diverting effort from more important issues, can we just take a vote and leave it at that? my vote: keep counts -angel On 10/9/07, Matthew Chambers <mat...@va...> wrote: > > We are a bit off topic but this is interesting. :) To really assess the > performance issues here you have to dig deeper than just heap > fragmentation though. Assuming a list to store the SpectrumHeaders and > vectors to store m/z values and intensities, and without preallocation based on > counts, because of the tree-like nature of mzML, you'd end up with a > memory footprint like: > Spectrum1Header Spectrum1Mz1...P Spectrum1Inten1...P Spectrum2Header > Spectrum2Mz1...P Spectrum2Inten1...P ... SpectrumNHeader > SpectrumNMz1...P SpectrumNInten1...P > > If you preallocated the SpectrumHeaders in the list based on the count > attribute, you'd instead get a footprint like: > Spectrum1Header Spectrum2Header ... SpectrumNHeader Spectrum1Mz1...P > Spectrum1Inten1...P ... SpectrumNMz1...P SpectrumNInten1...P > > So you're going to have a tradeoff of fragmentation either way. The > fragmentation in the first case would be worse for quick sequential > access to each SpectrumHeader, but better for accessing the peaks of a > particular spectrum. The fragmentation in the second case would be > better for quick sequential access to each SpectrumHeader, but worse for > accessing the peaks of a particular spectrum. Access to the peaks could > be further improved by storing the Mz and Inten values together (i.e. in > a struct { float mz, inten; } ). This is all incredibly superfluous > though and I doubt this fragmentation has an appreciable performance > impact on data with any kind of density to it.
So if you needed > extremely responsive performance on very sparse spectra, you might think > about this stuff, but most of us are far more limited by the sheer > number of peaks. And if extreme responsiveness is your goal, no > conceivable XML format is going to help you! > > -Matt > > Brian Pratt wrote: > > Heap fragmentation has a performance cost that persists past the initial > > allocation(s), since it affects further allocations as well. If it can > be > > avoided with a relatively simple mechanism like this, that's a good > thing. > > > > I started coding in 1977, FWIW. Long enough to learn to prefer the > simple > > solution over the one that requires a gestalt... > > > > To be fair, having done this stuff for a long time isn't really a > predictor > > of me being any good at it, but I get by OK. > > > > - Brian > > > > > > > > -----Original Message----- > > From: psi...@li... > > [mailto:psi...@li...] On Behalf Of Mike > > Coleman > > Sent: Tuesday, October 09, 2007 9:21 AM > > To: Mass spectrometry standard development > > Subject: Re: [Psidev-ms-dev] mzML 0.99 remarks > > > > I can see why having a 'count' might make it easier for novice > > programmers to *write* a processing program, but I cannot see why > > having a 'count' would make more than a negligible difference in > > performance, if even that. As a worst case, one could read the mzML > > file into memory, scan it once to calculate the count, and then > > proceed as before. The additional time required to do a sweep through > > RAM would be trivial. > > > > Mike > -- Angel Pizarro Director, Bioinformatics Facility Institute for Translational Medicine and Therapeutics University of Pennsylvania 806 BRB II/III 421 Curie Blvd. Philadelphia, PA 19104-6160 P: 215-573-3736 F: 215-573-9004 |
From: Angel P. <an...@ma...> - 2007-10-05 17:14:29
|
On 10/5/07, Matthew Chambers <mat...@va...> wrote: > > Angel Pizarro wrote: > > Just finished going through the specification, which is great BTW. > > Just have a few notes/questions on the spec/schema as it stands. I'll > > also post it these to the PSI site. > > > > (1) sourceFileRef in multiple places > > > > Why does this exist > > run -> spectrumList -> spectrum [:sourceFileRef => anyURI ] > > when there is this? > > run -> sourceFileRefList -> sourceFileRef [:ref => anyURI ] > I agree, I don't see a reason for the sourceFileRefList. Only a > sourceFileList is needed. Geez, I just realized with Matt's response what these attrs were for! OK, that means that the spec needs to be augmented a bit with the following information: mzML -> [fileDescription -> sourceFileList -> sourceFile ]: all these tags in brackets need better documentation. mzML -> run -> sourceFileRefList -> sourceFileRef should probably go away: there can only be one run present in an mzML file, so what is the point of referencing more files in mzML -> fileDescription -> sourceFileList than were output to the run? mzML -> run -> spectrumList -> spectrum -> sourceFileRef should be documented to say that this reference is an internal pointer to the particular sourceFile in the whole document's sourceFileList that gave rise to this particular spectrum. If a sourceFile contains more than one spectrum, then the scanNumber attribute serves to disambiguate it from its siblings. -angel |
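The reading of sourceFileRef proposed above (an internal pointer from a spectrum into the document-wide sourceFileList, with scanNumber telling apart spectra that share a file) could be modelled roughly as below. This is a sketch of that interpretation only, not of the schema itself: the struct fields, the use of a std::map keyed by id, and the sample values "sf1" and "run01.raw" are all assumptions made for illustration.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for fileDescription -> sourceFileList -> sourceFile.
struct SourceFile {
    std::string id;
    std::string location;
};

// Hypothetical stand-in for run -> spectrumList -> spectrum.
struct Spectrum {
    std::string sourceFileRef;  // internal pointer to a SourceFile id
    int scanNumber;             // disambiguates spectra from the same file
};

// The document-wide list, indexed by id for lookup.
using SourceFileList = std::map<std::string, SourceFile>;

// Returns the file that gave rise to the spectrum, or nullptr when the
// reference does not match any declared sourceFile.
const SourceFile* resolveSourceFile(const SourceFileList& files,
                                    const Spectrum& s) {
    auto it = files.find(s.sourceFileRef);
    return it == files.end() ? nullptr : &it->second;
}

// Usage sketch with made-up ids and file names.
void demoResolve() {
    SourceFileList files;
    files["sf1"] = SourceFile{"sf1", "run01.raw"};

    Spectrum a{"sf1", 1};
    Spectrum b{"sf1", 2};
    Spectrum dangling{"sf9", 1};

    // Two spectra from the same file resolve to the same entry...
    assert(resolveSourceFile(files, a) == resolveSourceFile(files, b));
    // ...and are told apart by scanNumber.
    assert(a.scanNumber != b.scanNumber);
    // A reference to an undeclared file is detectable.
    assert(resolveSourceFile(files, dangling) == nullptr);
}
```

Under this reading a separate per-run sourceFileRefList adds nothing, since every file a run depends on is already reachable through its spectra.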