From: Eric D. <ede...@sy...> - 2007-02-06 21:14:24
Hi Fredrik,

Thank you for your input.

1) Regarding "count", this is not a settled topic. There are some who would leave them out and some who like them in. The rationale is that for parsers in languages that would like to pre-allocate arrays to hold the information, it is very handy to have a count. Obviously an incorrect count does no one any good. I believe the plan is to write a validator that will validate several items beyond schema compliance, including whether important ontology terms are present and whether the ontology terms used are valid. Validating that the count attributes are accurate should be included.

2) Your input regarding indexes is very valuable. We had decided to maintain the same indexing scheme as used by mzXML for simplicity of porting software and since it seemed to work for everyone who used it. However, the points you bring up are interesting. I was not aware of the Windows OS issues you bring up. I personally find that having separate indexes would be less tidy than keeping everything in a single file. Indexing has always been a somewhat contentious issue, but it is worth discussing.

Regards,
Eric

> -----Original Message-----
> From: Fredrik Levander [mailto:Fre...@el...]
> Sent: Tuesday, February 06, 2007 7:20 AM
> To: PSI MS Dev
> Cc: Eric Deutsch
> Subject: Comments on dataXML0.9
>
> Hello,
>
> After a quick inspection of dataXML0.9, the first impression is that it
> looks very promising and that all people working with it have done some
> great work in the fusing of the mz's.
>
> I've got some small specific comments which I guess you've already
> discussed, but anyway:
>
> 1) The attribute "count" which can be found in some places
> (softwareList, spectrumList etc) could be more of an obstacle than a
> help. There is no easy way to validate that this number corresponds with
> the actual number of list elements in the XML world. Such a validation
> would require specific validators for each programming language.
> In cases where the count attribute is not equal to the number of elements
> in the list, there could be different parsing results depending on whether
> the implementation is using the 'count' or standard parsing, the latter
> ignoring the count attribute. Actually, the example file:
> http://db.systemsbiology.net/projects/PSI/dataXML/tiny1.dataXML0.9.xml
> is an example where the softwareList count="2", but the actual number of
> elements is 3.
> I would suggest that either the 'count's are omitted, or PSIDEV should
> at some time provide validators which verify that list lengths are equal to
> the count attribute. Another option is that the attribute is documented
> only to be used for visual inspection of files, and that the actual
> number of list elements can differ. Are any of the current
> mzData-parsers using the 'count's anyway?
>
> 2) The indexing extension of dataXML. It is evident that such an
> indexing is useful for fast file access, and it should definitely be
> part of the standard. However, if I understand the schema correctly (no
> sample file yet), an indexed file would mean that the dataXML is
> encapsulated within the <indexedDataXML>, with the indexing information at
> the end of the file. Why not use a separate file for the indexing, which
> references the dataXML file as a URI? I think that would make for
> faster data access with the indexes in the beginning of the file, even
> if the 'indexOffset' should allow for quick access to the index. A small
> consideration is that the offset / indexes would differ depending on whether
> the file is opened in binary or text mode, at least for large files on a
> Windows system. I know that it is working for RAP and mzXML, but for new
> implementations which use other libraries and file readers there may be
> problems. Anyway, it should be made clear that indexes (offsets) are
> for binary file reading (or text if that is the case).
> The fileCheckSum would also become clearer with two separate files.
> It is quite complex to have the fileCheckSum of the file contained within
> the file itself, since the checksum is affected by the writing of the
> actual checksum ... If the file checksum is contained in a separate
> file it is clear that the checksum is the checksum of the actual
> dataXML, excluding the index file.
> On the other hand it could be handier to have just one file to work
> with, but is it really such an advantage? In cases where the index file
> is lost it would be easy to generate a new index file with a specific
> application for generation of dataXML index files anyway.
>
> Regards
>
> Fredrik Levander
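
The count check discussed in point 1 of both messages can be sketched roughly as follows. This is only an illustration, not the planned PSI validator. The function name `find_count_mismatches` and the sample XML are hypothetical, and the sketch assumes that any element carrying a `count` attribute is a list whose direct child elements are the items being counted.

```python
import xml.etree.ElementTree as ET


def find_count_mismatches(xml_text):
    """Return (tag, declared, actual) for each element whose 'count'
    attribute disagrees with its number of direct child elements."""
    root = ET.fromstring(xml_text)
    mismatches = []
    for elem in root.iter():
        declared = elem.get("count")
        if declared is None:
            continue  # element has no count attribute, nothing to check
        actual = len(elem)  # number of direct child elements
        try:
            declared_n = int(declared)
        except ValueError:
            declared_n = None  # a non-numeric count is reported as a mismatch
        if declared_n != actual:
            mismatches.append((elem.tag, declared, actual))
    return mismatches


# Hypothetical fragment for illustration; not copied from the real example file.
SAMPLE = """
<dataXML>
  <softwareList count="2">
    <software id="a"/>
    <software id="b"/>
    <software id="c"/>
  </softwareList>
  <spectrumList count="1">
    <spectrum id="s1"/>
  </spectrumList>
</dataXML>
"""

if __name__ == "__main__":
    for tag, declared, actual in find_count_mismatches(SAMPLE):
        print(f"{tag}: count={declared!r}, actual={actual}")
```

Run on the hypothetical sample, this reports only the softwareList element, which declares 2 entries but contains 3. A check like this is small in any one language, but as the original message notes, it sits outside what an XML schema alone can enforce.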