[Psidev-ms-dev] Comments on dataXML0.9

Hello,

After a quick inspection of dataXML0.9, my first impression is that it
looks very promising, and that everyone working on it has done some
great work in fusing the mz's.

I've got some small specific comments which I guess you've already 
discussed, but anyway:

1) The attribute "count", which can be found in some places
(softwareList, spectrumList etc.), could be more of an obstacle than a
help. There is no easy way to validate, within the XML world itself,
that this number corresponds to the actual number of list elements. Such
a validation would require specific validators for each programming
language. In cases where the count attribute is not equal to the number
of elements in the list, parsing results could differ depending on
whether the implementation relies on 'count' or on standard parsing, the
latter ignoring the count attribute. Actually, the example file:
http://db.systemsbiology.net/projects/PSI/dataXML/tiny1.dataXML0.9.xml
is a case in point: softwareList has count="2", but the actual number of
elements is 3.
I would suggest that either the 'count's are omitted, or PSIDEV should
at some point provide validators which verify that the list lengths
equal the count attributes. Another option is to document the attribute
as intended only for visual inspection of files, so that the actual
number of list elements may differ. Are any of the current
mzData parsers using the 'count's anyway?
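
To illustrate what such a validator would have to do, here is a minimal
Python sketch. It assumes that 'count' is meant to equal the number of
direct child elements; the root element and the child element names in
the sample are made up for illustration and are not taken from the
schema.

```python
import xml.etree.ElementTree as ET


def find_count_mismatches(xml_text):
    """Return (tag, declared, actual) for every element whose 'count'
    attribute differs from its number of direct child elements."""
    mismatches = []
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        declared = elem.get("count")
        if declared is None:
            continue  # element carries no count attribute
        actual = len(elem)  # number of direct child elements
        if int(declared) != actual:
            mismatches.append((elem.tag, int(declared), actual))
    return mismatches


# Hypothetical sample mirroring the mismatch described above
# (count="2" declared, three elements present).
sample = """<dataXML>
  <softwareList count="2">
    <software id="a"/>
    <software id="b"/>
    <software id="c"/>
  </softwareList>
  <spectrumList count="1">
    <spectrum id="s1"/>
  </spectrumList>
</dataXML>"""

print(find_count_mismatches(sample))
```

A parser that trusts 'count' would read two software elements here, while
one that simply iterates over the children would read three.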

2) The indexing extension of dataXML. It is evident that such
indexing is useful for fast file access, and it should definitely be
part of the standard. However, if I understand the schema correctly (no
sample file yet), an indexed file would mean that the dataXML is
encapsulated within <indexedDataXML>, with the indexing information at
the end of the file. Why not use a separate file for the index, which
references the dataXML file by a URI? I think that would make for
faster data access, with the index at the beginning of a file, even
though the 'indexOffset' should allow quick access to the index. A small
consideration is that the offsets would differ depending on whether
the file is opened in binary or text mode, at least for large files on a
Windows system. I know that it is working for RAP and mzXML, but new
implementations which use other libraries and file readers may run into
problems. In any case, it should be made clear that the indexes (offsets)
are for binary file reading (or text, if that is the case).
The fileCheckSum would also become clearer with two separate files. It
is quite complex to have the fileCheckSum of the file contained within
the file itself, since the checksum is affected by the writing of the
checksum itself... If the checksum is kept in a separate file, it is
clear that it is the checksum of the actual dataXML, excluding the
index file.
On the other hand, it could be handier to have just one file to work
with, but is that really such an advantage? If the index file is lost,
it would be easy to generate a new one with a dedicated application for
generating dataXML index files anyway.
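
As a rough sketch of the separate-index idea, the Python below computes
byte offsets and a checksum over the raw bytes of a dataXML document.
The function name, the keys of the returned record, the choice of SHA-1
and the sample document are all assumptions made for illustration;
nothing here is taken from the schema.

```python
import hashlib
import re


def build_index(data, source_uri, tag=b"spectrum"):
    """Build a stand-alone index record for a dataXML document.

    'data' is the raw content of the file as bytes, i.e. as read in
    binary mode, so the offsets are byte positions. The checksum covers
    the data file only, because the index is stored elsewhere.
    """
    # Match "<spectrum" followed by whitespace, ">" or "/", so that
    # "<spectrumList" is not mistaken for a spectrum element.
    pattern = re.compile(b"<" + re.escape(tag) + rb"[\s>/]")
    return {
        "source": source_uri,  # hypothetical key names
        "sha1": hashlib.sha1(data).hexdigest(),
        "offsets": [m.start() for m in pattern.finditer(data)],
    }


# The same hypothetical document with Unix and with Windows line endings.
doc_lf = (b'<dataXML>\n<spectrumList>\n'
          b'<spectrum id="1"/>\n<spectrum id="2"/>\n'
          b'</spectrumList>\n</dataXML>\n')
doc_crlf = doc_lf.replace(b"\n", b"\r\n")

index_lf = build_index(doc_lf, "file:///tiny.xml")
index_crlf = build_index(doc_crlf, "file:///tiny.xml")

print(index_lf["offsets"])
print(index_crlf["offsets"])
```

The two offset lists differ because every CR adds one byte. A reader
that opens the file in text mode and lets its library translate line
endings would count positions differently again, which is why the
standard should say explicitly that offsets are byte offsets.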

Regards

Fredrik Levander