From: Fredrik L. <Fre...@el...> - 2007-02-06 15:22:02
Hello,

After a quick inspection of dataXML0.9, my first impression is that it looks very promising, and that everyone working on it has done great work in fusing the mz's. I have some small, specific comments which I guess you have already discussed, but anyway:

1) The attribute "count", which appears in several places (softwareList, spectrumList etc.), could be more of an obstacle than a help. Within the XML world there is no easy way to validate that this number corresponds to the actual number of list elements. Such a validation would require specific validators for each programming language. In cases where the count attribute is not equal to the number of elements in the list, parsing results could differ depending on whether the implementation relies on 'count' or on standard parsing, the latter ignoring the count attribute. In fact, the example file

http://db.systemsbiology.net/projects/PSI/dataXML/tiny1.dataXML0.9.xml

is such a case: the softwareList has count="2", but the actual number of elements is 3.

I would suggest that either the 'count' attributes are omitted, or that PSIDEV at some point provides validators which verify that list lengths equal the count attribute. Another option is to document that the attribute is only meant for visual inspection of files, and that the actual number of list elements can differ. Are any of the current mzData parsers using the 'count's anyway?

2) The indexing extension of dataXML. Such indexing is evidently useful for fast file access, and it should definitely be part of the standard. However, if I understand the schema correctly (there is no sample file yet), an indexed file would mean that the dataXML is encapsulated within <indexedDataXML>, with the indexing information at the end of the file. Why not use a separate file for the index, which references the dataXML file by a URI? I think that would make for faster data access, with the indexes at the beginning of the file, even if the 'indexOffset' should allow for quick access to the index.

A small consideration is that the offsets / indexes would differ depending on whether the file is opened in binary or text mode, at least for large files on a Windows system. I know that it works for RAP and mzXML, but new implementations which use other libraries and file readers may run into problems. In any case, it should be made clear that the indexes (offsets) are for binary file reading (or text, if that is the case).

The fileCheckSum would also become clearer with two separate files. It is quite complex to have the fileCheckSum of the file contained within the file itself, since the checksum is affected by the writing of the actual checksum ... If the file checksum is contained in a separate file, it is clear that the checksum is that of the actual dataXML, excluding the index file.

On the other hand, it could be handier to have just one file to work with, but is that really such an advantage? If the index file is lost, it would be easy to generate a new one anyway, with a dedicated application for generating dataXML index files.

Regards
Fredrik Levander
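
As an illustration of point 1, here is a minimal sketch of the kind of check a language-specific validator would have to perform. It assumes that "count" is meant to equal the number of direct child elements of the list. The XML fragment is hypothetical and not copied from the example file: the child element names (software, spectrum) and the ids are invented, and only the softwareList mismatch (count="2" with 3 entries) mirrors what is described above. A real file would probably also carry a namespace, which would show up in the reported tag names.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment for illustration only; element names other than
# softwareList and spectrumList are assumptions, not taken from the schema.
SAMPLE = """<dataXML>
  <softwareList count="2">
    <software id="a"/>
    <software id="b"/>
    <software id="c"/>
  </softwareList>
  <spectrumList count="1">
    <spectrum id="s1"/>
  </spectrumList>
</dataXML>"""


def find_count_mismatches(xml_text):
    """Return (tag, declared, actual) for every element whose 'count'
    attribute differs from its number of direct child elements."""
    root = ET.fromstring(xml_text)
    mismatches = []
    for elem in root.iter():
        declared = elem.get("count")
        if declared is None:
            continue  # element carries no count attribute
        actual = len(list(elem))  # direct children only
        if int(declared) != actual:
            mismatches.append((elem.tag, int(declared), actual))
    return mismatches


if __name__ == "__main__":
    for tag, declared, actual in find_count_mismatches(SAMPLE):
        print(f"{tag}: count={declared}, but {actual} child elements found")
```

A parser that preallocates or stops reading based on "count" would see two software entries here, while one that simply iterates over the children would see three, which is the divergence described in point 1.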