Re: [Psidev-ms-dev] Comments on dataXML0.9

Hi Fredrik, thank you for your input.

1) Regarding "count", this is not a settled topic. There are some who
would leave them out and some who like them in.  The rationale is that
for parsers in languages that would like to pre-allocate arrays to hold
the information, it is very handy to have a count. Obviously an
incorrect count does no one any good.  I believe the plan is to write a
validator that will validate several items beyond schema compliance
including when important ontology terms are present and whether the
ontology terms used are valid. Validating that the count attributes are
accurate should be included.
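
As a rough illustration of the count check such a validator could
perform, here is a minimal sketch in Python. It is not the planned PSI
validator. The function name find_count_mismatches, the child element
names (software, spectrum) and the sample document are assumptions made
up for this example; only softwareList, spectrumList and the "count"
attribute come from the discussion. It assumes that "count" is meant to
equal the number of direct child elements.

```python
import xml.etree.ElementTree as ET


def find_count_mismatches(xml_text):
    """Return (tag, declared, actual) for every element whose 'count'
    attribute differs from its number of direct child elements."""
    root = ET.fromstring(xml_text)
    mismatches = []
    for elem in root.iter():
        declared = elem.get("count")
        if declared is None:
            continue  # element carries no count attribute
        actual = len(elem)  # number of direct child elements
        if int(declared) != actual:
            mismatches.append((elem.tag, int(declared), actual))
    return mismatches


# Hypothetical sample, loosely modelled on the mismatch reported for the
# example file (count="2" but three list elements).
SAMPLE = """<dataXML>
  <softwareList count="2">
    <software id="a"/>
    <software id="b"/>
    <software id="c"/>
  </softwareList>
  <spectrumList count="1">
    <spectrum id="s1"/>
  </spectrumList>
</dataXML>"""

mismatches = find_count_mismatches(SAMPLE)
for tag, declared, actual in mismatches:
    print(f"{tag}: count={declared}, actual elements={actual}")
```

A check like this is independent of the parser language used by
consumers, since it only needs to be run once when a file is written or
submitted.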

2) Your input regarding indexes is very valuable.  We had decided to
maintain the same indexing scheme as used by mzXML for simplicity of
porting software and since it seemed to work for everyone who used it.
However, the points you bring up are interesting. I was not aware of the
Windows OS issues you bring up. I personally find that separate index
files would be less tidy than everything in a single file. Indexing
has always been a somewhat contentious issue, but it is worth
discussing.
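
To make the binary versus text mode point concrete, here is a small
sketch in Python. It assumes a file written with Windows-style CRLF line
endings and a reader that translates them to a single newline in text
mode; the tiny document and variable names are invented for the
illustration and are not taken from dataXML or mzXML.

```python
import io

# Hypothetical file content with CRLF line endings, as raw bytes.
data = b"<a>\r\n<b/>\r\n</a>\r\n"

# Offset of <b/> when the file is read in binary mode (raw bytes).
byte_offset = data.find(b"<b/>")

# Offset of <b/> when the same bytes are read in text mode with newline
# translation (newline=None converts "\r\n" to "\n").
text = io.TextIOWrapper(io.BytesIO(data), encoding="ascii", newline=None).read()
text_offset = text.find("<b/>")

print(byte_offset, text_offset)
```

The two offsets differ by one for every CRLF that precedes the target,
so an index written as byte offsets is only usable by readers that seek
in binary mode.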

Regards,
Eric

> -----Original Message-----
> From: Fredrik Levander [mailto:Fre...@el...]
> Sent: Tuesday, February 06, 2007 7:20 AM
> To: PSI MS Dev
> Cc: Eric Deutsch
> Subject: Comments on dataXML0.9
>
> Hello,
>
> After a quick inspection of dataXML0.9, the first impression is that it
> looks very promising and that all people working with it have done some
> great work in the fusing of the mz's.
>
> I've got some small specific comments which I guess you've already
> discussed, but anyway:
>
> 1) The attribute "count" which can be found at some places
> (softwareList, spectrumList etc) could be more of an obstacle than of
> help. There is no easy way to validate that this number corresponds with
> the actual number of list elements in the XML world. Such a validation
> would require specific validators for each programming language. In
> cases where the count attribute is not equal to the number of elements
> in the list, there could be different parsing results depending on if
> the implementation is using the 'count' or standard parsing, the latter
> ignoring the count attribute. Actually, the example file:
> http://db.systemsbiology.net/projects/PSI/dataXML/tiny1.dataXML0.9.xml
> is an example where the softwareList count="2", but the actual number of
> elements is 3.
> I would suggest that either the 'count's are omitted, or PSIDEV should
> at some time provide validators which verify that list lengths are equal
> to the count attribute. Another option is that the attribute is documented
> only to be used for visual inspection of files, and that the actual
> number of list elements can differ. Are any of the current
> mzData-parsers using the 'count's anyway?
>
> 2) The indexing extension of dataXML.  It is evident that such an
> indexing is useful for fast file access, and it should definitely be
> part of the standard. However, if I understand the schema correctly (no
> sample file yet), an indexed file would mean that the dataXML is
> encapsulated within the <indexedDataXML>, with the indexing information
> at the end of the file. Why not use a separate file for the indexing,
> which references the dataXML file as a URI?  I think that would make for
> faster data access with the indexes in the beginning of the file, even
> if the 'indexOffset' should allow for quick access to the index. A small
> consideration is that the offset / indexes would differ depending on if
> the file is opened in binary or text mode, at least for large files on a
> Windows system. I know that it is working for RAP and mzXML, but for new
> implementations which use other libraries and file readers there may be
> problems. Anyway, it should be made clear that indexes (offsets) are
> for binary file reading (or text if that is the case).
> The fileCheckSum would also become clearer with two separate files. It
> is quite complex to have the fileCheckSum of the file contained within
> the file itself, since the checksum is affected by the writing of the
> actual checksum ... If the file checksum is contained in a separate
> file it is clear that the checksum is the checksum of the actual
> dataXML, excluding the index file.
> On the other hand it could be more handy to have just one file to work
> with, but is it really such an advantage? In cases where the index file
> is lost it would be easy to generate a new index file with a specific
> application for generation of dataXML index files anyway.
>
> Regards
>
> Fredrik Levander