[Animl-develop] Comments on AnIML 1.04 (Mark Mullins' proposal)


Hi again

I have uploaded to CVS Mark Mullins' 1.04 schema and his xml sample (which,
as Karen points out, did not implement the Chromatography technique so is
not yet valid AnIML; still, it is useful to illustrate his intent).  I have
given enough thought to his proposed changes to suggest we find an alternate
solution to his points 1 and 2.

I intend to make the following revisions to AnIML 1.03 to accommodate Mark
Mullins' needs and call it AnIML 1.05; the rationale follows the changes.

(1) Moved the length attribute from VectorSet to EncodedValueSet - addresses
Mark's point 2
(2) Changed AutoIncrementedValueSet and EncodedValueSet from unbounded to 0
to 1
(3) Added an Operation type with values {add, subtract} and used that type as
a new attribute of Reference called "operation" - addresses a prior point of
my own

There is much more, but I will put it in separate emails...

regards, Mark Bean

(1) --------------------- VectorSet.length attribute (Mark Mullins' point 2)
Apparently there is chromatography data in existence that is discontinuous
in time - for example 0-5 mins and 10-25 mins.  This was the source of some
of Mark Mullins' (also of SSI) concerns motivating him to make his changes.
I am assured by the president of SSI (makers of EZChrome) that this is a
rare but real situation, perhaps from a single vendor.

AnIML was originally written to permit multiple data containers:
ExperimentSteps, Pages, Vectors, ValueSets. Circumscribing these collections
are the non-data containers ExperimentStepSet ("MeasurementData"), PageSet,
VectorSet -- but not a ValueSetSet.  Only one of the non-data collections
has a "length" attribute - VectorSet.  There is no indication what this
means in the schema itself, but according to the Dominik Poetz
documentation, this is not the number of Vectors but rather "how many
elements a vector is supposed to have", which clearly assumes that all
vectors in the VectorSet will have the same length.

Before I discuss Mark Mullins' proposal, I should elaborate on why I think
VectorSet, of the four possible non-data collections, is the only one to
have a length attribute.  Length (or Count) is often useful for
programmers in that it permits dimensioning of arrays prior to reading the
data into them (required in many languages).  As Length is not an attribute
of all the collections, Burkhard must have thought that one could obtain the
length for any collection simply by parsing it.  What makes Vector different
is the fact that the number of items in an EncodedValueSet base64Binary
array cannot be obtained directly by a parser, so maybe Burkhard added a
length attribute to handle this.  Because he assumed that all Vectors would
have the same length, he moved it up into VectorSet (ok, that may be a bit
confusing).  Perhaps it should have been named "vectorLength" rather than
"length" to clarify which thing's length it describes.

Now in discontinuous data (described in the first paragraph of this point
above), one cannot use a VectorSet length attribute to describe the number
of elements in a Vector, because the number of elements in the Vector's
ValueSets varies (e.g. 0-5 and 10-25 mins).  As one is allowed multiple
ValueSets per Vector in AnIML, the concept of VectorSet.length is broken.
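
To make the base64Binary point concrete, here is a minimal Python sketch.
It assumes 32-bit little-endian floats and uses hypothetical helper names;
neither comes from the AnIML schema.  It shows that the item count of an
encoded block is only known after decoding it and knowing the element width,
and that two segments of unequal length cannot share one length value.

```python
import base64
import struct

def encode_values(values):
    """Pack floats as little-endian 32-bit values (an assumption) and base64-encode."""
    raw = struct.pack("<%df" % len(values), *values)
    return base64.b64encode(raw).decode("ascii")

def count_values(encoded_text, bytes_per_value=4):
    """Recover the item count; needs a full decode plus the element width."""
    raw = base64.b64decode(encoded_text)
    return len(raw) // bytes_per_value

# Two placeholder segments of unequal length (not real chromatography data).
segment_a = encode_values([0.0] * 6)
segment_b = encode_values([0.0] * 16)

# The parser only sees text; the character count is not the item count.
print(len(segment_a), count_values(segment_a))
print(len(segment_b), count_values(segment_b))
```

A length attribute stored with each encoded block would let a reader
dimension its array before decoding.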

Mullins proposed adding a length to Vector and AutoIncrementedValueSet, but
I would prefer more consistent usage of length in the collections, and three
options come to mind:
(a) Add length to every collection (set) in AnIML and thus also have to add
a ValueSetSet (collection of ValueSets)
(b) Omit length from VectorSet (and thus from collections)
(c) Move length from VectorSet to EncodedValueSet under the assumption
Burkhard's intent was to indicate that EncodedValueSet length is a special
case

The Mullins proposal propagates inclusion of a length attribute
inconsistently in collections and might be said to share an additional
weakness pervasive in AnIML - assumption of orderliness between parallel
elements.  In one section of the "AnIML 1.04 lc mockup with errors.animl"
there are two AutoIncrementedValueSets and then two EncodedValueSets
representing the two discontinuous chromatogram segments.  That is
currently legal, but so is a situation where the order of the
AutoIncrementedValueSets is not the same as the order of the
EncodedValueSets.  This assumption of orderliness also exists between
Vectors in a Page and between Templates and ExperimentSteps among other
places, so it can only be considered further propagation of an existing
weakness.
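
The ordering weakness can be sketched in a few lines of Python.  The tuples
and lists below are hypothetical stand-ins, not AnIML structures: each time
segment is (start minute, point count), where the point count follows
Mullins' proposed length attribute, and each intensity block stands for an
already-decoded EncodedValueSet.  Nothing ties the two lists together except
their position in the document.

```python
def pair_by_position(time_segments, intensity_blocks):
    """Pair time-axis descriptions with intensity blocks purely by document order."""
    return list(zip(time_segments, intensity_blocks))

def consistent(pairs):
    """True when every time segment's point count matches its intensity block."""
    return all(count == len(block) for (_, count), block in pairs)

# Placeholder values only: (start_minute, point_count) and decoded intensities.
time_segments = [(0.0, 3), (10.0, 5)]
intensities = [[1, 2, 3], [4, 5, 6, 7, 8]]

in_order = pair_by_position(time_segments, intensities)
swapped = pair_by_position(time_segments, list(reversed(intensities)))

print(consistent(in_order))
print(consistent(swapped))
```

Note that this check only catches a swap when the segments differ in length;
two equal-length segments written in the wrong order would pass unnoticed.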

(2) --------------------- AutoIncrementedValueSet and EncodedValueSet now
bounded 0 to 1
To resolve the above difficulties, I changed AutoIncrementedValueSet and
EncodedValueSet from unbounded to 0 to 1 but left IndividualValueSets
unbounded.  This would force us to adopt a different, clearer, but less
compact approach to segmented chromatograms - using EncodedValueSets for
both the Time and the Intensity Vectors.  As discontinuous data of this sort
is rare, the impact on file size may not be important.  I prefer
constraining the ways we fill AnIML wherever there is a suitable approach
like this.  It also resolves the next point.
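
As a rough illustration of the size cost, the Python sketch below expands
the two example segments into one explicit time axis and estimates its
base64 size.  The 1-minute step, the 4-byte value width, and the function
names are assumptions chosen for illustration only.

```python
import math

def explicit_times(segments, step):
    """Expand (start, end) segments in minutes into one explicit list of time points."""
    times = []
    for start, end in segments:
        n = int(round((end - start) / step)) + 1
        times.extend(start + i * step for i in range(n))
    return times

def base64_chars(n_values, bytes_per_value=4):
    """Characters needed to base64-encode n_values fixed-width numbers."""
    return 4 * math.ceil(n_values * bytes_per_value / 3)

# Segments 0-5 and 10-25 mins, sampled once per minute (illustrative step).
times = explicit_times([(0.0, 5.0), (10.0, 25.0)], step=1.0)
print(len(times), base64_chars(len(times)))
```

The Intensity Vector needs the same number of values, so storing Time
explicitly roughly doubles the encoded payload for such a trace, while the
gap between the segments is carried by the time values themselves.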

(3) ------------------------- References, key\keyrefs (including Mark
Mullins' point 3)
Reference
---------
One or many References may exist in a Page as a reference to a data point
or data point range in a superordinate Page with attributes signableItem,
name, VectorID, index, and refWidth.  A PDA UV-vis spectrum on one page can
refer to a particular index in an associated (derived) UV summed-absorbance
chromatogram.  A mass spectrum on one page can refer to a particular index
and the refWidth number of points (a data range) in an associated (derived)
total-ion-current chromatogram where the width represents a summation of
spectra.
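
A minimal Python sketch of how index and refWidth might select points in the
superordinate Vector follows.  The text above does not say whether the range
starts at index or is centred on it, nor whether index is zero-based; the
sketch assumes a zero-based range starting at index, and the function name is
hypothetical.

```python
def referenced_points(vector, index, ref_width=1):
    """Return the points a Reference selects, assuming the range starts at index."""
    if index < 0 or ref_width < 1 or index + ref_width > len(vector):
        raise ValueError("reference falls outside the vector")
    return vector[index:index + ref_width]

tic = [10, 12, 40, 95, 60, 20, 11]  # placeholder chromatogram, not real data

print(referenced_points(tic, 3))     # single point, as for the UV-vis case
print(referenced_points(tic, 2, 3))  # data range, as for summed mass spectra
```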

(4) ------------------------- ParameterCategorySet added under
MeasurementData (Mark Mullins' point 1)
This is a reasonable change; nevertheless, we need to explore Templates more
closely (next email) as their references are not well made.

.................................
.......from Mark's email........
1.  Added a ParameterCategorySet node under the MeasurementData node.
This allows custom parameters to be added that describe the measurement data
itself.

2.  Modification of "length" attribute information inside of the VectorSets.
     a.  Changed the definition of the "length" attribute on the VectorSet
node to describe the number of Vectors contained in the VectorSet.
     b.  Added "length" attribute to the Vector node.  This will describe
the number of ValueSets in the Vector.
     c.  Added "length" attribute to the AutoIncrementedValueSet node.  This
will describe the number of values in this AutoIncrementedValueSet.
These changes are necessary to specify the lengths of the individual items,
allowing for each of the individual items to contain any number of subitems.
This will give you the ability to have a VectorSet that contains multiple
vectors, each having a different length.

3.  Added a References node under the IndividualValueSet, EncodedValueSet,
and AutoIncrementedValueSet nodes.
This gives the ability for an individual set of values to be related back to
a Vector in a Superordinate page.  Previously, only the entire VectorSet could
be related to a Superordinate Vector.  To accommodate this change for the
EncodedValueSet node, the definition of this node had to be changed to
contain a sub-element node called "Values".
