From: Oren Ben-K. <or...@ri...> - 2001-12-08 12:13:44
We are back to "YAML is like an onion, not like a cake" issues. Both have layers, but... I realized you and I read a slightly different meaning into the current wording. You expect the generic information model to be realized in concrete programming terms, as a separate programming layer (libyaml, probably). I've thought of it as a set of constraints on the concrete information model implementation. In programming terms, you think of it as a concrete class to wrap; I think of it as an interface to implement.

I understand why you'd want a concrete generic implementation. It would be ever so nice to be able to implement YPATH, YSIGN etc. using just a single, universal implementation which is totally independent of the schema, data types etc. used in the document... Unfortunately this is an impossible goal. Even the lowly integer (10, 012, 0xA) defeats you, never mind the harder cases. I think my view is the more practical one: with a little bit of care it lets us do everything which can be done, without trapping ourselves in impossible constraints. I'll try to formalize this a bit.

Let's use digital signatures as a test case, because it focuses nicely on information model issues. The way I see it, there are only two concrete models. One is the serialized model. Every YAML document is a concrete realization of this model. We both agree on this, and that not all the information contained in this model is "significant". Note that digitally signing a serialized YAML document, including comments, folding etc., isn't a bad idea; in fact it may make perfect sense in certain scenarios. The fact that some of the information is "insignificant" to the application which will act upon the document does not mean this information is "insignificant" from, say, a legal point of view.

At any rate, the second model is the native one. This isn't a great name...
The native model is the result of parsing/loading/reading the file into "working memory" (which may be an on-disk SQL database, but you get the drift). In short, it is what a YAML application is expected to actually work with. One design goal for YAML which I feel we should strongly adhere to is that the concrete realization of this second model uses native data structures rather than generic DOM-like objects.

This, however, poses a problem. On the one hand, we have a strong need to define exactly what information is significant in this model. The trivial way to do that is to define a set of DOM-like interfaces (which is what XML has done). But we will be using a maze of twisty little native data structures, all different; there's no way we can enforce/require a common set of interfaces. The solution is to define a set of *constraints* on any concrete in-memory data structure used to represent YAML data. You could think of this set of constraints as an "interface" which the concrete realization must "implement". Hence I think of it as "the abstract data model", and hence there's no such thing as "an API for the abstract data model", just as there's no such thing as an instance of an interface class.

Under this view, let's revisit our layered processing model:

    Serialized model -> Parser -> Abstract model -> Loader -> Native model

It is obvious that what passes between the parser and the loader can't be something you can work with. The reason to separate the parser from the loader is to separate the task of extracting structured, uninterpreted information from the file (the parser) from the task of converting it into appropriate native structures (the loader).

Now, about transfer types/methods, type families etc. Each node has a serial representation; after parsing/loading it has a native one. What is allowed in the serial representation is known as "the syntax", and we've got that one covered pretty well.
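To make the separation concrete, here is a minimal sketch in Python. All names here (`RawNode`, `Loader` and so on) are hypothetical illustrations, not a proposed API: the point is only that the parser hands over structured but *uninterpreted* data, and the loader - a collection of transfer method implementations - converts it into native objects.

```python
from typing import Any, Callable, Dict

class RawNode:
    """What the parser hands to the loader: structured but
    *uninterpreted* information extracted from the file."""
    def __init__(self, transfer_method: str, raw: str):
        self.transfer_method = transfer_method
        self.raw = raw

class Loader:
    """'The loader' is just a collection of transfer method
    implementations; the API between parser and loader is
    arbitrary and private to the two of them."""
    def __init__(self, methods: Dict[str, Callable[[str], Any]]):
        self.methods = methods

    def load(self, node: RawNode) -> Any:
        # Dispatch the uninterpreted text to the matching
        # transfer method implementation.
        return self.methods[node.transfer_method](node.raw)

loader = Loader({"str": str, "int": lambda raw: int(raw, 0)})
assert loader.load(RawNode("str", "hello")) == "hello"
assert loader.load(RawNode("int", "0xA")) == 10
```

Note that nothing here resembles a DOM: the native objects that come out are plain strings and integers, and the abstract model exists only as the constraints the transfer methods are required to satisfy.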
What is allowed in the native representation *can't* be fully specified, because we'll be using all these different implementations; hence we can only specify *constraints* on it, known as "abstract types". These define things like equality, identity etc. Again, there's no instance of an "abstract type" which isn't also an instance of some concrete "native type".

Once the parser has extracted the data from the serialized representation, someone has to convert it into *some* native type. This is the task of the loader, and the piece of code which does that for a particular node is an *implementation* of a certain "transfer method". A transfer method is *defined* in terms of "input: node syntax (style, format); output: abstract type". An *implementation* of a transfer method takes data in a format specified by a certain parser (an arbitrary API - character buffers or whatever) and converts it into a concrete native object, where the native data type of that object satisfies the requirements given by the abstract type. One can think of the "loader" as "the collection of transfer method implementations".

Again, the API between the parser and "the loader" is completely arbitrary and not something anyone other than a parser or loader implementor needs to know about. One could in theory think of standard APIs between the parser and the loader, to allow the same transfer method implementation to work in different parsers, but that's a whole separate issue and IMVHO extremely premature to even mention, so forget I said it. Certainly no application would *ever* be working at this level.

As mentioned in the spec, when a parser sees an unknown transfer method it has the option to invoke a "catch-all" transfer method known as "encoding".
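As a sketch of the definition/implementation distinction (hypothetical code, not part of any spec): the *definition* of an "int" transfer method is "input: integer syntax in any accepted format; output: the abstract integer type". One possible *implementation* maps raw text to a native Python int, and on output must pick a single one of the accepted formats.

```python
def load_int(raw: str) -> int:
    """One implementation of a hypothetical 'int' transfer method:
    three formats, one abstract value - '10', '012' and '0xA'
    are all the number ten."""
    if raw.lower().startswith("0x"):
        return int(raw, 16)   # hexadecimal
    if raw.startswith("0") and len(raw) > 1:
        return int(raw, 8)    # octal
    return int(raw, 10)       # decimal

def dump_int(value: int) -> str:
    """On output, the method chooses amongst the formats it
    accepts; here we arbitrarily pick plain decimal."""
    return str(value)

assert load_int("10") == load_int("012") == load_int("0xA") == 10
assert dump_int(load_int("0xA")) == "10"   # format is not round-tripped
```

The last line is the round-trip point in miniature: what survives is the abstract value ten, not the particular format it was written in.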
The "encoding" method builds on the native data types which are required by YAML (plus the magical structured keys) to provide a native data object which preserves the significant information (actually, it preserves *too much*, but that's the best one can do). On output this native data object would be unharmed, and the application can even do *very limited* processing on its content.

Now, back to signatures. It is obvious signing *can't* work in the abstract model; an implementation will always use *some* concrete native model. However, we want different implementations of the signature utility, running on different native models, to give the same result. How can we do that? Obviously by restricting the implementations to rely only on properties of the data which are fully specified in the abstract model. It is similar to coding against an interface class rather than a concrete implementation, to ensure that your code works no matter the implementation. In our case, however, we can't define an actual interface class which you can code against (because the different implementations use LISP, Python, SQL, Perl, C++ and who knows what else). In an ideal world we could have used some IDL and asked everyone to code against that. However, this isn't an ideal world, and a Perl programmer would just code using Perl arrays and hashes.

So, back to the signature utility. It has another problem: for an unknown transfer method, it *doesn't know* what properties of the data are "fully specified" by the abstract model and what aren't. For example, if it doesn't know what integers are, then there is no way it can tell that '10', '012' and '0xA' all have the property of being the number ten. This means that the signature utility must play it safe: it takes all the (encoded) data and uses that. So if it doesn't know what integers are, '10', '012' and '0xA' would give different signatures - all of which would be different from the signature a utility which does know about integers would give to the number ten.
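A toy illustration of this "play it safe" rule (hypothetical code; a real YSIGN would be far more involved): a signer that knows the integer transfer method signs the abstract value, while one that doesn't falls back to signing the encoded method/text pair - and so sees three different values where there is really one.

```python
import hashlib

def parse_int(raw: str) -> int:
    # Hypothetical integer transfer method: three formats, one value.
    if raw.lower().startswith("0x"):
        return int(raw, 16)
    if raw.startswith("0") and len(raw) > 1:
        return int(raw, 8)
    return int(raw, 10)

def signature(transfer_method: str, raw: str, known=None) -> str:
    """Sign a scalar node. If the transfer method is known, sign a
    canonical form of the abstract value; otherwise play it safe
    and sign the encoded (method, raw text) pair."""
    known = known or {}
    if transfer_method in known:
        canonical = repr(known[transfer_method](raw))
    else:
        canonical = transfer_method + ":" + raw
    return hashlib.sha256(canonical.encode()).hexdigest()

aware = {"int": parse_int}
# A signer that knows about integers sees one value...
assert (signature("int", "10", aware)
        == signature("int", "012", aware)
        == signature("int", "0xA", aware))
# ...one that does not sees three different ones.
assert len({signature("int", "10"),
            signature("int", "012"),
            signature("int", "0xA")}) == 3
```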
Is that acceptable? For digital signatures, probably not. So, if I were writing a signature utility, I'd reject every unknown transfer method. But in other cases it makes perfect sense. Take the simple task of creating a map object. One of your keys is in an unknown transfer method. You don't know whether it is equal to another key, so you assume it isn't. Hence '10', '012' and '0xA' would be accepted as different keys in an application which doesn't know what integers are. It would know that '10' and '10' are equal, because if you have the same transfer method and the same uninterpreted data, you know you'd get equal native objects - a transfer method must be deterministic. An application which does know about integers would recognize '10', '012' and '0xA' as duplicates. That's perfectly acceptable.

Take YPATH. Again, if a YPATH implementation knows the transfer method, it can match '10' with '0xA'. If it doesn't, it can't. This sucks. Really sucks. But no wording, twisting, re-layering etc. would help you avoid this hard fact. The best you can do is require every YPATH implementation to be aware of a minimal set of transfer methods.

Final point: round trip. It follows from the above that the particular format ('10' vs. '012'), style (block vs. escaped) and *transfer method* ('base64' vs. 'base16') are *NOT* round tripped. What is round tripped is (1) the abstract type of the value, and (2) the value itself (under the equality operation defined by the abstract type). That's all. Encoding is a simple way to allow every application to satisfy this when the transfer method is unknown, by round-tripping both the transfer method and the format.

Summary:

- Three models:
  - Serialization:
    - Concrete (files).
    - Syntax (node style).
  - Native:
    - Concrete (objects).
    - Classes (native data types).
  - Abstract:
    - Constraints on native.
    - Mathematical (schemas, etc.).
- Transfer method:
  - Definition:
    - Input: uninterpreted data (node style, value format).
    - Output: abstract type.
  - Implementation:
    - Input: parser-specific API (uninterpreted data).
    - Output: native object.
- The native data type is a property of the native model.
- The transfer method is a property of the serialization model.
- The same transfer method may accept several formats for the same value.
  - On output it chooses amongst them.
- Several transfer methods may target the same abstract type.
  - On output the dumper chooses amongst them.
- The same abstract type may be realized by several native types.
  - The transfer method implementation chooses amongst them.
- If you don't know the transfer method, you *are restricted*, but you can round-trip.

Well, this was a long post... but I hope it conveys my views on the issue, and that I convinced you it is "the right thing to do".

Have fun,

Oren Ben-Kiki