From: Oren Ben-K. <or...@be...> - 2004-09-11 04:48:36
On Thursday 09 September 2004 23:17, David Hopwood wrote:

> It can do. The fact that a tag is associated with a single kind is
> simply a matter of convention; it isn't enforced, and can't be
> enforced for unrecognized tags.

Again: The spec can't "enforce" *anything*. The spec simply states what we call a "YAML schema", what we call a "conforming YAML processor" and so on. So, you are free to have tags which apply to more than one kind. Just don't call them YAML tags and don't call your schema a YAML schema. If you do, well, we'll complain loudly. Since YAML is a trademark (I think), we might even be able to force you to rename your stuff to "NotYaml"... But we can't and don't have the faintest motivation to make you stop doing whatever works for you :-)

> If implementations assume this
> property in other places where it isn't used redundantly, they will
> get into trouble.

-10. Must applications carry a load of redundant data and extra processing logic on the off-chance someone feeds them NotYaml pretending to be YAML?

> Nodes don't represent native objects; native objects represent nodes.

-10. This is a major point. YAML is a data serialization language. "In the beginning, there was the data". It is _then_ serialized. When you load a YAML document, you are *RE*constructing the "data" that was originally serialized into the document.

It doesn't matter if the document was written by hand. Whoever wrote the document had some "data" - whether in computer storage or just in his mind - which he serialized into YAML. Loading - either into computer storage or into the mind of a human reader - reverses the process. It _can't_ happen that a YAML document gets created spontaneously and _then_ someone extracts the data from it. The big bang wasn't a well-formed YAML document, and neither are virtual particle pairs :-)

So... Nodes _do_ represent native objects. That's why we call this a "representation" :-)

Now, the whole "identity" thing makes more sense.
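Before getting into the spec's rule, here's a tiny Python sketch of what "identity" means here - plain Python, nothing YAML-specific, just to pin down the difference between "equal" and "the same object":

```python
# Two separately-constructed lists: equal values, distinct identities.
a = [1, 2, 3]
b = [1, 2, 3]
assert a == b        # equal by value
assert a is not b    # but NOT the same object

# One object reachable through two paths: this is the situation that
# anchors/aliases on collections let you preserve through dump + load.
shared = {"key": "value"}
data = {"first": shared, "second": shared}
assert data["first"] is data["second"]   # same object, one identity
```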
It talks about being able to serialize data where the same object is reachable through different paths. Operationally, when loading the data, you want to say "the object _here_ is the same one as the one I already loaded from _there_".

The spec says that if you present your data object as a YAML scalar, then YAML will _not_ preserve the identity of your native objects; when someone else loads the document, in his "data", all equal loaded objects may be the same object - or they might each be a different one, even if you used anchors in the YAML document. The spec also says that if you present your data as YAML collections, then YAML _will_ preserve the identity; whoever loads the YAML document will have exactly one object per identical node, so by using anchors you can ensure he'll have exactly one data object for each of your data objects.

And again, if an application depends on this rule, and some schema says differently, "bad things will happen". Well, again: an implementation need not be burdened with defensive programming against someone trying to pass NotYaml off as YAML.

> ... a Python API ... cannot
> represent every valid element of the YAML node graph model
> distinctly...

If your original data is Python, you simply define a strict YAML schema that works for you, which is actually quite easy even if Python doesn't distinguish between !!int and !!float. Just say all your numbers are !!float, for example, even if they have no '.' in them. Seems simple enough to me.

However, if you load someone else's schema for serializing Java data into Python, you will face problems, because his data uses types that have no direct match in your type set. You have two options:

- Strictly adhere to the semantics he had in mind for his data, by defining new Python types, or using shadow objects to preserve non-Python information, or any other trick.
  Costly and difficult, but it is strict YAML and, more importantly, you are safe knowing that you did not mangle what the document means (the original semantics of the data).

- Take shortcuts, coerce types to the nearest not-quite-appropriate Python type, and so on.

Fine. As you point out, this is quite reasonable in many cases. Again, the spec doesn't forbid you from doing that. It just asks you nicely (OK, we'll sue you for every penny you've got if you don't :-) to clearly document: "if you use the parser in this mode, it isn't a simple loading of a YAML document. What you get is NOT what the original document writer intended. But it is very simple and very useful, so you will want to do this anyway".

> ... to handle YAML documents generically, without making
> assumptions that are only valid in the context of particular
> applications/formats, it's absolutely necessary to have an
> unambiguous representation that can distinctly represent every
> possible YAML node graph.

What is important is to agree that the spec can lay down rules in the first place! Whatever rules you come up with, someone will say "well, you can't enforce that, so my code also has to deal with the case where the document breaks this rule". You'll be getting nowhere fast.

I think the current spec does a good job of defining a way to handle YAML documents generically without assuming anything about the particular *valid YAML schema* (== "application"). If the document does not follow a valid YAML schema, it isn't YAML, and I can't say _anything_ about it. What if it uses the indentation level to encode information? What if comments contain meaningful data? What if the meaning depends on the time of day I read it at?

Have fun,

Oren Ben-Kiki
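P.S. A rough sketch of the "say all your numbers are !!float" trick, in Python. The function name and shape are mine, purely for illustration - this isn't the API of any particular YAML library, just the coercion step you'd run on your data before dumping it under such a strict schema:

```python
def floatify(obj):
    """Recursively coerce every int to float, so a strict schema can
    honestly declare that all its numbers are !!float."""
    if isinstance(obj, bool):        # bool is a subclass of int; leave it alone
        return obj
    if isinstance(obj, int):
        return float(obj)
    if isinstance(obj, dict):
        return {key: floatify(value) for key, value in obj.items()}
    if isinstance(obj, list):
        return [floatify(item) for item in obj]
    return obj

print(floatify({"count": 3, "ratio": 0.5, "flags": [1, True]}))
# {'count': 3.0, 'ratio': 0.5, 'flags': [1.0, True]}
```

Dump the result instead of the original, and every number in the document really is a !!float - no '.' required in your Python source.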