OK, Clark's desperate plea for help aside :-), I want to explore the DWIM
proposal some more. Specifically, in the context of equality comparisons.
I'm going to resurrect the "native model" for this posting at least. Defined
as:
- Made out of nodes;
- Each node is a scalar, mapping or sequence;
- A scalar has an opaque value;
- A sequence is an ordered set of nodes;
- A mapping is a function from one unordered set of different nodes to
another set of nodes;
- Each node has a native data type.
- Each type defines the equality operator.
- Each node has an identity.
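To make the list above concrete, here is a minimal Python sketch of such a
node model. All names in it (Scalar, Sequence, Mapping, EQUALITY,
nodes_equal) are hypothetical and chosen for illustration only; representing
a mapping as a list of key/value node pairs is an assumption as well.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple, Union


# eq=False keeps Python's default identity-based ==, so two nodes holding
# the same value are still distinct nodes ("each node has an identity").
@dataclass(eq=False)
class Scalar:
    native_type: str
    value: Any  # opaque at this level


@dataclass(eq=False)
class Sequence:
    native_type: str
    items: List["Node"] = field(default_factory=list)  # ordered


@dataclass(eq=False)
class Mapping:
    native_type: str
    # Assumption: key/value pairs of nodes; order carries no meaning.
    pairs: List[Tuple["Node", "Node"]] = field(default_factory=list)


Node = Union[Scalar, Sequence, Mapping]

# "Each type defines the equality operator": a per-type registry.
EQUALITY: Dict[str, Callable[[Node, Node], bool]] = {
    "int": lambda a, b: a.value == b.value,
}


def nodes_equal(a: Node, b: Node) -> bool:
    """Type-defined equality, as opposed to node identity."""
    if a.native_type != b.native_type:
        return False
    return EQUALITY[a.native_type](a, b)
```

With this sketch, two Scalar("int", 10) nodes compare equal through
nodes_equal() while remaining different nodes under "is".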
Now, each and every data type in the world can be viewed as a collection of
nodes. It does not follow that there's a 1-1 mapping between nodes and
"objects" or whatever. For example, in the PHP example of an ordered map,
there are some "artificial" nodes required along the way. But if one accepts
that, the above is strong enough to represent anything (counterexamples if
you disagree, please).
The problem with the native model is that scalar values are opaque. This
means that an API working at this level has severe problems dealing with
such values - though at least in some languages it may be possible to do so
anyway, at least to a limited degree.
At any rate, to allow constructing APIs that can deal with values, YAML
provides two constructs, type family and format. A type family is an
abstraction of a data type (let's not delve too deeply here, or we'll have to
get into set theory and all sorts of nasty math). A format is a method of
converting the opaque value into a Unicode string. A canonical format
preserves a 1-1 mapping between opaque values and such strings. So far, so
good.
Now, the graph model associates a type family and a format with each node.
It must, because it can't very well use the native data type (too platform
specific) and the native value (it is opaque). Also, to allow for
comparisons, it seems the graph model must provide a _canonical_ format. So
much seems inevitable if we start from the native model.
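As a rough illustration of why a canonical format makes comparisons
possible, here is a sketch of a canonicalizer for an integer type family.
The function name canonical_int is hypothetical, and the set of accepted
formats (decimal, 0x hex, leading-zero octal) is an assumption made for the
example, not something this posting defines.

```python
import re

# Assumed integer formats: optional sign, then hex (0x..), octal (leading 0)
# or plain decimal digits.
_INT = re.compile(r"([+-]?)(?:0[xX]([0-9a-fA-F]+)|(0[0-7]+)|([0-9]+))")


def canonical_int(text: str) -> str:
    """Map any accepted integer format to one canonical decimal string."""
    match = _INT.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not an integer in any known format: {text!r}")
    sign, hex_digits, oct_digits, dec_digits = match.groups()
    if hex_digits is not None:
        value = int(hex_digits, 16)
    elif oct_digits is not None:
        value = int(oct_digits, 8)
    else:
        value = int(dec_digits, 10)
    # Canonical form: no "+", no leading zeros, no negative zero.
    prefix = "-" if sign == "-" and value != 0 else ""
    return prefix + str(value)
```

Two scalars of this type family are then equal exactly when their canonical
strings are equal, whatever format each was written in.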
At the same time, if we start with the syntax model, we have the problem of
implicit types and non-canonical formats. Obviously, one can't do
comparisons there (nor any other operation on values). In effect, values in
the syntax and serial models are as opaque as are values in the native
model.
The loader has the job of reconciling these two views. Two issues are
apparent:
- This is impossible to do without knowledge of the "semantics". At least at
the level of knowing all the relevant type families and their formats, how
to convert from one format to another, and so on.
- This is a big pile of dung because this set depends on the semantics of
the particular document and requires *code* specific to that semantics.
We now refer to the sign saying "semantics-blind vs. semantics-aware".
A fully semantics-aware tool has no problem; it knows all the types.
A semantics-blind tool can't do the job. Just can't. The graph model as
described above is beyond it.
So. We have several options.
- Water down the graph model so it would be within reach of such tools; that
means basically that the graph model allows multiple paths to each node (by
interpreting aliases) but does not support any comparisons; in effect values
are opaque all the way through. That's too restrictive.
- Require that such tools would be enhanced to know all the type families
involved in the document. If they don't recognize even one, they choke. This
means YGREP or YAML-pretty-print would not be able to work on a document
without being given the code for _all_ the types in it. Again, too
restrictive.
- Live with it. Allow the tools to be given code to better understand types,
yes; promote some types as universal, certainly; but accept that these tools
will work with a model that is an _approximation_ to the ideal graph model
described above. In this approximate model, some nodes are "known" - have
known type families, support type-aware comparisons using the canonical
format, and so on. Other nodes are "unknown". Whether or not they have a
type family and format, the string value is "opaque". On these nodes, the
only operations possible are dumb string comparisons.
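The "approximate" model in the last option can be sketched roughly as
follows. The names GraphScalar, CANONICALIZERS and scalars_equal are
hypothetical, and the single registered family ("int", decimal only) is just
a stand-in for whatever types a given tool happens to know.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class GraphScalar:
    type_family: Optional[str]  # None if the loader could not determine one
    text: str                   # the string value as found in the document


# Type families this particular tool knows how to canonicalize.
CANONICALIZERS: Dict[str, Callable[[str], str]] = {
    "int": lambda text: str(int(text, 10)),
}


def scalars_equal(a: GraphScalar, b: GraphScalar) -> bool:
    """Type-aware comparison for known nodes, string comparison otherwise."""
    if a.type_family != b.type_family:
        return False
    canonicalize = CANONICALIZERS.get(a.type_family)
    if canonicalize is None:
        # "Unknown" node: the string is opaque, so compare it as-is.
        return a.text == b.text
    return canonicalize(a.text) == canonicalize(b.text)
```

Here "+10" and "10" are equal as known integers, while "USD 12.5" and
"USD 12.50" are not equal as unknown currency values, even though a
currency-aware tool would presumably treat them as the same amount.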
Dealing with "unknown" nodes isn't fun. It isn't safe. But it is very
useful. YAML-Pretty-Print is an example. Even YPATH can work to a reasonable
extent. For some types, there is only a canonical format (IP addresses are a
good example), or the canonical format is the most commonly used (E-mail
addresses, URLs). More importantly, given my YPATH tool knows about integers
but not about currency codes (say), I expect to be able to search for the
integer 10 and find 012, and I'll live with having to look for the exact
string for "USD 12.5" (or whatever). If I can't, I'll download a currency
plug-in for my YPATH tool.
How to phrase it in our models view? By allowing information to be partial
in the graph model. Sure, it makes life hard. But there's no way around it.
We now come, at long last, to DWIM vs. #IMPLICIT. There are two orthogonal
issues here, really. One is the above; whether the graph model should be
full or partial. I hope the above makes a convincing case that we should
accept it being partial. Well, if we agree that this is the case, there's no
need for () notation. All unquoted flow scalars have a missing type family;
it is the loader's job to figure them out. If it can't, well, they are
"unknown" - big deal; so are "!foo" nodes.
The second issue is this. How should a semantics-blind tool know which types
to load for each given document? Again there are several options:
- Declare it up front by a #SCHEMA directive that points to a document
listing all these types (and probably more besides).
- List implicits using #IMPLICIT, and use the type family name for
- Shift the burden to the instructions given to the semantics-blind tool
when invoked for the specific document.
The latter seems a cop-out. I want to run YGREP on a given document.
What, I'm supposed to give it (in the command line, say) the list of all
implicit types in it? Obviously not.
However, what you _do_ give YGREP is a search expression. And just like the
YAML document provides explicit type families to some nodes, the search
expression uses explicit operator names (e.g., "equal-integer"). These
operators tell YGREP exactly what implicit types you are interested in. The
same goes for CYATL, YPATH, and so on. After all, a semantics-blind tool
_can't_ invoke any such operator on its own; it _must_ be told to do so from
the outside (otherwise it isn't semantics-blind).
So it all falls together. I ask YGREP to look for "equal-integer(10)". It looks
for either "!int" _or_, for implicitly typed scalar nodes, the regexp
[+-]?\d+ or [+-]0x\d+ or whatever. It knows about _both_ "!int" and these
regexps because they are part of the package that supplied "equal-integer".
It finds my integer, silently ignoring the existence of IP addresses, URLs,
and foos in the document - because I didn't ask it to care about them.
Now, suppose I used #IMPLICIT, and told YGREP that the document uses foos,
and IP addresses, and whatever else. And it does not know about foos,
because it is an old YGREP and foos were only accepted to the registry
yesterday. I'm still searching for my integer.
Should YGREP barf? Obviously not. It should only barf if I ask it for
"equal-foo(xyzzy)". But it would _anyway_, because it has no such registered
operator. So what did #IMPLICIT give me? Nothing. What did it cost? A big
headache writing all the #IMPLICITs in the beginning of each document (any
bets that most people will just write #IMPLICIT:everything-in-the-world just
so they won't have to worry about using one they didn't declare, even if
they only use just one or two?).
- The graph model should provide the type family and the value under the
_canonical_ format (and not preserve the original format);
- But it must also allow for "unknown" type family/format with an opaque
value;
- There's no need for #IMPLICIT.
That's the DWIM proposal.