(Note that I changed the subject line, and to ensure continuity, included the full text of Oren's email in my response.)

At 11:59 AM 8/2/2001 +0200, Oren Ben-Kiki wrote:
Joe Lapp [mailto:joe@bugfeathers.com] wrote:
> I'm still trying to get a grip on why YAML-Core
> should provide special places for application-level
> type information.  I can only think of one argument
> that works for me: The purpose of a YAML file is to
> express data, even if it happens to be data serialized
> from programming language objects; information the
> serializer needs should be segregated from the data
> to make it easy to tell data from non-data.

That doesn't work for me. To me, (de)serializer isn't
special. There are many "layers" of functionality one
could apply to a document (on parsing or on processing).

What does work for me is: Keep YAML human readable. The
only purpose of the shorthand mechanism is to be a
human-readable shorthand for what we foresee to be a
common case of having a certain type of maps. Nothing
more.

For this reason to make sense it requires two decisions.

1. Should we use a map, or should we use an "attributed
graph" model, as Clark proposed?

In the end, no matter what information model we attach to the YAML syntax, I'm pretty sure every node of the YAML object graph will need at least a type/class attribute.

I think we can borrow a lesson learned from XML.  A specific application may implicitly know the type structure of a YAML document, but generic tools will not.  A generic tool may need a way to know the type of any given node.  Since only the specific application has this information, the application must hand this information to the tool.  It can hand the information either via the object tree or via a separate metadata structure.  For the general case, that metadata structure would have to mirror the YAML tree.  Unless the YAML tree itself contains the types, the tool will likely have to perform identical operations on both the YAML tree and the metadata structure.

A utility that tests two graphs for equality is an example of such a tool.
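
As a rough sketch of that equality utility (Python, purely illustrative): the Node class and its type_name field are hypothetical stand-ins for "a node with one class-name attribute", not anything defined by YAML-Core, and the graphs are assumed to be acyclic.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(eq=False)  # eq=False so the comparison logic below stays explicit
class Node:
    # Hypothetical node shape: a single class-name attribute plus a value.
    type_name: str
    value: Any  # a scalar, a list of Node, or a dict of str -> Node


def nodes_equal(a: Node, b: Node) -> bool:
    """Compare two acyclic graphs, reading each type from the node itself."""
    if a.type_name != b.type_name:
        return False
    if isinstance(a.value, dict) and isinstance(b.value, dict):
        return (a.value.keys() == b.value.keys()
                and all(nodes_equal(a.value[k], b.value[k]) for k in a.value))
    if isinstance(a.value, list) and isinstance(b.value, list):
        return (len(a.value) == len(b.value)
                and all(nodes_equal(x, y) for x, y in zip(a.value, b.value)))
    return a.value == b.value
```

Because the type travels with the node, the tool needs no second, mirrored metadata structure: a date "2001-07-31" and a string "2001-07-31" compare unequal in a single walk.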

In XML, the situation is actually quite a bit worse than this, and XML-Schema is mostly to blame.  It's very hard to implement schema validation without allowing the validator to attach metadata to each node.  I very much hope that we could write a YAML-Schema that needed no more than a class name for each node.

The upshot is that YAML node APIs will probably end up with attributes anyway.  We just have to name these attributes in the information model and not allow applications to define their own attributes.


2. If we are using a map, is it OK to make it implicit?
(again, Clark feels uneasy about this).

Clark explained your proposal to me.  I didn't quite get it.  Apparently shorthand notation is indeed a shorthand notation, one for maps.  I assume it's supposed to be equivalent to a map (or supplementary to a map).  I guess that means shorthand notation can only appear on maps.

If this is what you mean, then I too am quite uneasy with it.  You would have to add conflict resolution rules to YAML, and I think in the end users will suffer a little more confusion.  Besides, I hate having two different ways to do the same thing.


I feel very strongly about (1). Yes, we should use just
a simple map/list/scalar model, instead of an attributed
graph model. My reasons are:

- It avoids the need for a DOM. The ability to slurp a
YAML document directly into Perl/Python/Java/JavaScript
etc., use only native APIs, and then spit it out unchanged,
is simply priceless. An "attributed graph" destroys this
ability.

I agree 100%.  What if we limit ourselves to just a class name attribute?

If the only attribute were class name, would we destroy this ability?  If it's not a class name, then it becomes a map pair.  Even then, wouldn't the deserializer consume this information?  Wouldn't it consume the format string, for example?

I do agree, though, that if we have much more than a class name attribute, we'll end up with a DOM that doesn't map well into anything.

The only other two attributes we have are format and default (right?).  I'm uncomfortable with both of them, even if they are clever.  I don't think either is worth its cost in complexity.


- It provides for a unified, single way to handle every
YAML document using a trivial (map/list/scalar) API,
regardless of the functionality which is optionally
layered on top of that. This allows simple utilities
to ignore the complexities of the layered processing.

Your proposal definitely does this, and I think we have to accomplish this, but I'd rather do it by pre-defining the attributes.


Example:

The YAML parser itself would be simpler since it would
know nothing of types, references, etc. These would be
provided by a separate layer. Don't use it - don't even
*write* it - if you don't want to use it. This may be
relevant to embedded devices etc. (e.g., expanding
references may require excessive memory consumption on
such a device).

A YAML diff program should be completely oblivious to
types, references etc. It can work at the most basic
map/list/scalar level. It can be written in straight
Perl, etc.
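
A minimal sketch of such a diff, in Python rather than Perl, assuming the document has already been loaded into native dicts, lists and scalars. The function name and the "/"-separated path format are illustrative choices, not part of any YAML proposal.

```python
def diff(old, new, path=""):
    """Yield (path, old, new) wherever two native trees differ.

    None marks a missing entry, so this sketch cannot tell a missing
    key apart from a key whose value is None.
    """
    if isinstance(old, dict) and isinstance(new, dict):
        for key in sorted(old.keys() | new.keys(), key=str):
            sub = f"{path}/{key}"
            if key not in old:
                yield (sub, None, new[key])
            elif key not in new:
                yield (sub, old[key], None)
            else:
                yield from diff(old[key], new[key], sub)
    elif isinstance(old, list) and isinstance(new, list):
        for i in range(max(len(old), len(new))):
            sub = f"{path}/{i}"
            if i >= len(old):
                yield (sub, None, new[i])
            elif i >= len(new):
                yield (sub, old[i], None)
            else:
                yield from diff(old[i], new[i], sub)
    elif old != new:
        yield (path, old, new)
```

Note that the type key "!" is walked like any other map key; the diff never needs to know it is special.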

A YAML verification program would easily be able to
handle broken references, unknown types etc. since
it would NOT expand them on reading. It would just
treat them as normal maps, and do whatever verification
is required without having to somehow hack the YAML
parser.

A YAML-T processor would be able to refer to the type,
reference etc. information using the normal YAML-Path
operations, on both input and output.

So, to summarize, I *strongly* believe that we should
base our information model on simple, unattributed,
map, list and scalar data types.

Accepting (1), (2) becomes a matter of taste. One of
YAML's goals is to be human-readable. This means that:

delivery: %
    ! : date
    = : 2001-07-31

Is unacceptable (at least, Brian and I seem to share
this view). The question becomes, what should we use
instead? If

delivery: [!date] 2001-07-31

Looks "like a scalar", that's intentional; the scalar
syntax is very readable. Granted, it may be too much
like a scalar. But many other variants are possible.
How about one of:

delivery: %(!date &17) 2001-07-31

As a compromise? It costs just one extra character
to the current syntax - a good balance between
making the map explicit and maintaining a readable
format.
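
To make the shorthand-to-map equivalence concrete, here is a rough Python sketch. It assumes that every token inside the parentheses is a one-character key followed by its value (so "&17" would become the pair "&": "17", which this email does not spell out) and that the trailing text is stored under "=". The function name and the regular expression are illustrative only.

```python
import re

# Hypothetical pattern for the proposed "%(...) value" shorthand.
SHORTHAND = re.compile(r"^%\(([^()]*)\)\s*(.*)$")


def expand_shorthand(text):
    """Expand '%(!date &17) 2001-07-31' into an explicit map.

    Assumption: each token is a one-character key plus its value,
    and the text after the parentheses goes under the '=' key.
    """
    match = SHORTHAND.match(text)
    if match is None:
        return text  # not shorthand: leave it as a plain scalar
    tokens, value = match.groups()
    node = {}
    for token in tokens.split():
        node[token[0]] = token[1:]
    node["="] = value
    return node
```

Under these assumptions the shorthand line and the explicit "! : date / = : 2001-07-31" map load into the same native structure.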

I think one of YAML's strongest advantages is that the data looks like what it is.  As clever and clean as this proposal is, I think it weakens that advantage.


I think this answers Clark's misgivings:
>    [The current proposal]
>    has two problems:
>
>    1. It involves an implicit map, which
>       I don't think is obvious.  Looking
>       at the first item I'd assume it is
>       a scalar... not a map.

Would the %(...) syntax solve that?

>    2. It's a bit pathological, but what about
>       a user-defined mapping which has
>       two keys... ! and = ? 

What about it? It is a perfectly legal alternative
way to write the exact same map, just like:

key: "\
    v\
    a\
    l\
    u\
    e\
    \
    \"

Is a perfectly legal way to write the exact same pair
as:

key: value

Of course, in both cases this is a rather ugly way to
write the same information...

If the problem is that we have reserved the use of some
single-character keys for our purposes (so that an
application trying to use them "normally" would have
problems), yes, that's a concern. I'm not certain
how serious it is in practice. Every language makes
some things reserved...

We can limit the set of shorthand keys in some other
way. For example,

delivery: %(!date) 2001-07-31

Could be a shorthand to:

delivery: %
    __!__ : date
    __=__ : 2001-07-31

Eeeks!!


(This is similar to the python way of reserving keys).
Or we could find some other creative way to minimize the
pattern of reserved keys...

However as long as we accept (1) above, we *must* have some
set of reserved keys. Hmmm... Is this a good time to raise
the namespace issue for keys? :-)

Other issues:

- I love Clark's idea to make the top level production be
one of a map or a list. Best of both worlds indeed. Way to
go, Clark!

Maybe I missed this.  Is this the one that requires lookahead?  How do you tell which you have?  I think lengthy lookaheads are dangerous.


- As for using ^ instead of % for 'transfer encoding', I like
Clark's notion of using | for process chain instead:

picture: %(!image/bmp|gzip|base64 &17) ...

This removes the need for ^ or %. It also means that the
value of a shorthand key may contain any character except
for white space and ( ). Seems reasonable...
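
A hedged sketch of how a reader might undo such a process chain, in Python. It assumes the chain lists the steps in the order they were applied when writing (type first, then gzip, then base64), so that reading undoes them from right to left; the email does not state the order. The DECODERS table and both function names are hypothetical.

```python
import base64
import gzip

# Hypothetical decoder table covering only the encodings in the example.
DECODERS = {"gzip": gzip.decompress, "base64": base64.b64decode}


def parse_chain(shorthand_value):
    """Split 'image/bmp|gzip|base64' into (type, [encodings])."""
    type_name, *encodings = shorthand_value.split("|")
    return type_name, encodings


def decode(shorthand_value, payload):
    """Undo the chain, assuming steps were applied left to right on output."""
    type_name, encodings = parse_chain(shorthand_value)
    data = payload
    for name in reversed(encodings):
        data = DECODERS[name](data)
    return type_name, data
```

With this reading, the "|" separator is the only character the chain needs beyond the type name itself.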

I'm not sure why we're inclined to support encodings.  Because of its mandatory indentation, I don't think YAML is a good format for binary data.  I guess there are text encodings too, and you might want to designate that.


Have fun,

        Oren Ben-Kiki

I think we're discussing the information model, hence my next post...

~Joe