Thread: RE: [Yaml-core] yar


Hi Guys,

I'm back at last from my convention (which was lots of fun, and educational
too). I finally had the time to catch up on the 50 messages in this list
since I left... You two have sure been busy, and doing a great job.

I also went through the YAML spec draft - great going, Clark - first "on its
own" and second after reviewing the messages for context. So, at the risk of
this becoming a long message :-), here's my view of the current state.

=== Rationale ===

A good rationale is harder to write than the original spec. On the other
hand, a spec with a good rationale is much more useful. I think we should
give this some thought. We have the advantage that we don't meet
face-to-face, so this list provides a good record of how each point was
decided...

=== Encoding/Binary ===

It seems as though the problem here is that we are trying to handle two
separate "data types" in the same way. A string of (Unicode) characters is a
different beast than a binary blob of bytes.

We have decided that YAML will be "printable" - that is, a YAML file is a
string, using only the "printable" subset of the Unicode characters.
Implicitly this means it may contain characters beyond 7-bit ASCII - but
these are still Unicode characters and not arbitrary binary bytes.

Therefore, when a binary blob is written into YAML format it must be
encoded. Clark suggested using base64 as the universal way to achieve that.
This seems reasonable, with the objection that it is a horribly wasteful way
to encode binary blobs under UTF-16. I wonder if there is a base4096 or
something for that case? I never heard of one...
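
As a rough illustration of the overhead, here is a minimal Python sketch.
The blob is made-up data, and the only assumption is that the base64 text
is stored as UTF-8 in one case and as UTF-16 (without a BOM) in the other:

```python
import base64

# 768 arbitrary bytes; a multiple of 3, so base64 needs no padding.
blob = bytes(range(256)) * 3

# base64 maps every 3 input bytes to 4 ASCII characters.
text = base64.b64encode(blob).decode("ascii")

utf8_size = len(text.encode("utf-8"))       # 1 byte per base64 character
utf16_size = len(text.encode("utf-16-le"))  # 2 bytes per base64 character

print("blob bytes:  ", len(blob))
print("UTF-8 bytes: ", utf8_size, "ratio", utf8_size / len(blob))
print("UTF-16 bytes:", utf16_size, "ratio", utf16_size / len(blob))
```

Under UTF-8 the blob grows by a third; under UTF-16 every base64 character
costs two bytes, so the same blob takes 8 bytes on disk for every 3 bytes
of payload.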

At any rate, note that you can't achieve the same effect by using something
like "\xXX" or "\uUUUU". These escape sequences still denote text *characters*,
not blob *bytes*. This distinction isn't automatic to old-time C programmers
(like myself) used to "char == byte". But it is essential to make it,
otherwise things get very messy.
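
The character/byte distinction can be seen in a language that keeps the two
types apart. This is only a sketch in Python, which is not one of the
languages the spec discusses; the value 0xE9 is an arbitrary example:

```python
text = "\u00e9"   # one text *character* (e with acute accent)
blob = b"\xe9"    # one blob *byte* with the value 0xE9

# Both have length 1, but they are different kinds of thing.
assert len(text) == 1 and len(blob) == 1
assert text != blob

# The character has no fixed byte form; it depends on the encoding.
as_utf8 = text.encode("utf-8")       # two bytes: C3 A9
as_utf16 = text.encode("utf-16-le")  # two bytes: E9 00

# The lone byte 0xE9 is not even valid UTF-8 text.
try:
    blob.decode("utf-8")
    blob_is_text = True
except UnicodeDecodeError:
    blob_is_text = False

print(as_utf8, as_utf16, blob_is_text)
```

So an escape such as "\xE9" in a text scalar names a character whose byte
representation varies with the file encoding, which is exactly why it can't
stand in for a blob byte.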

Clark suggested that all scalar syntaxes are to be equivalent - it doesn't
matter if one is using base64, quoted string, block or "simple value"; the
in-memory result is the same. As shown above, this can't be true for
*writing* values; a binary blob may only be written in base64. Why,
therefore, not make the same distinction when *reading* values?

Supposing the information model does make this distinction, then Brian's
(wonderful) YAR format becomes possible, regardless of encoding. If a file
contains only or "mostly" printable characters, then emit it as a text
block. Otherwise, emit it as a base64 blob. This requires a two-pass
algorithm through the file, but I think there's no helping that. On the Mac
and Be, if the MIME type is text/*, emit it as a block, otherwise as a
base64 blob.
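
The two-pass decision described above might look like the following Python
sketch. Everything here is an assumption made for illustration: the
function names, the UTF-8 default, the 95% threshold standing in for
"mostly", and the rule for what counts as printable. It returns a
(kind, payload) pair rather than guessing at the final YAML syntax:

```python
import base64
import unicodedata


def is_printable(ch):
    # Assumption: tab, newline and CR are acceptable in a text block;
    # other control, surrogate and unassigned code points are not.
    if ch in "\t\n\r":
        return True
    return unicodedata.category(ch) not in ("Cc", "Cs", "Cn")


def classify(data, encoding="utf-8", threshold=0.95):
    """Pass 1: scan the whole file and decide 'text' or 'blob'."""
    try:
        text = data.decode(encoding)
    except UnicodeDecodeError:
        return "blob", None          # not valid text in this encoding
    if not text:
        return "text", text          # empty file: trivially printable
    printable = sum(1 for ch in text if is_printable(ch))
    if printable / len(text) >= threshold:
        return "text", text
    return "blob", None


def emit(data):
    """Pass 2: produce the payload in the form chosen by pass 1."""
    kind, text = classify(data)
    if kind == "text":
        return "text", text
    return "base64", base64.b64encode(data).decode("ascii")


print(emit(b"hello\nworld\n"))
print(emit(b"\x00\x01\x02\xff"))
```

On a system that records a MIME type per file, classify() could be
replaced by a check for a "text/" prefix, which avoids the first pass.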

How to make this distinction in the data model is an open issue, of course.
In Java, which was born Unicode-aware, there's no problem distinguishing
between a String and a byte[]. I assume something similar may be done in
Python and Perl. C is trickier - I don't quite see how to work it into the
API - but C programmers already know that "char = byte" doesn't mean "string
= blob"; a string ends with a \0, a blob has a length.

=== Top Level Production ===

I see we are back to "list of maps, separated by blank lines". Great. The
API basics given in the YAML spec aren't explicit on how this translates to
actual calls. I'd expect that the "next()" calls at the top-level of the
parser will return a map node, one per each "map block" in the file. As for
the emitter, I'd expect that repeatedly calling "begin()" and "end()" on the
top-level cursor will emit multiple map blocks, but the text doesn't seem to
support this. I think that this should be made explicit.
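
To make the expectation on the parser side concrete, here is a
deliberately tiny Python sketch. It is not the spec's API or syntax: the
class name is hypothetical, and it only understands flat "key: value"
lines, with blank lines separating map blocks:

```python
class TopLevelParser:
    """Hypothetical top-level cursor: each next() returns one map block."""

    def __init__(self, text):
        self._lines = iter(text.splitlines())

    def next(self):
        """Return the next map block as a dict, or None at end of input."""
        block = {}
        for line in self._lines:
            if not line.strip():
                if block:        # a blank line closes a non-empty block
                    return block
                continue         # skip leading or repeated blank lines
            key, _, value = line.partition(":")
            block[key.strip()] = value.strip()
        return block or None     # last block, or None when exhausted


parser = TopLevelParser("a: 1\nb: 2\n\nc: 3\n")
first = parser.next()
second = parser.next()
third = parser.next()
print(first, second, third)
```

The emitter would be the mirror image: each begin()/end() pair on the
top-level cursor writes one block followed by a blank line.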

=== Classes and Color ===

Clark made a good case against comments - we can't use them because either
they aren't part of the information model, and won't round-trip; or they
round-trip and hence must be a part of the data model. He's right.

The same argument carries over to classes. The current spec states that if a
class is not recognized, a warning is emitted and the class is ignored. This
is unacceptable - consider a YAML pretty-printer: it will recognize a very
small set of classes, if any. Yet it must preserve the class names.

The problem seems a classical place to use the color idiom. We have a piece
of information - the class name - which we want to attach to any nodes in
the YAML document, including scalar nodes. Some applications (the pretty
printer) aren't interested in this information, but must preserve it. Other
applications (e.g., a YAML-based application server) rely on this
information.

There's no problem for attaching such information to map nodes. Use some
special key for the class info - say, '#' - and you are done. For scalar and
list nodes, the color idiom suggests you wrap them in a map node. Use the
same key for specifying your "color", and another special key (say, '=') to
specify the *value* of the node:

    delivery: %
        = : 2000-JAN-10
        # : date

The trick is that the random-access APIs should allow you to call
"getValue()" on the 'delivery' node and obtain the date value. This is
acceptable to an interface such as XML's DOM. It probably could work in
languages dynamic enough to hack the type system into doing this implicitly:
de-serialize 'delivery' into something which *behaves like* the string
'2000-JAN-10' but is *at the same time* a map with two keys.
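
In a dynamic language this could be sketched roughly as follows. This is
Python, and the class name and the "color" attribute are hypothetical; the
'=' and '#' keys are the ones proposed above:

```python
class ColoredStr(str):
    """A string that also answers to the two-key map view ('=' and '#')."""

    def __new__(cls, value, color=None):
        self = super().__new__(cls, value)
        self.color = color           # the class name, carried along
        return self

    def keys(self):
        return ["=", "#"]

    def __getitem__(self, key):
        if key == "=":
            return str(self)         # the plain value
        if key == "#":
            return self.color        # the class info
        return super().__getitem__(key)  # ordinary string indexing


delivery = ColoredStr("2000-JAN-10", color="date")

print(delivery == "2000-JAN-10")     # behaves like the string
print(delivery["="], delivery["#"])  # and like the two-key map
print(delivery[:4])                  # string slicing still works
```

A pretty-printer can pass such a node through untouched, writing the '#'
entry back out without knowing what "date" means, while an application
that does know the class reads it from the same node.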

Complex? Sure, but Brian said, "don't worry about the implementation" :-)

Seriously, I think there's cause to worry here. The color idiom has turned
out to be the solution for many practical problems - not the least of which
is schema evolution. Maybe you guys can come up with a better way to do it
than wrapping scalar/list nodes in a map; I couldn't find any. But I
strongly feel that YAML should address this somehow - and that classes are
the perfect way to "eat our own dog food" in this regard.

=== Text syntax ===

Do we really need so many different formats? *4*? I see why we'd want base64
- that's not really a text format, anyway. I can see why we'd want one
format for cut-and-paste and one format for allowing escaping. That would be
the block format and either the string or the simple scalar format. There's
also the issue of multi-line values allowed in a map but not in a list. So
we actually have 4.5 formats (3.5 for text, one for blobs).

Do we really need all of these? Can't we make do with just one block format
and one stream format?

I also strongly dislike the fact that indentation rules are suspended in a
string value. Why is that? I see the pain - one won't be able to, say,
quickly skip a sub-tree by simply looking for a properly indented line;
readability suffers, etc.

I don't see the gain: Cut&paste by itself isn't a good enough answer - if
you are doing it by hand, any editor will easily indent the new block of
text, and if you are doing it by a tool, well just add the indentation while
you are at it.

In E-mail messages Clark said that base64 blocks are indented, but the spec
says otherwise. Which is right? What's the harm in indenting blobs? We
already allow long lines in blocks... in fact, there's no longer a way to
break a long line in a block into two shorter ones.

I realize you had just about enough of this syntax issue, and that I'm fresh
out of a 10 day vacation from it, but I think that either some strong
rationale or some more work on this is due. I'll try to see if I can come
up with any new ideas...

=== E-mail address ===

I'd rather you list me as or...@be... and not or...@ri... since
YAML isn't very related to my day job and also because the other address is
more stable - I hope to keep it for a long time to come, while companies
come and go.

Sorry for the long message - I had a lot of catching up to do.

Have fun,

    Oren Ben-Kiki
