RE: [Yaml-core] Plan of battle

Yet another long post... But I think we are getting somewhere, so it may be
worth reading anyway :-)

Clark C . Evans [mailto:cc...@cl...] wrote:
> I look at it this way:
> 
>   a) We eliminate the short-hand notation so that it is
>      not confusing what is a map and what is a scalar.

After much agonizing, I independently came to this conclusion. In fact, I'd
like to take it a bit further.

Let's finalize the YAML 1.0 spec purely as a file format (plus requirements
from the data model, of course). Drop all the API stuff, the discussions of
color, etc. Give up on the !class shorthand, but we'll reserve all
single-indicator-character keys for "special semantics" to be defined in
separate API specs. Call this spec YAML-CORE. I could probably finalize it
in a weekend (given we settle the binary/Unicode issue).

There are going to be two types of APIs. Let's call them "Native" and
"Incremental" APIs. I suggest we spec them in a separate document from the
core YAML spec. First, we'll close specs faster. Second, including just one
API in the core spec would give it more importance than the other, which
would be wrong. Each of them has its place.

The "Native" API is a simple load/save API using the language native data
structures. We'll spec this quickly (once per language?), and probably
implement it just as quickly. We'll end up with working, very useful YAML
support for Perl (Data::Denter), Python, C#, Java, JavaScript, and possibly
C++ (based on the STL; a bit of a challenge, unless we use RTTI). I suggest that
to make quick progress, we just include a simple definition of v/w in such
specs (as a library function), but not waste too much time hammering out the
details of color behavior. Call this the YAML-NATIVE-API spec.

We then move our focus to defining the "Incremental" API, which is a
(mostly) language-independent push/pull API spec which fully supports color
etc. This would be the YAML-INCREMENTAL-API spec. Much fun here... (I admit
to not catching up on the work you've done there yet).

In a fourth spec we'll go through the color idiom, what the standard colors
are, what standard filters an implementation should/could provide, etc. Call
this YAML-COLOR.

Quite a job, right? But it drives home the point that we simply can't try to
do it all in one spec. We'll never finalize it; we'll be debating a
YAML-INCREMENTAL-API issue, say, and in the meantime we are not releasing a
YAML-CORE spec - because both are included in the same document.

>   b) We state the obvious in the spec -- adding items
>      to a map will, in almost every case, be a safe
>      addition and unlikely to break programs depending
>      upon its structure.

Agreed. This is a YAML-COLOR issue. Keep it out of YAML-CORE, except maybe a
passing statement.

>   c) We state that promoting a scalar to a map or list 
>      may only be supported by particular programs and that
>      this technique must be used with caution.  We can 
>      underscore those cases where a structural upgrade 
>      will work, when the applications accessing the
>      information:

Agreed. Again, a YAML-COLOR spec issue.

>     1) are written to use the visitor/iterator 
>        interface and not dependent on a particular
>        in-memory representation; or

That is, are written using the Incremental API.

>     2) are written in languages which use a 
>        YamlNode instead of native data structures; or

Just another form of the Incremental API (I should hope). I claim that
YamlNode should be the basis for the push API (the visitor visits such
nodes) and hence the pull API (keep a stack of YamlNode, for example).

>     3) are written to use the (w/v) methods as 
>        described.

That is, using the add-on from the Native API. Yes.

>   d) We also talk about the "standard" colors, such
>      as class, comments, classes.  And we discuss the
>      base filters, in particular, how the standard
>      colors can be added (with appropriate promotions)
>      and used with unexpecting older code if the
>      appropriate stripper is used...

Definitely YAML-COLOR.

> Overall, I think it is important to underscore that the
> w/v mechanism is optional.  Most users of YAML will 
> use them if they are "libraries" and don't have control
> of the YAML.  However, one-offs probably won't.  Proper
> filters, of course, would have to be written robustly.

I'd want v/w to be a "must" part of the Native API, in the sense that "you
aren't YAML if you don't provide them" (they are trivial enough to
implement). But they are clearly optional in the sense that you don't have
to use them.

> On the subject of encoding.  I'm starting to like the
> idea of using a ^ indicator for this purpose.

OK, wait a sec. You've written:

> I was taking a walk with my g.f. and discussing this item,
> it seems that there is a tripartite situation:
> 
>             | Java / C#  | Python  | Perl                       |  
>   ----------+------------+---------+----------------------------+
>   ASCII     | String     | String  | Scalar (UTF8 or Otherwise) |
>   UNICODE   | String     | Unicode | UTF8 Marked Scalar         |
>   BINARY    | Byte []    | String  | Not UTF8 Marked Scalar     |

So, if I'm reading this right, you are suggesting we distinguish between
"ASCII text" and "Unicode text". In the information model. Hmmm.

Let's see. In Java/C#, being modern languages, there's no problem using just
two types. That's obviously the "right thing to do". Doing three types would
complicate things for them - how would you round trip an ASCII string via
Java? Forcing such strings to become maps is the wrong answer :-)

In Perl/Python, being older languages, there's an issue using just two
types. This ties in with the trouble they have supporting Unicode in the
first place. Now, if by doing a three-way split we could have avoided using
heuristics - the system would always do the right thing on both read and
write - I'd consider it. But the table above shows that at a minimum, on
output you must use heuristics to distinguish between binary and ASCII.

So, we must rely on heuristics to "do the right thing" in most cases. We
also must provide some mechanism to override the heuristic if it guesses
wrong. E.g., the printer will get an optional list of path expressions which
must be printed as binary or ASCII.
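
To make the override idea concrete, here is a minimal Python sketch of the
printer-side decision, assuming a Python where 8-bit strings are bytes
objects. Everything in it is an illustration only: the name
classify_for_output, the "printable ASCII means text" guess, and the use of
plain strings as stand-ins for path expressions are my assumptions, not part
of any spec.

```python
# Hypothetical sketch: the function name, the heuristic, and the path
# syntax are illustrative assumptions, not taken from any YAML spec.

# Bytes we are willing to call "text": printable ASCII plus tab, LF, CR.
PRINTABLE = set(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0D}


def classify_for_output(value, path, overrides=None):
    """Return 'text' or 'binary' for an 8-bit string about to be printed.

    value     -- a bytes object
    path      -- where the value sits in the document (a plain string here)
    overrides -- optional dict mapping a path to 'text' or 'binary'
    """
    overrides = overrides or {}
    if path in overrides:
        # An explicit override always beats the guess.
        return overrides[path]
    # Heuristic: if every byte is printable ASCII, assume it is text.
    if all(byte in PRINTABLE for byte in value):
        return "text"
    return "binary"
```

The heuristic guesses wrong for, say, a binary header that happens to be
all printable characters, which is exactly the case the override list is
there for.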

In Python (and only in Python) there's also the problem of reading. A
reasonable heuristic would be to de-serialize any purely-ASCII text as an
ASCII string ("use Unicode only when you must"). Again, the API must provide
a way to change this heuristic (to "Always Unicode for text"), and possibly
a way to override this on a more fine-grained level (a list of path
expressions would make sense).
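
The reading side could be sketched the same way. Again this is only a
sketch under assumptions: load_text_kind, the always_unicode flag, and the
string paths are hypothetical names, and the function merely reports which
string type a loader would pick rather than building one.

```python
# Hypothetical sketch of the load-time choice; all names are illustrative.

def load_text_kind(text, path, always_unicode=False, overrides=None):
    """Return 'ascii' or 'unicode' for a text value being de-serialized.

    text           -- the decoded text of the value
    path           -- where the value sits in the document (a plain string)
    always_unicode -- switch the default to "Always Unicode for text"
    overrides      -- optional dict mapping a path to 'ascii' or 'unicode'
    """
    overrides = overrides or {}
    if path in overrides:
        # Fine-grained, per-path override comes first.
        return overrides[path]
    if always_unicode:
        # Coarse override of the default heuristic.
        return "unicode"
    # Default heuristic: "use Unicode only when you must".
    if all(ord(ch) < 128 for ch in text):
        return "ascii"
    return "unicode"
```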

I predict that with time the problem will slowly fade away. Perl and Python
will come to terms with Unicode. So, with time the accuracy of our heuristic
(16-bit == Text, 8-bit == binary) would rise. The "override" capabilities
would be used less and less, and eventually we'll put the problem behind us.

If, on the other hand, we enshrine the concept of "ASCII string" in the YAML
information model, we'll be stuck with it. For example, whatever tricks
we use
in the Java/C# API to support it will haunt us forever.

In short, the issue is ugly. I'd rather concentrate the ugliness in one
single, well-defined point (using heuristics in the parser/printer of
Python/Perl) so that it doesn't harm anything else (our information model,
the Java/C#/future languages APIs, etc.).

If we agree on the above, there are only two (easy) issues to settle to get
the spec to a "pre-release" state (pre-release for me means we concentrate
on polishing issues - language, examples, formatting, etc.). The two issues
are whether we define a canonical form or not, and the details of the 32-bit
characters.

I've no problem with defining a canonical form (4 space indentation,
multi-line values start on a line of their own, use the simplest possible
text style, put a single space between the key and the following ':', one
space between the ':' and the value, use "most folded" white space for text,
break lines after 76 characters but include at least 32 text characters, use
prefix ':' in lists only for multi-line simple scalars, align multi-line
quoted string text by one space as in our examples - I think that covers
it). The problem is that at least some of these decisions are a matter of
taste. It would be different if we defined the canonical form as "the form
with the minimal number of characters" - but then it may not end up being
very readable. And of course we may simply not define one at all. Thoughts?

As for 32-bit characters, I haven't done my homework on that. What does this
mean, exactly? Is it OK to say that YAML uses 16-bit characters and that
32-bit characters are always represented as a "surrogate pair"? So far I've
avoided dealing with these monstrosities - 16-bit Unicode seemed quite
enough, even when dealing with Japanese clients. Clark, can you handle this
issue?
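
For reference, this is how UTF-16 defines a surrogate pair: a code point
above U+FFFF is split into two 16-bit units. The small Python sketch below
shows the arithmetic; the function name to_surrogate_pair is mine and not
taken from any library.

```python
# Sketch of the UTF-16 surrogate pair arithmetic; the function name is
# illustrative only.

def to_surrogate_pair(code_point):
    """Split a code point in U+10000..U+10FFFF into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point fits in 16 bits or is out of range")
    offset = code_point - 0x10000          # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low
```

So a parser that works in 16-bit units sees two units for each such
character, the first in D800..DBFF and the second in DC00..DFFF.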

Have fun,

    Oren Ben-Kiki