RE: [Yaml-core] a 4-stages 3-process model

It is difficult to plunge into such a discussion and answer each previously
raised point individually. I had a night to sleep on this and had many
ideas, most of which you guys have already raised.

I started thinking that forcing the progression from syntax to native model
into a linear path just won't do. This has been raised in several posts
(e.g., the "cube" notion). Clark's representation is almost exactly what I
had in mind:

>              SYNTAX
>                |
>              EVENTS
>             /      \
>         TYPED      UNTYPED
>         EVENTS     GRAPH
>            \       /
>              TYPED
>              GRAPH 

I presume from the above that the native data structures of an application
are considered to be a "typed graph". Correct?

Then thinking about it some more I had some doubts. These focus on
TYPED-EVENTS. It isn't clear to me that this is (1) useful in any way; and
(2) that it is logically consistent.

Usefulness: I can't think off the top of my head of any code that I'd want
to run at that level. Can anyone come up with an example that would want to
do so?

Consistency: If I have a collection that contains an alias to a node, how
can I realistically "type" it? Certainly I can type _scalars_... But
collections seem to require tracking all contained alias nodes to their
referents - that is, using a graph.

Hence the above tangle is "inevitably" converted to the neat progression:

  SYNTAX
    |
  EVENTS
    |
  UNTYPED
  GRAPH
    |
  TYPED
  GRAPH

Many of our problems about YPATH etc. were caused by the lack of an
UNTYPED-GRAPH model in the current spec. The current spec looks like:

  SYNTAX
    |
  EVENTS
    |
 (TYPED)
  GRAPH
    |
  NATIVE

Now, NATIVE is just another way to represent a typed graph (as has been
pointed out in some posts) and it isn't really a separate model (which was
another problem with the current spec). If you look at our discussions from
a certain angle, it is almost all about whether the graph is typed or not;
in fact the core of my "unknown types" proposal was that it is _not_ typed.

Perhaps we can continue to use the names SYNTAX, SERIAL, and (untyped)
GRAPH, and rename NATIVE to TYPED:

SYNTAX -(parser)-> SERIAL -(deserializer)-> GRAPH -(loader)-> TYPED
TYPED -(dumper)-> GRAPH -(serializer)-> SERIAL -(emitter)-> SYNTAX

I think that nicely captures the differences in semantics of each model. I
also think that it is consistent with everyone's proposed models (Tom's
included :-).

It is also accepted by all that any specific piece of code may use
information from any level it wishes to. It is OK to write a YPretty that
provides some type-specific processing on the one hand (e.g., reformats
integers from decimal to hex), while at the same time explicitly preserves
key order and throwaway comments.

So far I believe I have expressed a "consensus" reached on the list. Now all
that's left is the "small" issue of the definitions of the separate models.
The problem is these "small" differences have "big" consequences later on.

I thought that to discuss this in any concrete way we need to do so in the
context of applications that make use of particular models. This turns out
to be very hard to do, because our two canonical examples straddle models.

YPretty works mostly at the SYNTAX level; it is aware, for example, of
comments, how long lines were wrapped, how characters were escaped and so
on. On the other hand, YPretty may reach as far as the TYPED model, e.g.,
to convert the format of integers or of Booleans.

YGrep works mostly at the TYPED level. It needs to be able to do
type-specific comparisons. However, when a type is not "known", it must
resort to using the GRAPH model. Also, YGrep may not be above reaching down
as far as the SERIAL model (e.g., it is a nice property of a YGrep to emit
matched keys in the same order as they were given in the input text file).

In fact, the only piece of code that we can say anything clear about is the
"application itself" that works in the TYPED model and _only_ in the TYPED
model. As a result we have two anchor points - the SYNTAX and the TYPED
model - and anything else is subject to debate because YPretty and YGrep can
do their work "anyway".

Another way to try to clarify what is where is to focus on the modules that
convert between the models. That turns out not to be a big help either;
there's nothing "obviously" wrong with the parser doing typing according to
regexps or the loader doing so (to give an extreme, but very relevant,
example).

A weaker way to resolve issues is to consider any syntactical effects they
may have, and decide according to our goal priorities: readability first;
interaction with scripting languages second; using native data structures
third; and so on.

I think we all mostly agree on readability; the syntax we have is pretty
stable so that minor changes to it are easily labeled as more or less
readable (e.g., () around integers are obviously less "readable"). The
second and third goals require we determine the target scripting languages.
We've focused on Perl and Python, with me watching out for Java, and we now
have Ruby and PHP folk as well, which is a pretty good cross-section of
languages, I think.

So. Enough chatter. I want to respond to Clark's 4-model proposal in the
context of all the above, especially the latter point.

> [SYNTAX] -> [EVENTS] -> [GRAPH] -> [NATIVE]
>      (parser)    (loader)    (binder)
> 
> [SYNTAX] <- [EVENTS] <- [GRAPH] <- [NATIVE]
>     (emitter)    (dumper)    (un-binder?)

I suggested above a rename as follows:

[SYNTAX] -> [SERIAL] -> [GRAPH] -> [TYPED]
     parser   deserializer    loader

[SYNTAX] <- [SERIAL] <- [GRAPH] <- [TYPED]
       emitter    serializer  dumper

Since "native" is one way to implement "TYPED", this keeps the loader and
dumper where we all expected them to be.

> [SYNTAX]
>    The syntax stage represents a YAML text file or a 
>    data from a socket.  It must follow the productions
>    as given below.   Included in this information is:
>      - declarations  (#YAML:1.0 for example)
>      - comments      (# something or other)
>      - encoding      (which encoding was used)
>      - styles        (which syntax forms were used)

Syntax also includes where long lines were broken, what indentation was
used, which escape sequences were used, and so on. Think of syntax as the
result of running a lexer on the text file. The next step is obviously:

> (parser)
>    The parser reads information from the SYNTAX stage
>    and produces EVENTS, which is a discrete serial
>    representation of the information.  

I think SERIAL model is a better term. "Events" smacks of a particular
implementation technique, and not the preferred one at that; I'd much rather
use a pull parser than a push one.
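
To illustrate what I mean by "pull", here is a minimal sketch in Python. It
is not taken from any YAML library; the function name and the item names are
made up for this mail, and plain dicts and lists stand in for real parser
input. The SERIAL model is exposed as an iterator that the consumer advances
itself, instead of a set of callbacks that the parser invokes.

```python
def serial_items(node):
    """Yield a flat stream of (kind, value) items for a nested structure.

    Hypothetical item names; dicts stand in for mappings, lists for
    sequences, and anything else is treated as a scalar.
    """
    if isinstance(node, dict):
        yield ("start-mapping", None)
        for key, value in node.items():
            yield from serial_items(key)
            yield from serial_items(value)
        yield ("end-mapping", None)
    elif isinstance(node, list):
        yield ("start-sequence", None)
        for item in node:
            yield from serial_items(item)
        yield ("end-sequence", None)
    else:
        yield ("scalar", str(node))


# The consumer decides when to ask for the next item.
stream = serial_items({"a": [1, 2]})
first = next(stream)       # pull one item
rest = list(stream)        # pull everything that is left
```

A push interface would deliver the same items in the same order; the only
difference is who drives the loop.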

> [EVENTS]
>    Each YAML document is reported as a Node.
>    A node is either a scalar, sequence, pairing, or alias.
>    A scalar is a series of zero or more unicode characters ("string")
>    A sequence is an series of zero or more nodes
>    A pairing is a series of zero or more pairs, where a 
>      pair is exactly two nodes, a key and a value.
> 
>    Each scalar, sequence or mapping node also has a transfer string.
>    For those nodes without an explicit transfer in the SYNTAX 
>    representation, the following rules are used:
>    - If the node is a sequence, it's transfer is !sequence
>    - If the node is a pairing, it's transfer is !pairing
>    - If the node is an unquoted non-block scalar, its 
> transfer is !unknown
>    - Else, if the node is quoted or block, its transfer is !string

Here I disagree (one of these "small" issues :-). I know I agreed with the
above in the past, but after some reflection I think that a better way is:

     - If a scalar node has no explicit format, the format is "implicit".
     - Collection nodes have no format.
     - If a node has an explicit type family, that's it. Otherwise:
     - If the node is quoted or block, its type family is !string
     - If the node is a pairing, its type family is !implicit-pairing
     - If the node is a sequence, its type family is !implicit-sequence
     - If the node is a scalar, its type family is !implicit-scalar

This is a big deal, really, because it means that just like the loader can
use string-regexps to decide how to type implicit scalars, it may use
ypath-exps to decide on how to type collections. I see no big difference
between saying that \d+ is an implicit integer and saying that a map that
contains only x and y keys is a point.
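
A rough sketch of that symmetry, in Python. Everything here is hypothetical:
the rule tables, the !int and !point families and the function names are
only for illustration, and a plain predicate stands in for whatever a
ypath-exp would eventually be able to express. I read "only x and y keys" as
"exactly the keys x and y".

```python
import re

# Scalar rules: a regexp over the character content picks the type family.
SCALAR_RULES = [
    (re.compile(r"\d+"), "!int"),
]


def is_point(mapping):
    """Structural rule: the mapping has exactly the keys x and y."""
    return set(mapping) == {"x", "y"}


# Collection rules: a predicate over the node's content picks the family.
COLLECTION_RULES = [
    (is_point, "!point"),
]


def resolve_scalar(text):
    """Type an implicit scalar; fall back to !str when no rule matches."""
    for pattern, family in SCALAR_RULES:
        if pattern.fullmatch(text):
            return family
    return "!str"


def resolve_mapping(mapping):
    """Type an implicit mapping; fall back to !map when no rule matches."""
    for predicate, family in COLLECTION_RULES:
        if predicate(mapping):
            return family
    return "!map"
```

Both resolvers have the same shape: walk a list of rules, take the first
match, otherwise fall back to the generic family for that kind.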

This is a marked increase in readability, since it greatly reduces the need
to provide an explicit transfer. On the other hand, it places a greater
burden on the typing-schema (see below).

>    Each scalar, sequence, or pairing may have an anchor string
>    
>    An alias has an anchor string, such that at least one node
>    occurring prior in the stream has an equal anchor.
>   
>    Each node has exactly one parent with which it participates.
>  
>    Notes:
>      - The pairing preserves key order
>      - The pairing allows for duplicates
>      - These are the minimal elements, more
>        informational elements from the SYNTAX
>        model, such as style may be included

Well, yes; _all_ models define the "minimal information" and libraries are
welcome to provide more.

>      - You can think of this as similar to XML's SAX

Exactly.

> (loader)

Rename: de-serializer.

>    The loader's reponsibility is to convert the EVENTS
>    into an in-memory GRAPH.   It does this by replacing
>    anchors with the most recently referenced node with
>    the same anchor.
> 
> [GRAPH]
> 
>    This stage is similar to the EVENTS stage only that:
>    - the alias nodes are not present
>    - the anchor property need not be present
>    - the parent property is not present
> 
>    Notes:
>    - each node is allowed to participate in more
>      than one pairing or sequence
>    - You can think of this as similar to XML's DOM
> 
> 
> ===> Here down is only interest to those doing
>      native serialization and transformation
>      applications with YAML.

That's a shade too restrictive, unless one views YGrep as a transformation.
Hmmm. I guess it is at that.
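
Going back to the alias-replacement step for a moment, this is roughly how I
picture it, as a minimal Python sketch. The item tuples and the function
name are invented for this mail, and only sequences and scalars are handled
to keep it short. Each alias is replaced by the most recent node carrying
the same anchor, so the resulting graph shares nodes and contains no alias
nodes.

```python
def build_graph(items):
    """Turn a flat SERIAL stream into a graph without alias nodes.

    Hypothetical item shapes:
      ("start-seq", anchor), ("end-seq",),
      ("scalar", anchor, text), ("alias", anchor)
    """
    anchors = {}     # anchor -> most recently seen node with that anchor
    root = []        # holder for the single top-level node
    stack = [root]   # collections currently being filled
    for item in items:
        kind = item[0]
        if kind == "end-seq":
            stack.pop()
            continue
        if kind == "alias":
            node = anchors[item[1]]      # reuse the referent itself
        elif kind == "scalar":
            node = item[2]
        else:                            # "start-seq"
            node = []
        if kind != "alias" and item[1] is not None:
            anchors[item[1]] = node
        stack[-1].append(node)
        if kind == "start-seq":
            stack.append(node)
    return root[0]


graph = build_graph([
    ("start-seq", None),
    ("start-seq", "a"),
    ("scalar", None, "x"),
    ("end-seq",),
    ("alias", "a"),
    ("end-seq",),
])
```

Here both children of the outer sequence are the same object, which is
exactly the "more than one parent" property noted for the GRAPH stage.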

> (binder)

Rename: loader.

>    The binder takes information from the GRAPH and
>    resolves all types.  The result is a graph using
>    data types NATIVE to the given system.

Rename: TYPED.

OK, this is subtle. "By definition" all the data returned from any YAML
library at any level is native (you can't write a C library that doesn't
return a C data type, even if it is "void*"). In that sense the above is an
empty statement.

The idea is that the representation _may_ be just the normal native data
structures used by some application. It may also be some sort of YamlNode.
It may be a combination (e.g., when a mixture of known and unknown types are
used). Hence there's no need for a separate "functional" vs. "native" model;
the current draft has this distinction (generic vs. native) and it serves no
useful purpose.

>    - If a given node's transfer is !unknown then 
>      a warning is issued to the user, and it is
>      treated as if it were a !string.   This 
>      warning can be supressable, and need not
>      be duplicated for each node..

!implicit-scalar => !str
!implicit-pairing => !map
!implicit-sequence => !seq

Hmmm. We have 'pairing' vs. 'mapping', but use 'sequence' across the levels;
we should either find two names for it, or use a single name for mapping.
I'm not certain using two names doesn't confuse the issue. We've had a phase
where we had a separate name for each level and it didn't work too well.
I'd stick with 'mapping' all the way.

Another small point: if the format of a scalar node is 'implicit' it is
converted to 'canonical'.

>    - If a given node's transfer is not known by
>      the binder (the binder is unable to create
>      a native representation); then an error
>      condition occurs and the given text does
>      not have a NATIVE representation.
>
>    - For each pairing, if the set of key nodes
>      contains a duplicate; then an error condition
>      occurs and the given text does not have a 
>      NATIVE representation.

The following is a bit of a leap. It refers to alternative ways to handle
!implicit-<kind>. So first we need to say that a loader is allowed to
convert implicit type family and format to explicit type family and format.
The following refers to constraints on how it does that:

>    - When constructing a NATIVE binding for a scalar
>      node, the conversion must be only a function of
>      the node's character content and transfer.

That is, deciding on the type family and format of scalars can only be
driven by regexps, _not_ by the path to the node. The equivalent for
collection nodes is that deciding on the type family may only depend on the
content of the node, not its path.

Hmmm. I see where this is coming from; it makes implicit typing more
manageable. The problem is that I can't say that, e.g., the 'zip-code' field
is a string no matter what; if integers are used anywhere in the document,
I'd have to quote the field. I'm neutral here right now; I think we need to
discuss this a bit, though.
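
To make the trade-off concrete, a small Python sketch. The function names,
the !int family, the path syntax and the override table are all made up
here; this is not a proposal for an API. The first function obeys the
constraint (content only); the second is the path-aware variant that the
constraint rules out.

```python
import re

INT = re.compile(r"\d+")


def type_by_content(text):
    """Allowed by the constraint: the result depends on the text alone."""
    return "!int" if INT.fullmatch(text) else "!str"


def type_by_path(path, text, overrides):
    """Ruled out by the constraint: the node's path can change the result."""
    return overrides.get(path, type_by_content(text))


# With content-only typing an unquoted zip code is typed like any integer.
content_only = type_by_content("02134")

# A path-aware schema could say "this field is always a string".
path_aware = type_by_path(
    "/address/zip-code", "02134", {"/address/zip-code": "!str"}
)
```

With content-only typing the author of the document has to quote the field
to get a string; with path-aware typing the schema carries that knowledge
instead.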

>    - When constructing a NATIVE binding for a pairing
>      node, the relative order of the pairs may not
>      be taken into account, nor any other information
>      other than the nodes which are paired.
>
>   Notes:
> 
>     - This is the built-in binder for a YAML
>       system and is optimized for "object serialization"
>       of common languages like Perl and Python; it is
>       an 80% solution.

For the PHP people: my proposal above that you can implicitly type
collections means that you can de-serialize any node that is a sequence of
one-key pairings into a PHP thingie (what's its name, anyway?). No !omap
required. Isn't YAML great?

And it doesn't harm interoperability. Perl will load it into a list of
hashes, but what else can it do? Unless you add an implicit rule there to
load it into some specialized class as well.
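
As a sketch of what such an implicit rule could look like (in Python rather
than PHP, with made-up function names, and with a list of pairs standing in
for whatever ordered structure the target language offers):

```python
def is_ordered_map(node):
    """Structural rule: a sequence whose items are all one-key mappings."""
    return isinstance(node, list) and all(
        isinstance(item, dict) and len(item) == 1 for item in node
    )


def load_ordered(node):
    """Load a matching node as ordered (key, value) pairs; else leave it."""
    if not is_ordered_map(node):
        return node
    return [next(iter(item.items())) for item in node]


pairs = load_ordered([{"b": 1}, {"a": 2}])
untouched = load_ordered([{"b": 1, "c": 3}, {"a": 2}])
```

A loader without this rule simply keeps the list of one-key mappings, which
is the Perl behaviour described above.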

>     - The user may provide their own binder, to 
>       transfer into a application-specific structure;
>       for example, into an XML DOM or native C structs.

The output of the loader isn't specified anyway (other than "some native
data types"). The one constraint is that it should be possible to extract
the type family and the same canonical value from the resulting native
object. That's all.

Given the restrictions you stated above, you could just say that the loader
may be given a list of regexps (and y-exps) to drive the implicit type
family and format resolutions. That's all the customization you really
allow.

>     - The loader and binder can be combined into
>       a single step as an optimization to skip the
>       creation of a generic graph intermediate. 

Yes.

> [NATIVE]

Rename: TYPED (see above).

>   The native representation for YAML is a graph of
>   native objects.  Objects are either Scalar or 
>   Collections.  Collections are functional, "mappings".

The thing is, it is a requirement that the type family (and the canonical
value) should be implicit in the object; that is, that one should be able to
"get" them somehow. Each type family implies a single kind
(mapping/sequence/scalar). Lump collections together as much as you want,
I'd _still_ be able to implement an isMapping and isSequence function. So it
isn't that I disagree as much as it is simply impossible given the previous
models.

But you knew that already :-)
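
For the record, the argument in code form, as a minimal Python sketch. The
registry and the !int and !point entries are hypothetical; the only thing
assumed is what I said above, that a type family can be obtained from an
object and that each family implies exactly one kind.

```python
# Hypothetical registry: each type family names exactly one kind.
FAMILY_KIND = {
    "!str": "scalar",
    "!int": "scalar",
    "!seq": "sequence",
    "!map": "mapping",
    "!point": "mapping",
}


def kind_of(family):
    """Return the single kind implied by a type family."""
    return FAMILY_KIND[family]


def is_mapping(family):
    return kind_of(family) == "mapping"


def is_sequence(family):
    return kind_of(family) == "sequence"
```

So as long as the family is recoverable, the mapping/sequence distinction is
recoverable too, no matter how the native model lumps collections together.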

Perhaps it is a good point to say a word about type families. Basically,
they are the same as in the current spec; each implies a kind, scalars have
formats (canonical, implicit and explicit). A new feature I suggested above
is that collection type families have a single y-exp that may be used for
implicit typing. Since this is new and has implications, we'll need to
discuss this a bit; I think it is required for symmetry and that it solves
many problems (e.g., the PHP one). A note: much depends on what this y-exp
can check; I expect it at minimum to be able to specify the kind and/or type
family of the sub nodes, as well as key values for mappings.

>   Note:  Many YAML tools such as YPATH, YSCHEMA, YTRANSFORM,
>          will work on the Native binding of YAML data.
>          The strict functional requirement will be 
>          assumed by these tools.

Work on the TYPED model - for sure. I don't think they'll all be strict
about it, however. I gave the example of using YGrep to find an integer in a
file that also contains an IP address; I think that should work fine.
Obviously validation would be strict, but in general such tools will be able
to employ the GRAPH model when types are unknown.

OK, I have to rush to a meeting. Back in an hour or so.

Have fun,

	Oren Ben-Kiki