Thread: [Yaml-core] And Now For Something Compleatly Different

Well not really completely different but...

Yes, Clark, Oren and I had a wicked debate on IRC. They have tried to
represent me in the follow up posts, but have not really captured the
vision of where I want this to go. I'll try to be explicit and blunt
about my ideas in this post, although when I do this, things still seem
to get misinterpreted. But here goes anyway...

I'll just start by making some assertions in no particular order:

1) A YAML *document* cannot by itself contain all the semantics that all
   applications need for successful processing.

Well, it could of course, but then it would not be YAML. YAML is nice
because it is clean and easy to read. There is just enough information
there that a person can read the document and say, yeah I see where
this is going.

2) Making a YAML document semantically complete in a given context
   requires a thingy we refer to as a "schema". This is a body of code
   and/or text that contains all the hints that, added to the information
   in the document, complete the YAML Information Model.

A schema can be as simple as the instructions in a Python program that
read a config file:

    ---
    name: Brian Ingerson
    age: 21

and know that age needs to be an integer.
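
As a rough sketch of that idea, here is a tiny loader where the "schema" is
just a dict hardwired into the program. The names SCHEMA and load_config are
made up for illustration, and the line splitting only handles the flat
key: value form shown above, not real YAML.

```python
# Hypothetical hardwired schema: the program, not the document,
# knows that "age" must be an integer.
SCHEMA = {"name": str, "age": int}

def load_config(text):
    """Load flat 'key: value' lines, typing each value via SCHEMA."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "---":
            continue  # skip blank lines and the document marker
        key, _, raw = line.partition(":")
        key, raw = key.strip(), raw.strip()
        convert = SCHEMA.get(key, str)  # unknown keys stay strings
        config[key] = convert(raw)
    return config

doc = """---
name: Brian Ingerson
age: 21
"""
config = load_config(doc)
```

Nothing in the document says that age is an integer. That knowledge lives
entirely in the consuming program.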

3) A YAML *parser* knows nothing about typing and tagging. It does not
   cook any data that it parses. It just reports it all to the consumer
   of the parse events. That consumer must be "schema aware" in order to
   know what to do with the document. Again, this can be simple or
   complex. But all the complexity is outside the parser.
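
To make the split concrete, here is a toy sketch assuming a made-up event
format of (kind, value) tuples. The functions parse and consume are
hypothetical and handle only flat key: value documents. The point is just
that the parser hands over raw strings and the consumer decides what they
mean.

```python
import re

def parse(text):
    """Toy parser: yields raw events and never interprets scalar values."""
    yield ("start-document", None)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("---"):
            continue
        key, _, value = line.partition(":")
        yield ("key", key.strip())
        yield ("scalar", value.strip())  # always reported as a plain string
    yield ("end-document", None)

def consume(events):
    """Toy schema-aware consumer: all typing decisions happen here."""
    result, key = {}, None
    for kind, value in events:
        if kind == "key":
            key = value
        elif kind == "scalar":
            if re.fullmatch(r"[-+]?\d+", value):
                value = int(value)  # this consumer's own typing rule
            result[key] = value
    return result

events = list(parse("---\nname: Brian Ingerson\nage: 21\n"))
data = consume(events)
```

A different consumer could read the same events and keep age as a string.
The parser would not change at all.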

4) Implicit typing is all done outside the parser. The parser just
   passes along enough contextual information so that the consumer can
   make the right choice. When I say consumer, btw, I am thinking
   "Loader", so I'll use that term.

5) There are many, many, many applications where the producer of a YAML
   document is the sole consumer of it. This includes most
   serialization. YAML is after all a serialization language primarily.
   In this case it does not matter what the schema is per se, as long as
   things round trip adequately. (NYN)

6) There are applications like config files where there is no mechanical
   producer, just a single application consuming them. The application
   is usually so specific that the schema is just hardwired into it.

7) There are other applications where documents are produced by one
   source and disseminated about the universe to be consumed by other
   processes. This seems to be the major thinking of XML heads (like
   Oren and Clark). In this case, it makes more sense for the documents
   to have a link of some kind, to a formal schema language that is
   readily understood by any potential consumer.

8) There are other cases where YAML is neither produced nor consumed by
   machines. Like in friendly emails to other YAML-speaking people.
   These use cases aren't really important though for this discussion.
   If the sender messes up the YAML the reader will likely still get it.

9) Tagging nodes in a YAML document is only important because we can't
   resolve schema ambiguity without them. If we could, I'd drop them like
   a lead brick. Tags tend to add a lot of noise to YAML. YAML is about
   eliminating noise.

10) That said, tags can be minimal. They don't need to be globally
    unique until they are consumed by the loader, that is, until they
    are ready to be judged against a schema to determine what they
    really mean. So a tag only needs enough info to clearly disambiguate
    a node of one meaning from a node of another.

11) The prefix systems being discussed currently are used to make tags
    globally unique in the document. Not only is this overkill, it's not
    necessary. I also tend to think it's not even desirable.

12) Oren stated that there needs to be at least one piece of information
    in the document that states how to globally resolve everything. I
    agree with that notion. There needs to be the concept of a GUID
    (Global Universal ID). This is the only piece of information needed
    to resolve everything in the document.

13) A Parser must now report the GUID to the Loader. This is a new
    concept. Clark would have you believe that a parser cooks info from
    the GUID into each tag. I say no way. A parser reports the GUID and
    all the other simple events. Period.

14) Say it with me 3 times. ALL SCHEMA RESOLUTION HAPPENS IN THE LOADER.

15) A document can have 0 or 1 encoded GUIDs. In other words you can
    specify it or not. If you don't then the parser just fires off a start-
    document event with an *empty* GUID. This is valid. Why?

16) Many applications don't need the burden of an encoded GUID. The GUID
    is implied by the application. The serializing program doesn't need
    it to round trip things. It is just set up that way. The program
    isn't *sharing* the document with anyone else. The Loader that
    receives an empty GUID needs to decide for itself which action to
    take. Many times it is already programmed to not need the GUID.

17) Other Loaders absolutely need a GUID. A GUID is just a key to
    finding all the missing info that is needed to load and consume the
    document. If this type of Loader receives no GUID it must raise an
    exception. This would include loaders that consume YAML syndication
    feeds etc. The XML-head stuff.
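
The two kinds of Loader might look something like this. It is only a
sketch: lenient_load, strict_load and MissingGUIDError are invented names,
and the GUID is assumed to arrive as a plain string from the start-document
event, empty if the document did not encode one.

```python
class MissingGUIDError(Exception):
    """Raised by a Loader that cannot work without a GUID."""

def lenient_load(guid, events):
    # The application implies its own schema, so an empty GUID is fine.
    return {"guid": guid or "(implied by application)", "events": list(events)}

def strict_load(guid, events):
    # A Loader consuming shared documents needs the GUID to find the schema.
    if not guid:
        raise MissingGUIDError("document has no GUID; cannot resolve schema")
    return {"guid": guid, "events": list(events)}
```

Either way the parser does the same thing. Only the Loader decides whether
an empty GUID is a problem.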

18) We don't need any directives at all. All extra info including YAML
    version can be found at the other end of the GUID rainbow. 

NOTE: I could be wrong here. Directives are typically consumed by the parser.
At least that was the intent. %YAML, %TABWIDTH, etc. I just don't think we need
any of these right now.

19) The taguri scheme is an acceptable encoding for GUID. We should
    probably stick with it.

There can be all sorts of different ways that programs discover a schema,
given a GUID, including a global registry. This is nice because it can
indicate when two GUIDs are actually compatible.
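
One way a registry could express compatibility, purely as a sketch: map
each GUID to a schema name and call two GUIDs compatible when they map to
the same one. The example.org entries and the schema names are invented
here. No such registry exists.

```python
# Hypothetical registry: each GUID maps to the name of a schema.
# The example.org entries and the schema names are invented for illustration.
REGISTRY = {
    "ingy.net,2004/shiznit/": "person-v1",
    "example.org,2004/people/": "person-v1",
    "example.org,2004/invoices/": "invoice-v1",
}

def compatible(guid_a, guid_b):
    """Two GUIDs are treated as compatible if they name the same schema."""
    schema_a = REGISTRY.get(guid_a)
    return schema_a is not None and schema_a == REGISTRY.get(guid_b)
```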

20) This would allow completely clean YAML. Even when we need to add tag
    crapola. Here is how dirty it would get:

    --- %ingy.net,2004/shiznit/%
    name: ingy
    birthdate: !date March 25th, 1964
    last sneezed: !iso/date 2004/04/01T01:02:03

The tags 'date' and 'iso/date' are simply unique within the document. It
isn't until load time that they are expanded/validated against the schema
implied by the GUID 'ingy.net,2004/shiznit/'. 

What does the schema in question look like? Nobody knows for sure at this
point. But it is definitely aware of the differences between 'date' and
'iso/date', whatever those differences are.
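
Load-time resolution could be sketched like this. SCHEMAS and resolve are
hypothetical names, and since nobody knows yet what 'date' and 'iso/date'
really mean, each handler here just labels the raw string instead of
pretending to parse it.

```python
# Hypothetical schema table, keyed by GUID. What 'date' and 'iso/date'
# really mean is unknown, so each handler only labels the raw string.
SCHEMAS = {
    "ingy.net,2004/shiznit/": {
        "date": lambda raw: ("free-form date", raw),
        "iso/date": lambda raw: ("iso date", raw),
    },
}

def resolve(guid, tag, raw):
    """Loader-side resolution: the parser only reported guid, tag and raw."""
    schema = SCHEMAS.get(guid)
    if schema is None:
        raise LookupError("no schema known for GUID %r" % guid)
    if tag is None:
        return raw  # untagged scalars pass through unchanged
    if tag not in schema:
        raise LookupError("tag %r not defined by schema %r" % (tag, guid))
    return schema[tag](raw)

guid = "ingy.net,2004/shiznit/"
birthdate = resolve(guid, "date", "March 25th, 1964")
sneezed = resolve(guid, "iso/date", "2004/04/01T01:02:03")
name = resolve(guid, None, "ingy")
```

The short tags never get glued onto the GUID. The Loader simply uses both
as lookup keys.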

21) Note that I abandon the directive syntax altogether. That's because
    I feel that this is the only directive needed. In actuality it makes
    good sense to just use the directive syntax though:

    --- %GUID:ingy.net,2004/shiznit/
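
    Pulling the GUID out of a line like that is about all the parser
    would have to do with it. A sketch, assuming the %GUID: form shown
    here and a made-up helper name read_guid:

```python
def read_guid(header):
    """Return the GUID from a '--- %GUID:...' line, or '' if absent."""
    parts = header.split()
    if not parts or parts[0] != "---":
        raise ValueError("not a document start line: %r" % header)
    for part in parts[1:]:
        if part.startswith("%GUID:"):
            return part[len("%GUID:"):]
    return ""  # empty GUID: valid, the Loader decides what to do
```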

22) Let me reiterate! The GUID is *not* concatenated onto the tag by the
    parser. There is NO RESOLUTION IN THE PARSER.

23) Yes the YAML is always ambiguous within the document alone. That's
    the whole point. We can't fit all the semantic information in the
    document without ruining YAML so why even try? A YAML document only
    makes sense at the time it is consumed in a specific context. We can
    force the strictness using a GUID and a formal schema language if we
    want/need to.

24) The cool thing is that we can consider any tag in use today as
    valid. Messy but valid. Something unique to be resolved by a
    schema-aware loader. Just remember the tag isn't really globally
    unique in the document. It needs information alluded to by a GUID.

25) This makes backwards compatibility doable. So we get the best of
    all worlds.

OK I'd better quit for tonight. Very busy schedule this week. Thanks for
your time.

Cheers, Brian
