Thread: [Yaml-core] scalar formats implies 4 models (or we can have 3)


I think I have a few "insights" here, so please read. ;)
To review, there are two items on which we have agreement:

1. We will remove format from the syntax specifications, 
   simplifying the productions to allow both scalars and 
   collections to use #fragments in their tag uri's.  This 
   will better match implementations and allow for version
   numbers to use #fragments.

2. While we are editing, we will change 'taguri' to 'tag',
   to follow the underlying taguri spec we are relying upon.

Let me propose 'tag' for the name of this new !thingy, as
we were having problems picking between "type family" and
"transfer method".   I like tag because it goes with the
taguri spec, but is short, clear and has less baggage than
the other two names (giving us the opportunity to define
what it means more clearly, see below).... it is also just
that, a 'tag' which labels a serialized node with typing
information.

...

Onto the core debate...  what this change did is re-open the 
"model" discussion that we never really successfully closed. 
Let me start with a primer.  We have 4 core models, and the
open issue is whether the fourth model belongs in the spec
or not.   Here are the first three models, and this should
mostly be consensus (I hope):

   syntax    This is the model that closely relates to the
             actual productions.   A tokenizer would have an
             API equivalent to the syntax model.  In this model,
             one is concerned about line breaks, comments, 
             indentation, styles, and other issues relating to 
             human presentation.

   serial    This is the model of YAML as it is pushed through
             a thin pipe in chunks; it breaks scalars into chunks
             and imposes an order on mapping keys.  This model
             explicitly lacks presentation items such as style,
             indentation, and comments.   A parser would have an 
             API substantially similar to the serial model.

   graph     This is a YAML model where an entire document can
             be put into a random access storage.   In this model,
             scalars are unified wholes, mapping keys are unordered.
             However, the actual scalar data is still "strings"
             with a !type-marker.   A generic YAML node interface,
             either having its own storage, or acting as a wrapper
             over native data types would closely match this model.
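
To make the serial/graph distinction concrete, here is a minimal
sketch.  The shapes are my assumptions for illustration only: a serial
scalar is a tag plus a list of text chunks, a serial mapping is an
ordered list of key/value pairs, and a graph scalar is a (tag, text)
pair.  None of these names come from the spec.

```python
from typing import Dict, List, Tuple

# Hypothetical shapes, chosen only for this illustration.
SerialScalar = Tuple[str, List[str]]   # (tag, text chunks in arrival order)
GraphScalar = Tuple[str, str]          # (tag, whole text)


def to_graph_scalar(scalar: SerialScalar) -> GraphScalar:
    """Join the chunks: in the graph model a scalar is a unified whole."""
    tag, chunks = scalar
    return (tag, "".join(chunks))


def to_graph_mapping(
    pairs: List[Tuple[SerialScalar, SerialScalar]],
) -> Dict[GraphScalar, GraphScalar]:
    """Build a mapping whose equality ignores the serial key order."""
    return {to_graph_scalar(k): to_graph_scalar(v) for k, v in pairs}
```

Two serializations that differ only in chunking or key order compare
equal after this step, yet every scalar is still just a string with a
!type-marker.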

The big source of contention is the existence of the fourth
model.  It has been called 'typed', 'native', and several other
names, none of which really capture the essential aspect of the
model... that is, complete awareness of a given !tag by the system.
Let me propose a fresh name for this model, 'target' since this
is the term often used for platform-specific (aka type aware)
assembler linkages.   So, without further ado:

   target    This is the capstone YAML model which is similar to
             the graph model, but implementations will have 
             explicit knowledge of each tag used, and its
             interpretation within this system.   This is often
             called the "native" model, but I am using "target"  
             model here to include systems where the knowledge
             in the system exists to create a canonical form of
             a given data type (see below).   Certainly native
             bindings do this, but other targets could exist;
             a random access, canonical form, generic YAML node
             could also match this model.  The target model 
             extends the graph model to provide strong-equality.

The primary rationale for having models is to let us flesh out our 
vocabulary so that we can better communicate about implementations,
and potential usages (or bad uses) of YAML data.  A secondary, but
still very important rationale, is to provide a clear interpretation
of YAML information so that generic tools, such as a path or schema
language can be developed for YAML without having huge intellectual
weeds growing in our garden.

This latter issue, strangely enough, is best seen through the notion
of node "equality", which is essential to define correctly if one
is to talk about a ypath or schema (without node equality, a path
is quite useless); or even the "obvious" constraint that a mapping
may not have two keys which are equal.  For XML this is simple, nodes 
are equal if their string content is equal.   However, this does not
work as well for YAML, since YAML information is typed.  And, to allow
for human readability, YAML types are further complicated by necessarily
allowing various ways to write a particular type (aka format).

I like to think of this as "weak" vs "strong" equality.  For weak
equality we can only say that two things are equal, we cannot say
that they are unequal.   For example:

   LHS        RHS         weak-equality   strong-equality

   !int 11    !int 11     equal           equal
   !int 11    !int 12     unknown         not equal
   !int 11    !int 0xB    unknown         equal

At best, what the type-ignorant "graph" model can give you is
weak-equality, that is, one could have a mapping 
    { !int 11: eleven, !int 0xB: eleven } 
and there is no way to know that this is invalid YAML unless 
the implementation has detailed awareness of the !int tag and 
its formats.  Specifically, it would have to know that 0xB is
the hex format for the number 11, and therefore the two values
above are actually equal, even though they are serialized 
with different characters.   
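
A minimal sketch of the two notions, assuming a node is a (tag, text)
pair and that !int allows only decimal and 0x-prefixed hex formats
(both assumptions are mine, for illustration):

```python
from typing import Optional, Tuple

Node = Tuple[str, str]  # (tag, text) as held by the type-ignorant graph model


def weak_equal(a: Node, b: Node) -> Optional[bool]:
    """True when tag and text match exactly; None ("unknown") otherwise.

    Weak equality can never answer False: it has no type knowledge.
    """
    return True if a == b else None


def strong_equal_int(a: Node, b: Node) -> bool:
    """Strong equality for !int, which needs knowledge of its formats."""
    # int(text, 0) infers the base from the prefix ("0x" means hex).
    return int(a[1], 0) == int(b[1], 0)
```

Applied to the table rows, weak_equal returns True, None, None and
strong_equal_int returns True, False, True.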

Therefore, a critical constraint of YAML (mapping keys are 
not equal) requires strong-equality.  While weak-equality is
close, it is just not good enough.   So, in this case, we
must have a 'target' model in the YAML specification.  That is,
either a native binding or a canonical form for each scalar
value is needed for YAML and any tools built on top of YAML; 
else we will have weeds and thus bugs in our garden.

...

So.  This actually makes the decision straightforward.  
We have two options:

   1.  We allow for data types to have more than one format
       (even though the format is not in the !tag) and thus
       must have 4 models  (graph != target in this case)

       Good: we have the flexibility of allowing integers
             to be written in different formats, 0xB and 11;
             dates as '10-JAN-2002' or '2002/01/10'.

       Bad:  this gives us one additional model, and the
             extra descriptive complexity that goes with it

OR,

   2.  We state clearly that string comparison for scalar
       values having the same !tag is strong-equality.  And,
       in this case, data types (via a single !tag) may only
       have one format.   In this case we only need 3 models.

       Good: we only need 3 models, and we can validate 
             a YAML document and say that it is compliant
             without needing external type-specific knowledge
             (the method to make a canonical form)

       Bad:  different 'styles' for writing scalars will
             have to be handled by the application programmer,
             ie !int/hex, !int/oct are different types as
             far as YAML is concerned.
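
Option 2 can be sketched as follows.  The (tag, text) node shape, the
BASES table, and app_value are hypothetical names of mine; the point is
only that YAML compares characters while the application does the
conversion:

```python
from typing import Tuple

Node = Tuple[str, str]  # (tag, text)


def yaml_equal(a: Node, b: Node) -> bool:
    """Option 2: same tag and same characters is the whole definition."""
    return a == b


# Application-side knowledge, outside of YAML proper (illustrative tags).
BASES = {"!int": 10, "!int/hex": 16, "!int/oct": 8}


def app_value(node: Node) -> int:
    """Decode a node using the base the application associates with its tag."""
    tag, text = node
    return int(text, BASES[tag])
```

Here ("!int", "11") and ("!int/hex", "0xB") are different keys as far
as YAML is concerned, and only the application knows that both decode
to eleven.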

Thus far, we have done a great job providing a framework where
'content' and 'style' can be cleanly separated.  It would be
very nice to continue this trend over into scalar formats; but
I am not sure if it is worth the complexity -- perhaps this is
the point where we should let the application programmer deal
with the complexity?

The 'graph' vs 'target' model distinction will have a tangible
implementation:  each YAML system will have a 'type registry'
where !tags will be put into the system, probably with either
a method to make the string 'canonical' or to provide a 
native node with equality operator.    Also, if a given !tag
used for mapping keys is not known by the implementation, it 
will be impossible to verify that the YAML document is indeed 
valid.  However, given that YAML systems are _already_ doing 
this, the actual implementation is not much of an additional 
burden (it is part of what we already do).  Thus, the hard part 
will mostly be explaining it in the model.
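
A minimal sketch of such a registry, using the canonical-string
approach.  REGISTRY, check_keys, and the (tag, text) node shape are
hypothetical, and the !int canonicalizer assumes only decimal and
0x-prefixed hex formats:

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

Node = Tuple[str, str]  # (tag, text)

# Hypothetical type registry: tag -> function producing the canonical string.
REGISTRY: Dict[str, Callable[[str], str]] = {
    "!int": lambda text: str(int(text, 0)),  # "0xB" and "11" both give "11"
    "!str": lambda text: text,
}


def check_keys(keys: Iterable[Node]) -> Optional[bool]:
    """Check the keys of one mapping for duplicates.

    Returns False if two keys are equal, True if all keys are provably
    distinct, and None if an unknown tag makes the check unverifiable.
    """
    seen = set()
    verifiable = True
    for tag, text in keys:
        canonical = REGISTRY.get(tag)
        if canonical is None:
            verifiable = False           # unknown tag: only weak equality
            key = (tag, text)
        else:
            key = (tag, canonical(text))
        if key in seen:
            return False                 # duplicate key: invalid mapping
        seen.add(key)
    return True if verifiable else None
```

For the keys !int 11 and !int 0xB this returns False.  For two keys
with an unregistered tag and different text it returns None, which is
the "impossible to verify" case above.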

I am leaning towards 'four' models again.   But not out of 
just being tipsy -- but rather since I now see a clear separation
between the 'graph' and 'target' that is motivated
by tangible "operational" needs; that is, if a YAML document
is invalid, it will not be loadable by some systems (due to
duplicate keys).  And this clarity comes only with bucketing 
canonicalization and building native nodes together as doing 
the same abstract operation -- providing strong equality.

Ok.  I hope this helps to clarify things.  

Best,

Clark
