Thread: [Yaml-core] Of Tree Trunks, Snake Tails and Elephants


Machine Processing and Human Presentation

YAML information can be viewed in two primary ways -- as data 
structures for automated processing ("machine") and as a sequence
of nicely formatted characters representing this data ("human").
A YAML processor is a tool for converting information between 
these views.   This section examines components of YAML information
through the union and intersection of these perspectives.

                      +---------------------------------------------------+
                      | MACHINE PROCESSING                                |
                      |                                                   |
+---------------------|..............................+                    |
| HUMAN PRESENTATION  |  HUMAN/MACHINE OVERLAP       :                    |
|                     |                              :                    |
|                     | YAML is a graph structure    :  Adds data type    |
|                     | composed of sequences,       :  information to    |
|                     | mappings and scalar nodes.   :  each node.        |
|                     |                              :                    |
| Adds formatting     |   Scalars are an ordered     :  Scalars default   |
| information to      |   set of Unicode characters  :  to a string.      |
| each node.          |                              :                    |
|                     |   Sequences are an ordered   :  Sequences are     |
| Adds key order to   |   set of nodes.              :  often an array.   |
| mapping nodes for   |                              :                    |
| better reading.     |   Mappings are an unordered  :  Mappings are      |
|                     |   set of key/value pairs     :  usually loaded    |
| Comments can be     |   where duplicate keys are   :  as hashtables or  |
| used for helpful    |   not allowed.               :  class objects.    |
| human-oriented      |                              :                    |
| details.            +---------------------------------------------------+
|                                                    |
+----------------------------------------------------+

When moving information between these two domains, elements common to
both domains should always be maintained.  That is, sequences, mappings,
and scalars should survive conversion.  However, the machine processing
representation of this information may not be able to maintain the
presentation components, such as key order, comments, and style.  In a
similar manner, for readable text, it is frequently desirable to omit data
typing information which is often obvious to the human reader and not
needed.  This is especially true if the information is created by hand:
expecting humans to bother with data typing detail is optimistic.

There are two general approaches for "round-tripping" information between
these two domains: preservation and schemas.  At a cost in ugliness,
data type properties required for processing can be preserved in the
YAML text file by using an explicit syntax.  In a similar manner,
presentation components, such as key order and style, can be preserved in
the processing domain using shadows, decorators, or other approaches.
Shadowing puts the presentation aspects of a node in a lookup table keyed
by the native object's memory address.  Decorators wrap a native object
to hold key order, style, or other properties and forward all requests
to the underlying object.
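
The two preservation techniques can be sketched in Python as follows.
This is only an illustration under stated assumptions: the names
remember_presentation, presentation_of, and PresentedMapping are
hypothetical, and Python's id() stands in for the memory address (it is
an address in CPython, but in general only an identity that is valid
while the object is alive).

```python
# Shadow: presentation details live in a side table keyed by object identity.
# Hypothetical helper names; id() stands in for the memory address.
_shadow = {}

def remember_presentation(node, key_order=None, style=None):
    """Record presentation properties for a native object."""
    _shadow[id(node)] = {"key_order": key_order, "style": style}

def presentation_of(node):
    """Look up presentation properties; empty if none were recorded."""
    return _shadow.get(id(node), {})


class PresentedMapping:
    """Decorator: wraps a native dict and carries its presentation key order."""

    def __init__(self, target, presentation_order):
        self._target = target
        self.presentation_order = list(presentation_order)

    def __getitem__(self, key):
        # Item access has to be forwarded explicitly.
        return self._target[key]

    def __getattr__(self, name):
        # Called only for attributes not found on the wrapper itself,
        # so keys(), items(), get(), ... reach the underlying dict.
        return getattr(self._target, name)


invoice = {"total": 10, "date": "2001-01-23"}
remember_presentation(invoice, key_order=["date", "total"], style="block")
wrapped = PresentedMapping(invoice, ["date", "total"])
```

The shadow table leaves the native object untouched but needs care when
objects are copied or freed; the wrapper travels with the data but
changes the type that the processing code sees.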

The general problem with preservation is that humans may forget to add the
appropriate type identifiers, or a machine process may add, modify, delete,
or move nodes in such a manner that key order, style, and other preserved
information becomes out-of-date.  Certainly this can be solved by making
the automated process aware of the human domain node properties, but
this may complicate things and unnecessarily tie human presentation concerns
to the processing requirements.  Therefore, while the above are good
solutions, they require each domain to hold information needed only by
the other.

An alternative round-tripping solution is to keep ordering, style, and
typing information in a shared external schema.  In this way, when
converting from a human presentation to the machine processing domain,
the schema can be used to provide typing information which may be absent
from the presentation-oriented text.  Likewise, when converting from a
native machine representation, the schema can be used to restore human
presentation properties.  Since both domains share the same core model
of mappings, sequences, and scalars, a schema can leverage this common
information to determine proper placement of missing properties.  The
downside to a schema is that it requires external information, so that
neither representation is self-contained.
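
A minimal sketch of the schema approach, in Python, might look like the
following.  The schema layout ("key_order" and "types") and the function
names to_machine and to_presentation are assumptions made for this
illustration; they are not part of any YAML schema proposal.

```python
# Hypothetical shared schema: holds what each domain would otherwise
# have to carry on behalf of the other.
SCHEMA = {
    "key_order": ["date", "quantity", "price"],   # presentation property
    "types": {"quantity": int, "price": float},   # processing property
}

def to_machine(text_mapping, schema):
    """Apply schema types to untyped scalars; default to string."""
    types = schema["types"]
    return {key: types.get(key, str)(value)
            for key, value in text_mapping.items()}

def to_presentation(native, schema):
    """Return (key, text) pairs in schema order; unknown keys go last, sorted."""
    order = schema["key_order"]
    known = [key for key in order if key in native]
    extra = sorted(key for key in native if key not in order)
    return [(key, str(native[key])) for key in known + extra]


text = {"price": "9.5", "date": "2001-01-23", "quantity": "3"}
native = to_machine(text, SCHEMA)
pretty = to_presentation(native, SCHEMA)
```

Neither the text nor the native structure carries the other domain's
properties here; both depend on SCHEMA being available.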

YAML's Core Model

An essential aspect of YAML is that it has a core model of mappings,
sequences, and scalars.  This allows for a generic tool set including
such items as:  a path-based node selection tool ("ypath"), a schema
tool used for filling in presentation or processing specific properties
or verifying that a YAML text or internal structure has a particular
shape ("yschema"), and a transformation language for converting one
graph structure to another ("cyatl").

This core model of mappings, sequences, and scalars was chosen
for three reasons.  First, they are common data structures found
in most modern scripting languages -- or, if they are not built-in,
most languages have the ability to simulate them quite easily.
For example, in Python mappings are the dict type, sequences are
the list type, and scalars are the string type.  Likewise, in Perl
the mapping is the hash type and the sequence is the array type.  The
YAML mapping does not have ordered keys and does not allow duplicates
because most of these native data types have the same restrictions.
Further, in most languages a class object can be viewed through
YAML rose-colored glasses as a typed mapping with each of the
class's properties treated as a key.  Likewise, lists of objects
can be viewed as a sequence of typed mappings.
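
As a rough Python illustration of viewing a class object as a typed
mapping:  the Invoice class, the as_typed_mapping helper, and the
"type"/"mapping" field names are hypothetical and are not YAML's own
type syntax.

```python
class Invoice:
    """Example native class; its properties become mapping keys."""

    def __init__(self, date, total):
        self.date = date
        self.total = total


def as_typed_mapping(obj):
    """View an object as a type name plus a mapping of its properties."""
    return {"type": type(obj).__name__, "mapping": dict(vars(obj))}


# A list of objects becomes a sequence of typed mappings.
invoices = [Invoice("2001-01-23", 10), Invoice("2001-01-24", 25)]
as_sequence = [as_typed_mapping(item) for item in invoices]
```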

The second reason is that mappings and sequences are functional,
that is, they are examples of a mathematical function.  Although
this is an academic argument, at the core of mathematics is the
concept of a function and there is not much one can do in mathematics
without it.  For example, most of algebra and category theory, the
basis of modern type theory, rests firmly upon a functional abstraction.
In mathematics, the domain (keys) of a function is not ordered and
does not admit duplicate values.  Thus, YAML has this same restriction,
which happily coincides with the restrictions found in concrete
programmatic constructs such as hashtables and arrays.

The third reason is pragmatic.  YAML had to pick some set of structures
for its primitives.  If there are too many primitives, YAML seems
complex and it complicates tool building.  If there are too few
primitives, then YAML lacks expressive power.  Mappings and sequences
were chosen because they combine very well to make other structures.
An alternative choice (one chosen by XML) is a list of pairs.  While
this may seem simpler because it is one primitive, it is actually quite
a complicated option.  It causes YAML programs and derivative standards
(such as YAML schema) to deal with the complexities of pairings when
they just want a sequence, or to deal with ordering and duplicates when
they just want a mapping.  In short, mappings and sequences are far
more common than a list of pairs, and a list of pairs is easily
constructed as a sequence of mappings.
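
For instance, a list of pairs can be written as a sequence of single-key
mappings.  The Python sketch below shows one way to read such a
structure back as ordered pairs, duplicates included; the iter_pairs
helper is hypothetical.

```python
# A list of pairs expressed as a sequence of one-entry mappings.
# Order and the repeated key "a" are both carried by the sequence.
pair_list = [{"a": 1}, {"b": 2}, {"a": 3}]

def iter_pairs(sequence):
    """Yield (key, value) from each single-key mapping, in sequence order."""
    for mapping in sequence:
        if len(mapping) != 1:
            raise ValueError("each entry must hold exactly one key/value pair")
        ((key, value),) = mapping.items()
        yield key, value

pairs = list(iter_pairs(pair_list))
```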

YAML's Semantic Restrictions

Since YAML information simultaneously has a machine processing
representation and a human presentation image, there is a risk that
properties specific to human presentation bleed over into the
machine processing domain.  While it may seem obvious to some that
using comments, key order, style, and other aspects of the presentation
domain to drive processing is a bad idea, this may not be clear to all.

As described above, YAML's core model has a mapping which does
not use key order and forbids duplicates.  Unfortunately, these
constraints cannot be enforced by a sequential-access parser, which
will not only report key ordering but also allow duplicates.  Thus,
there is significant risk that a new YAML user may not grok this
subtlety and create an application which uses key order not just for
presentation, but also to drive application logic.  This situation
may be worsened if a node wrapper approach is used to preserve
presentation key order.
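
One place where the restriction can be checked is the layer that builds
native mappings from what the parser reports.  The Python sketch below
assumes the parser hands over key/value pairs in document order; the
build_mapping name and that interface are hypothetical.

```python
def build_mapping(pairs):
    """Build a native mapping from parser-reported pairs, rejecting duplicates."""
    mapping = {}
    for key, value in pairs:
        if key in mapping:
            raise ValueError(f"duplicate mapping key: {key!r}")
        mapping[key] = value
    return mapping


# Pairs as a sequential parser might report them, in document order.
reported = [("Dave Lee Ross", "Grover High School"),
            ("Ed Roadhouse", "Lasher High School")]
schools = build_mapping(reported)
```

Application code that only looks values up by key in the result does not
depend on the order in which the pairs were reported.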

This may not be a problem if the application is isolated and the
author of the program completely controls the access to the created
files.  However, a core adoption driver of human-readable serialization
is that the files can be read and used in ways which the original author
never intended.  As such, it is often the case that a YAML data
file may be used by programs which the original author never
anticipated; and this indeed is a good thing.

Therefore, it is recommended that a YAML distribution be clear about
this restriction and encourage application writers to respect it even
though the parser itself won't be able to enforce it.  For example,
it could call a shadow mechanism "presentation" or a wrapper object
could export the key order labeled perhaps "presentation_order".
Ideally, a schema-based approach could be used to help maintain
key order instead of relying upon preservation techniques.

This restriction on key order is based squarely on intent.  A litmus
test to ask is: "If an intermediate processor messed up the key order,
could my application fix it so that the user is not distressed by
the presentation?"  For example, consider the following YAML:

    --- # Debate Winners
    Dave Lee Ross:  Grover High School
    Ed Roadhouse: Lasher High School
    Mark Smith: Andover High School

If it is interpreted as a way to find the high school for a
particular person, then the application is a proper use of YAML.
This intent passes the litmus test: if an intermediate
processor scrambled the key order, the pretty printer could
sort the keys alphabetically without loss of information.

However, if the items above are meant to encode the winners
of a debate contest, where Dave took 1st place and Ed took 2nd,
then there is a problem.  This can be seen when a user asks
who took 3rd place; a "ypath" tool would have no way to
answer this question.  In this case, the information above
would be better represented as:

    --- # Debate Winners
    - name: Dave Lee Ross
      school: Grover High School
    - name: Ed Roadhouse
      school: Lasher High School
    - name: Mark Smith
      school: Andover High School

The distinction is subtle:  if the intent
is to have order significant for processing reasons (not just
for presentation) then the sequence construct should be used.
However, this does not stop keys from being ordered for presentation
purposes, say alphabetically for the human reader or according
to a particular fixed key sequence such as our invoice example.
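
The two readings of the debate data can be compared with native Python
structures standing in for the loaded YAML.  This is only a sketch; the
variable and function names are hypothetical.

```python
# Mapping form: a lookup from person to school.  Key order carries no meaning.
school_by_name = {
    "Dave Lee Ross": "Grover High School",
    "Ed Roadhouse": "Lasher High School",
    "Mark Smith": "Andover High School",
}

# Sequence form: position encodes the ranking.
winners = [
    {"name": "Dave Lee Ross", "school": "Grover High School"},
    {"name": "Ed Roadhouse", "school": "Lasher High School"},
    {"name": "Mark Smith", "school": "Andover High School"},
]

def place(ranked, n):
    """Return the name of the n-th place winner (1-based)."""
    return ranked[n - 1]["name"]

third = place(winners, 3)

# Presentation order for the mapping can always be rebuilt, e.g. alphabetically.
presentation_order = sorted(school_by_name)
```

Only the sequence form can answer "who took 3rd place" without relying
on presentation details.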
