[Yaml-core] YAML as a graph

SourceForge Headquarters 1320 Columbia Street Suite 310 San Diego, CA 92101 +1 (858) 422-6466

I'd like to revive the YPATH thread. Before jumping into details, however,
I'd like to establish some common ground first.

We have talked a lot about "the graph" we are going to use when we define
our generic tools.  From my talks with Clark and Brian it seems that
regardless of the formal defintion, not everyone has the same notion of what
this graph looks like. As the purpose of YPATH is to describe paths in the
graph, this has lead to some misunderstandings.

So, a brief informal tour into graph theory and its application to YAML...

Mathematically, a directed graph is a collection of vertices and edges: G =
(V, E), where each edge has a source and target vertice (E is a subset of V
x V). Applied to YAML, each node is obviously a vertice, and edges are
defined from collections to contained nodes.

In addition, both the edges and the vertices may have arbitrary properties.
For YAML, the node (vertice) properties are defined in the spec - type
family, kind, format, alias, style, etc.

There are some subtelties involved, best described using examples. The most
trivial non-empty document has just one node:

# Document 1
--- &HELLO "hello"
...

And in a graph form there's only one vertice (let's call it after the
alias):

# Graph 1
HELLO

The next step is a simple collection:

# Document 2
--- &MAP
&K1 "k1" : &V1 "v1"
&K2 "k2" : &V2 "v2"
...

Note that K1 and K2 are nodes like any others, and hence deserve their own
graph vertices. In graph form this would NOT be:

# Graph 2.1
MAP --- V1
 \   |
  \  K1
 / \
K2  V2

The above is emphatically _not_ a "directed graph". I suppose that graph
theory has some way to handle this, but my bet is that the way to do it is
to convert it to some normalised graph form, so we might as well save
ourselves the trouble and start with one.

Another way the graph form can NOT take is:

# Graph 2.2
MAP -- K1 -- V1
    \
     \
       K2 -- V2

To show why this doesn't work, let's use a slightly more complex example:

# Document 3
--- &MAP1
&K1 "k1": &MAP2
  &K2 "k2" : &V2 "v2"
*K2: &V3 "v3"
...

The above is a valid YAML document. When we try to cast it into graph form,
we run into trouble:

# Graph 3.1
MAP1 -- K1 -- MAP2 -- K2 -- V2
     \______________/    \
                          \
                            V3

This does not preserve the information that V2 is only accessible through K2
when K2 is accessed through MAP2, and that V3 is only accessible through K2
when K2 is accessed directly through MAP1.

(For the nit-pickers: for scalar nodes, as K2 is above, we could simply
introduce a copy of K2 - call it K3 - and resolve the problem this way.
However, YAML allows for collection keys, for a good reason, so this doesn't
solve the problem in general).

To fully preserve the document semantics the graph must look like:

# Graph 3.2
DOC -- (P0) -- K1
    \       \
     \       \
       (P2)    MAP1
     /     \        \
    /       \        \
 V3           K2 ----- (P1) -- V2

This requires the introduction of "virtual vertices" for each key/value pair
(P0 - P2). Note that for the original simple example, the graph would look
like: 

# Graph 2.3
MAP -- (P0) -- V1
    \       \
     \       \
      (P1)     K1
     /    \
    /      \
 K2          V2

Which is exactly the same as graph 2.1, but with vertices added to
"normalize" the graph. This seems like the simplest thing that can possibly
work.

To be able to describe every YAML document as a graph, we also need to
handle sequence collections:

# Document 4
--- &SEQ
- &V1 "v1"
- &V2 "v2"
...

The simplest possible way to cast this to a normal directed graph is:

# Graph 4.1
SEQ -- V1
    \
     \
       V2

This graph, however, does not preserve the index associated with each entry.
Similarly to the case for mappings, there are several options.

The index can be associated with the edge (similarly to graph 2.1):

# Graph 4.2
SEQ --- V1
 \   |
  \  0
 / \
1    V2

It can be viewed as a virtual vertice for each value (similarly to graph
2.2):

# Graph 4.3
SEQ -- (0) -- V1
    \
     \
       (1) -- V2

Or it can be viewed as a virtual key (similarly to graph 2.3):

# Graph 4.4
SEQ -- (P0) -- V1
    \       \
     \       \
      (P1)     (0)
     /    \
    /      \
(1)          V2

While the latter view is the most complex one, it has a big advantage of
allowing sequences and mappings to be treated uniformly. YPATH must allow us
to select a value nodes in a mapping based on the value of the key node. For
example, it should be possible to obtain the value "1" out of the following
document by asking for the value associated with the key "x":

---
x: 1
y: 2
...

Similarly, it should be possible to obtain the value "a" out of the
following document by asking for the sequence entry with the index 0:

---
- a
- b
...

If we use graph 4.4 to describe the second document, we can use the exact
same YPATH mechanism in both cases.

So much for graph theory.

By definition, YPATH is a mechanism for selecting paths (sequence of edges)
in the graph. Hence we must to agree on what this graph looks like. As you
can see, while it is possible to view any YAML document as a graph, the
result isn't as obvious as we would have like it to be (all these "virtual
vertices").

This raises a meta-issue. Maybe we should re-think the name "graph model"...
perhaps the term "functional model" is more suitable. The problem is, I'm
not aware of a construct that is similar to a "graph" but that uses a
"function" instead of a "vertice" as its basic building block...
Nevertheless, that's the best fit for YAML data (and, by implication, for
most data represented in computer programs). Surely someone did some work on
this? The closest I can come is some vague recollection of the "web"
database model. That was mostly forgotten and then reimplemented by OODBs. I
never saw a good formalization of that... Thoughts?

At any rate, assuming we stick with a graph as the basic construct, I
suggest that we use the view demonstrated by graphs 2.3, 3.2 and 4.4 as a
common ground when discussing YPATH. This view does not promote or constrain
the YPATH syntax/semantics in any way. It is just to allow everyone to
define what the proposed syntax/semantics mean in an unambiguos way.

Have fun,

    Oren Ben-Kiki