| I think what you *must* be thinking of is the *automatic*
| dumper. You hand it an object and say 'Dump this' and
| it must figure out what to do on its own.
This is the common use case: preserve the in-memory
structure via a native->yaml->native round-trip.
| I think you have to let the user decide how to treat
| pointers. It's true that you might offer him some
| policies that you think are likely to be useful most
| of the time, but you must allow him to choose anything
| at all in some cases.
Python doesn't have pointers; it only has references.
Perl has both, but pointers are rarely used and need
only be preserved (thus the !ptr construct). "C" only
has pointers. References are implemented with pointers,
but this is done via specific conventions and/or
intermediate code which implements pointer integrity, etc.
This pointer/reference distinction is necessary since most
of the languages we work with make the distinction.
References differ from pointers in that a reference
is symmetric. Once you have a reference to an object
you have no more / no less rights than any other process
which has a reference. This symmetry isn't true of
pointers in general: pointers can be invalid; references
by definition are never invalid.
| This is an implementation problem: giving the user
| his choice of policies. I can think of a model for
| type structures (must be something like the DTD)
| that the user can tag with 'identity' or 'value'
| for each pointer. This would be pretty general.
| Also, quite difficult to explain to folks . . .
| However, the point is not to put restrictions
| on YAML arising out of *implementation* problems
| in the dumper.
Yes; ideally the schema language (when it gets developed)
will help keep all of this stuff clear. And yes, we don't
want YAML itself to make the decision, although it should,
IMHO, still provide a reference mechanism.
So, I was proposing a limitation on a few distinct
scalar types (!int, !str, !float) and their behavior.
There is nothing then stopping a user from defining
an equivalent type with different behavior if the
default behavior here isn't acceptable.
| How about this? Let's say that anything composed
| of purely basic data types *never* has pointers
| for creating identity. That's only found in
| user types.
I didn't grok this.
| Then there is NO REASON to use the !ptr data type.
| The appearance of an alias implies the pointer.
| It does not appear explicitly in the dump.
Yes; but an alias(reference) is very different from
a pointer. The following are equivalent...
one: &001 value
two: &001 value
Which key had its "value" provided via the anchor and which
key is the alias is *not* informational. The two structures
above are identical in the generic model, since alias
position and key order need not be preserved (i.e. a
process cannot count on these two items being preserved).
However, by contrast, the following are different...
one: &001 value
two: &001 value
Since in the first example, "two" is a pointer to "value",
while "one" is "value". In this way you can think of
everything as using reference semantics; and for YAML
purposes a "shared" pointer is implemented through the
reference mechanism. Of course, the first one above,
is equivalent to...
=: &001 value
| Also, the user knows where to expect pointers,
| since they're part of his user-type.
In Python you don't have pointers; you only have
references. In Perl, the runtime knows where
the pointers ("hard references") are, because it
has a different syntax and stores them differently.
Java is like Python: no pointers, only references.
"C" and a few other languages have pointers,
and in this case the user would have to write
their own dumper/loader pair.
| If the automatic dumper runs across a pointer in a
| purely native structure, it simply
| dereferences it and forgets it ever saw it.
| How's that for a solution?
In Perl, this solution would prevent hard references
(aka pointers) from being preserved; and thus I think
it is a horrible idea.
In "C", the loader/dumper is custom anyway,
so it is up to the programmer. As for C++, I doubt
that the C++ reflection is strong enough to do a
proper "generic" save program. Thus, this is also
up to the programmer.
| (If the dumper encounters a circular chain of references,
| it simply dies. This is time-honored behavior. I have
| seen it in Lisp, Prolog, and one or two others.)
This might be a good solution for a custom C/C++
YAML loader/dumper. In general, cycles occur
in data; I have cycles in many of my Python
structures and automatically saving them is
what YAML is all about. Collections have
pointers to their objects, and quite often
objects refer back to their collections.
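
To make the collection/object cycle concrete, here is a
minimal Python sketch of how a dumper could walk such a
structure without dying: it remembers each container by
id() and notes the ones it reaches a second time, which
are the ones that would need an anchor. The name
find_shared_nodes and the sample data are hypothetical,
invented for this illustration; this is not taken from
any existing YAML implementation.

```python
def find_shared_nodes(root):
    """Return the ids of containers reachable from root more than once."""
    seen = set()    # ids of containers already visited
    shared = set()  # ids of containers reached a second time
    stack = [root]
    while stack:
        node = stack.pop()
        if not isinstance(node, (list, dict)):
            continue  # scalars are ignored in this sketch
        if id(node) in seen:
            shared.add(id(node))  # second visit: would need an anchor
            continue              # do not descend again, so cycles terminate
        seen.add(id(node))
        children = node.values() if isinstance(node, dict) else node
        stack.extend(children)
    return shared


# A collection that points to its objects, and an object that
# refers back to its collection (hypothetical sample data).
collection = {"name": "inventory", "items": []}
item = {"label": "widget", "owner": collection}
collection["items"].append(item)

shared = find_shared_nodes(collection)
```

Here only the collection itself is reached twice (once as
the root, once through the item's "owner" key), so it is
the only node that would be written with an anchor.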
| > - We want to be able to write long scalars - specifically, strings (and
| > binary data) - only once in a YAML file.
| I don't think this is important enough to justify keeping
| aliases if my argument above persuades you.
XML's lack of graph support causes it no end
of grief. Each application does it its
own way, and thus making generic tools that
operate at the graph level is impossible.
The requirement of dealing with graphs is
part of YAML's core and it is one of those
assumptions that is not subject to review.
| As I say above, I think it's necessary to give the user a finer
| choice than the proposals (Never sharing. Sharing on all except
| leaves. Always sharing.)
My suggestion (and I'm still thinking about it...) isn't as
broad a brush. It breaks down into two categories:
(a) Atomic objects, such as int and float, are often stored
using a value and not using an "object" with an identity.
As such, in the graph model we specify that aliases applied
directly to such atomic objects may be dropped. In the YAML
specification we name only two such families which have
atomic behavior -- !int and !float.
(b) String objects, which are often treated as "immutable"
objects and are rarely compared by address. For this
case we specify that strings may be treated as atomic objects,
and thus may gain/lose aliases in the generic model.
Thus, according to the generic model, the two documents
below will be considered identical, since aliases are
- &001 23
- &002 Hello
- &001 23
- &002 Hello
The advantages of this proposal:
a) Part a supports native int, float types in most languages
without requiring that the address (a group of objects sharing
the same alias) be maintained. Otherwise, to meet the
requirement of the graph model, all !int and !float types
with anchors would have to be wrapped.
b) Supports readability by specifying that strings can be
treated as atomic values. In most languages, the specific
address of a string is rarely used. Thus, this is a good
95% compromise. We lose the string's identity, but
gain greater readability.
For Python, the rules of identity for atomic objects are
more or less random. Small integer objects will share
the same memory address (CPython caches a limited range,
-5 through 256 in recent versions), but integer objects
outside that range may be different depending on the
parser state; it may not even be deterministic. Strings
are somewhat similar: most of the time identity is preserved,
although there may be cases where identity isn't preserved.
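
A small Python sketch of the point about integer identity.
The helper same_object is a hypothetical name made up for
this note. It builds the same integer twice at run time
and compares the two results with "is". Whether they turn
out to be one object is an implementation detail of the
interpreter, so the sketch deliberately states no outcome
for the large value; only value equality is guaranteed.

```python
import platform


def same_object(text):
    """Build the same integer twice at run time and compare identity."""
    first = int(text)
    second = int(text)
    return first is second


# True when running on CPython, the reference interpreter.
on_cpython = platform.python_implementation() == "CPython"

small_shared = same_object("23")      # small value: CPython reuses one object
large_shared = same_object("100000")  # large value: identity is not guaranteed
values_equal = int("100000") == int("100000")  # equality always holds
```

This is why a loader cannot promise to preserve the
identity of an anchored !int in Python without wrapping it.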
| Not that the proposals aren't good 99% of the time (if the
| user can choose between them), but *sometime* he'll need
| to make finer-grained choices.
This is what we are trying to do; the above "restrictions" on
the base !float, !int, and !str type are probably what we want
99% of the time. And in the cases where you don't want this
property... you can make your own custom types.
Anyway. The alternative is to be "strict" by default, and
allow a "dumper" option to "de-normalize" the graph by
duplicating alias values, and allow a "loader" option to
"normalize" strings by interning them. This would give
the greatest readability, but would not be the default.
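
As a rough Python sketch of those two options (all names
here, normalize, needs_anchor and the strict flag, are
hypothetical and only for illustration): the loader side
interns every string it finds, and the dumper side decides
whether a node seen more than once gets an anchor or is
simply written out again.

```python
import sys

ATOMIC = (int, float, str)  # the !int, !float and !str families


def normalize(node, seen=None):
    """Relaxed loader step: intern string values in nested lists/dicts.

    Containers are updated in place; `seen` guards against cycles.
    """
    seen = set() if seen is None else seen
    if type(node) is str:  # sys.intern accepts exact str objects only
        return sys.intern(node)
    if isinstance(node, (list, dict)) and id(node) not in seen:
        seen.add(id(node))
        keys = node.keys() if isinstance(node, dict) else range(len(node))
        for key in list(keys):
            node[key] = normalize(node[key], seen)
    return node


def needs_anchor(node, times_seen, strict=True):
    """Dumper policy: should a node seen `times_seen` times get an anchor?"""
    if times_seen < 2:
        return False
    if strict:
        return True  # strict: every shared node keeps its identity
    return not isinstance(node, ATOMIC)  # relaxed: scalars are duplicated


# Two equal strings built separately at run time.
first = "".join(["Hel", "lo"])
second = "".join(["Hel", "lo"])
doc = [first, 23, {"greeting": second}]
normalize(doc)
strings_shared = doc[0] is doc[2]["greeting"]
```

After normalize the two "Hello" values are one interned
object, and with strict=False the dumper would write a
shared scalar twice instead of emitting an anchor and alias.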
I suppose we could keep "strict" as the default option,
but let the loader/dumper use a relaxed model as a non-default
option... perhaps this is the principle of least surprise?