RE: FW: [Yaml-core] A test

Brian Ingerson [mailto:in...@tt...] wrote:
> I thought we had agreed that:
>  
>     key:
>         !: classname
>         "!": just another key
> 
> would be parsed into an applications map as (possibly):
> 
>     key:
>         __!__: classname
>         "!": just another key
> 
> You just can't use "__!__" for a key, or whatever the 
> implementor decides.

Well, we never agreed on such a mapping; it was raised
as a possible mapping for '!class'. As far as I recall
it falls under Clark's "no shorthands!" approach...

On the other hand, I think that the re-write isn't really
necessary, assuming the low-level API knows whether it has
seen "!" as opposed to !. See below...

> > > I disagree. If the application can control how the text is 
> > > emitted (simple vs. quoted) and can also tell which form the
> > > parser encountered, then the heuristics for numbers could be
> > > put at the application level. There would be a suggested
> > > implementation (like yours below) but it would not be mandatory.
> > > The default would always be string. That means that some data
> > > might not round trip between Python and Foo. But how could it
> > > anyway? Foo does not support integers. :)
> > 
> > I'm unaware of any language which doesn't have the concept
> > of integers and floats :-)
> 
> A) I think that there *are* languages that only have strings. 
> What about shell scripts? I think Tcl at least used to be all
> strings. There may be others in the future.

Shell doesn't have references either. Or nulls. So you'd want
to make them part of this optional layer as well. Well, using
"everything is a string" is my favorite approach (even if
referencing/typing a scalar becomes a bit more awkward). Are
you willing to go that far? If so, see below...

> > How would the application tell what form the parser has
> > encountered, if the information model only allows it to return
> > one of int/float/string/null? Surely not by some meta-information
> > attached to the data - that would bring us right back to the
> > 'class' magical attribute. Same problem exists for quoting "!" as
> > a key, etc.
> 
> No, the low level API needs to return what type of quoting 
> was used. The mid level API will decide what data type to
> encode it in for that particular language. We will of course
> have guidelines that should be followed, similar to the patterns
> you posted. The end user API will not care. It will just
> expect that the correct thing was done if the languages are 
> homogeneous, and that the best possible thing was done if the
> languages are heterogeneous. 

It seems we are saying the same thing, then. What you call
"the low level API" is what I called "the application being
able to tell the parser what to do". What you call "the
middle level API" is what I called "the YAML parser API",
and of course we both agree that the high level API is
just the native data structures of the languages (with
some simple load/save operations).

So, one way to do YAML is:

- In the low level API, everything is a string. Including
'~', '*12', etc. Scalar style is reported.

Call this the "lexer" level. A token would be something
like: (indent-level, style, text) where "indicator" would
be one of the "styles". So:

key: value
another:
    key: |
        code

Would become:

    (0, unquoted, key)
    (0, indicator, :)
    (0, unquoted, value)
    (0, unquoted, another)
    (0, indicator, :)
    (1, unquoted, key)
    (1, indicator, :)
    (1, indicator, |)
    (2, block, code)

... or something like that, anyway - a good, solid basis
to work with, no matter how you intend to process the
data. I like it!
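
To make the token stream concrete, here is a minimal Python
sketch of such a lexer. It is only an illustration, not a
proposed implementation: the name lex, the four-space indent
width and the style names are assumptions, and it handles
just unquoted "key: value" lines and | blocks (no quoting,
no lists, no @ or %).

```python
from typing import Iterator, Tuple

Token = Tuple[int, str, str]  # (indent-level, style, text)

INDENT_WIDTH = 4  # assumption: one indent level is four spaces


def lex(source: str) -> Iterator[Token]:
    """Yield (indent-level, style, text) tokens for a tiny subset."""
    in_block = False   # True while consuming the lines of a | block
    block_level = 0    # indent level of the key that opened the block
    for line in source.splitlines():
        if not line.strip():
            continue  # this sketch simply skips blank lines
        level = (len(line) - len(line.lstrip(" "))) // INDENT_WIDTH
        text = line.strip()
        if in_block and level > block_level:
            # Block content is reported as-is, with no type processing.
            yield (level, "block", text)
            continue
        in_block = False
        key, sep, rest = text.partition(":")
        if not sep:
            yield (level, "unquoted", text)
            continue
        yield (level, "unquoted", key.strip())
        yield (level, "indicator", ":")
        rest = rest.strip()
        if rest == "|":
            yield (level, "indicator", "|")
            in_block, block_level = True, level
        elif rest:
            yield (level, "unquoted", rest)
```

Feeding it the four-line example above yields the same nine
tokens, in the same order.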

- A middle layer, under application control, converts
this into a native data model. It handles ints, floats,
nulls, references and classes. "!" is left unchanged,
! is used to look for a class name. "&" is left unchanged,
& is used to define an anchor. "~" is left unchanged,
~ becomes a null. No rewrite necessary!

Call this the "parser" level. Easy to write and to extend
(a generic implementation could allow an application to
attach "handlers" for certain key or value patterns).
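
A rough Python sketch of the "handlers" idea, for scalar
values only. The names ScalarResolver, add_handler and
default_resolver are made up for illustration, and the
int/float patterns are simplified guesses rather than the
agreed ones. The point is only that quoted text is never
touched, while unquoted text is offered to each handler in
turn.

```python
import re
from typing import Any, Callable

Handler = Callable[[str], Any]


class ScalarResolver:
    """Turn (style, text) scalars into native values via handlers."""

    def __init__(self) -> None:
        self.handlers = []  # list of (compiled pattern, handler) pairs

    def add_handler(self, pattern: str, handler: Handler) -> None:
        self.handlers.append((re.compile(pattern), handler))

    def resolve(self, style: str, text: str) -> Any:
        if style != "unquoted":
            return text  # "~", "12", "!" etc. stay plain strings
        for pattern, handler in self.handlers:
            if pattern.fullmatch(text):
                return handler(text)
        return text  # the default is always string


def default_resolver() -> ScalarResolver:
    """Assumed defaults: null, integer and float patterns."""
    resolver = ScalarResolver()
    resolver.add_handler(r"~", lambda text: None)
    resolver.add_handler(r"[-+]?[0-9]+", int)
    resolver.add_handler(r"[-+]?[0-9]+\.[0-9]+", float)
    return resolver
```

An application that wants more types would just call
add_handler again with its own pattern and conversion
function; a language with only strings would register no
handlers at all.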

- The application works with the native data model.

- A middle layer, under application control, converts
this into events. Literal !, & and ~ strings are forced
to be quoted (see, no re-write necessary!). Unquoted !, &
and ~ are inserted to reflect classes, anchors and nulls.

Call this the "serializer" level.
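
For scalar values, the style decision could look roughly
like the Python sketch below. The function name
scalar_event, the style names, the indicator set and the
number pattern are all assumptions; it only accepts None,
int, float and str, and it leaves escaping to the printer.

```python
import re
from typing import Tuple, Union

# Assumed pattern for text a parser might read back as a number.
NUMBER_PATTERN = re.compile(r"[-+]?[0-9]+(\.[0-9]+)?")
# Assumed set of characters that are special at the start of a scalar.
INDICATORS = "!&~*@%|#:'\""


def scalar_event(value: Union[None, int, float, str]) -> Tuple[str, str]:
    """Return a (style, text) pair for one native scalar value."""
    if value is None:
        return ("unquoted", "~")        # unquoted ~ means null
    if isinstance(value, (int, float)):
        return ("unquoted", repr(value))
    # A string that could be mistaken for something else is quoted,
    # so it comes back as the same string when parsed.
    if (value == "" or value[0] in INDICATORS
            or NUMBER_PATTERN.fullmatch(value)):
        return ("double", value)
    return ("unquoted", value)
```

So the null and the one-character string "~" come out
differently, and no key or value has to be renamed.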

- The low level events, including explicit style
instructions, are used to generate an output file.

Call this the "printer" level.

If that is correct, I'm with you. I especially like it
because it allows us to define a really low-level lexer
level API which we can start working with, without having
to enter into the details of an incremental parser-level
API (which is a very big deal).

The only down side: anchors and classes can't be given
to lists. And attaching them to scalars is problematic.
They can only be easily given to maps (using ! and &
keys). Of course, one can always wrap a list or a scalar
in a map which will be stripped by the parser...

At any rate, are we in agreement here? 

> We will have to bite the lookahead bullet, yes. But as I've 
> always said, "don't burden the user just to save a little
> complexity from the implementation side". Once we write the
> code, we'll get to use an excellent syntax. We won't even
> remember whether or not it was hard to do.

What bothers me isn't the complexity. It is that if we do want
to have a streaming YAML and handle large scalar values, we'll
have an efficiency problem. Of course, an implementation may
simply choose to limit the lookahead. If no : shows up within
BUFSIZE characters, it will treat the text as a value and hope
for the best. This BUFSIZE must not be too small - say we require
it to be at least 512 characters? And one can always use % to
force the issue.
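
A small Python sketch of what I mean by limiting the
lookahead. The function name classify_first_line and the
"map"/"scalar" labels are invented here, and the colon test
is deliberately naive (it ignores quoting); the only point
is that at most BUFSIZE characters are ever buffered.

```python
import io
from typing import TextIO, Tuple

BUFSIZE = 512  # assumed minimum lookahead, in characters


def classify_first_line(stream: TextIO,
                        bufsize: int = BUFSIZE) -> Tuple[str, str]:
    """Guess whether the upcoming text starts a map or a scalar.

    Reads at most bufsize characters and returns (kind, consumed),
    where consumed is the text that was read, so the caller can
    still process it.
    """
    consumed = stream.read(bufsize)
    first_line = consumed.split("\n", 1)[0]
    # No colon within the buffered part of the first line: treat the
    # text as a value and hope for the best.
    kind = "map" if ":" in first_line else "scalar"
    return kind, consumed
```

A key longer than the buffer is then misread as a scalar,
which is exactly the case where an explicit % would be
needed.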

> >     key: ''This value 'starts \and/ ends' with a single ''
> 
> Yes. This is a good thing. Data::Denter has actually always 
> worked that way. It was just a momentary lapse :)

OK :-) Note that this means there's no good reason to have
both ' and " to quote strings-with-escapes. " is enough.
Using ' would be confusing to the Perl people (they'd expect
quoting not to be done).

Instead, let's make ' optional around simple
scalars. This way, " means escapes are done, ' means they
aren't. And ' is optional, to be used only when necessary
(first character is an indicator, etc.)

> I really don't care what the sigil is. I just won't allow swapping the
> '-'ish thing for a trailing indicator. It gives me the feeling that the
> ending quote is part of the data. And what if the data does end with a
> tick?

Not a problem:

Block: '
    This block ends with a single ''
Just like:
    "This string ends with a single ""

But I'm starting to warm to your POV...

> What I really don't like is the quote starting on a different 
> line than the data. That will just confuse the heck out of
> people. That is never done in programming. Why do it in YAML?

Hmmm. That is a good point. OK, I'm convinced (see below
for a proposal).

> > - In your proposal, it is impossible to write a multi-line
> >   unescaped simple scalar that starts at the next line. In mine
> >   it is possible. Point for me?
> 
> Not true AFAIK.
> 
>     key:
>         simple
>         scalar
> 
> As long as there is no colon.

Ah, I see. No colon in the first line only, because we
don't allow multi-line simple-scalar keys, right?

key:
    this is not
    a key : value
key:
    this is a key : value

Hmmm.

> Sorry if I seem one-sided. I just want to get coding. I (and 
> others) need YAML for real life projects three weeks ago. I
> think the 3 quotes idea is going in the wrong direction. Please
> consider my proposals, and offer me a counter. I'm going out for
> lunch for 2-3 hours. I'll be back after that.
> 
> Let's get this thing locked :)

OK, here's a compromise. It is 99% your syntax, 1% mine:

We use a lexer/parser/application layering along
the lines I described above.

At the lexer level:

- |, _- : block, as today's blocks.
- "quoted text" : as today's quoted string.
- 'quoted text' : as today's quoted string, minus the escaping.
- unquoted text : is the default. Is the same as 'quoted text',
  but is subject to type processing by the parser.
- @ is optional. Only required if the list is empty.
- % is optional. Only required if the map is empty,
  or if the first key in the map is "too large" (limit
  to be no smaller than 512 characters).
- A scalar value starting at the next line must be quoted
  if it contains : anywhere in the first line.
- *Every* scalar is a string.
- There are no shorthands.
- The information model is map/list/string.

At the parser level:

Mapping from this model to the native data structures
of a given language is application-specific. However,
the following conventions are common to a wide range
of applications and languages and their use is recommended
(*not* required):

- unquoted ~ becoming a null.
- unquoted (int-pattern) used for specifying an integer.
- unquoted (float-pattern) used for specifying a float.
- unquoted *(int-pattern) used for specifying a reference.
- unquoted & key used for specifying an integer anchor.
- unquoted ! key used for specifying a class.
- unquoted # key used for specifying a comment.
  The parser may be told to strip away comments;
  useful for an application which isn't going to
  write the file back, and wants "just the data",
  e.g. an HTTP server reading its configuration file.
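
To show how the key conventions fit together, here is a
hedged Python sketch that walks a map/list/string tree in
which every scalar still carries its style. Everything
named here is an assumption for illustration: the Scalar
tuple, the to_native function, the anchors dictionary and
the classes table of constructors. It assumes anchors are
defined before they are referenced, always strips comments,
and leaves int/float/null conversion out.

```python
from typing import Any, Callable, Dict, NamedTuple


class Scalar(NamedTuple):
    style: str  # e.g. "unquoted", "single", "double", "block"
    text: str


def to_native(node: Any, anchors: Dict[str, Any],
              classes: Dict[str, Callable[[dict], Any]]) -> Any:
    """Convert a lexer-level tree into native Python data."""
    if isinstance(node, Scalar):
        if node.style == "unquoted" and node.text.startswith("*"):
            return anchors[node.text[1:]]  # reference to an anchor
        return node.text
    if isinstance(node, list):
        return [to_native(item, anchors, classes) for item in node]
    result: dict = {}
    anchor = None
    class_name = None
    for key, value in node.items():
        special = key.style == "unquoted"
        if special and key.text == "&":
            anchor = value.text
        elif special and key.text == "!":
            class_name = value.text
        elif special and key.text == "#":
            continue  # comment key: stripped in this sketch
        else:
            # Quoted "!", "&" and "#" are just ordinary keys.
            result[key.text] = to_native(value, anchors, classes)
    factory = classes.get(class_name)
    obj = factory(result) if factory else result
    if anchor is not None:
        anchors[anchor] = obj
    return obj
```

Because the style travels with each key, a map can hold both
an unquoted ! (the class) and a quoted "!" (just another
key) without either one being renamed.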

Additional keys and value patterns may be used in an
application-specific way. For example: enhanced reference
mechanisms (path-based, URL-based) and additional value
types (date, currency, big-int, binary, etc.).

If use of any such enhancement becomes widespread enough it
may be included in an updated version of the "recommended"
list above.

Agreed?

Have fun,

    Oren Ben-Kiki