On 08/08/01 17:09 +0200, Oren Ben-Kiki wrote:
> Brian Ingerson [mailto:ingy@...] wrote:
> > I thought we had agreed that:
> >
> > key:
> > !: classname
> > "!": just another key
> >
> > would be parsed into an application's map as (possibly):
> >
> > key:
> > __!__: classname
> > "!": just another key
> >
> > You just can't use "__!__" for a key, or whatever the
> > implementor decides.
>
> Well, we never agreed on such a mapping - It was raised
> as a possible mapping for '!class'. As far as I recall
> it falls under Clark's "no shorthands!" approach...
Actually, Clark wanted this IIRC.
>
> On the other hand, I think that the re-write isn't really
> necessary, assuming the low-level API knows whether it has
> seen "!" as opposed to !. See below...
Correct. This is an implementation decision at the parser level.
> > A) I think that there *are* languages that only have strings.
> > What about shell scripts? I think Tcl at least used to be all
> > strings. There may be others in the future.
>
> Shell doesn't have references either. Or nulls. So you'd want
> to make them part of this optional layer as well. Well, using
> "everything is a string" is my favorite approach (even if
> referencing/typing a scalar becomes a bit more awkward). Are
> you willing to go that far? If so, see below...
Yes. I am willing (and eager) to go that far. :)
> It seems we are saying the same thing, then. What you call
> "the low level API" is what I called "the application being
> able to tell the parser what to do". What you call "the
> middle level API" is what I called "the YAML parser API",
> and of course we both agree that the high level API is
> just the native data structures of the languages (with
> some simple load/save operations).
>
> So, one way to do YAML is:
>
> - In the low level API, everything is a string. Including
> '~', '*12', etc. Scalar style is reported.
>
> Call this the "lexer" level. A token would be something
> like: (indent-level, style, text) where "indicator" would
> be one of the "styles". So:
>
> key: value
> another:
>   key: |
>     code
>
> Would become:
>
> (0, unquoted, key)
> (0, indicator, :)
> (0, unquoted, value)
> (0, unquoted, another)
> (0, indicator, :)
> (1, unquoted, key)
> (1, indicator, :)
> (1, indicator, |)
> (2, block, code)
>
> ... or something like that, anyway - a good, solid basis
> to work with, no matter how you intend to process the
> data. I like it!
Yes. Exactly.
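A minimal sketch of such a lexer, assuming space-only indentation, one key per line, and `|` as the only block indicator. The names `lex`, `scalar_style` and `KEY_LINE` are made up for illustration, and quote characters are kept in the token text:

```python
import re

# Hypothetical pattern: "key:" optionally followed by whitespace and a value.
KEY_LINE = re.compile(r"^(?P<key>[^:]+):(?:\s+(?P<value>.*))?$")

def scalar_style(text):
    """Report the scalar style; the text itself is never converted."""
    if len(text) >= 2 and text[0] == text[-1] and text[0] in "\"'":
        return "quoted"
    return "unquoted"

def lex(source):
    """Return a list of (indent_level, style, text) tokens."""
    tokens = []
    indents = [0]        # stack of indentation columns seen so far
    block_col = None     # column of the line that opened a '|' block
    for raw in source.splitlines():
        if not raw.strip():
            continue
        col = len(raw) - len(raw.lstrip(" "))
        body = raw.strip()
        if block_col is not None:
            if col > block_col:
                # Block lines sit one level below the key that opened them.
                tokens.append((len(indents), "block", body))
                continue
            block_col = None
        while col < indents[-1]:
            indents.pop()
        if col > indents[-1]:
            indents.append(col)
        level = len(indents) - 1
        match = KEY_LINE.match(body)
        if match is None:
            tokens.append((level, scalar_style(body), body))
            continue
        key = match.group("key").strip()
        tokens.append((level, scalar_style(key), key))
        tokens.append((level, "indicator", ":"))
        value = match.group("value")
        if value == "|":
            tokens.append((level, "indicator", "|"))
            block_col = col
        elif value:
            tokens.append((level, scalar_style(value), value))
    return tokens
```

Fed the four-line sample from the quoted text (with the nested lines indented), this produces the same nine tokens as listed there.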
> - A middle layer, under application control, converts
> this into a native data model. It handles ints, floats,
> nulls, references and classes. "!" is left unchanged,
> ! is used to look for a class name. "&" is left unchanged,
> & is used to define an anchor. "~" is left unchanged,
> ~ becomes a null. No rewrite necessary!
>
> Call this the "parser" level. Easy to write and to extend
> (a generic implementation could allow an application to
> attach "handlers" for certain key or value patterns).
Sounds good. Great actually.
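A rough sketch of the "handlers" idea for scalar values, assuming the lexer hands over a style plus the text with any quote characters already removed. The `HANDLERS` table, the `resolve` name and the exact int/float patterns are assumptions, not anything we have agreed on:

```python
import re

# Hypothetical handler table: (pattern, converter). An application could
# append its own entries (dates, currency, ...) ahead of or after these.
HANDLERS = [
    (re.compile(r"^~$"), lambda text: None),   # unquoted ~ becomes a null
    (re.compile(r"^[-+]?\d+$"), int),          # assumed int pattern
    (re.compile(r"^[-+]?\d+\.\d+$"), float),   # assumed float pattern
]

def resolve(style, text, handlers=HANDLERS):
    """Convert an unquoted scalar to a native value; leave the rest alone."""
    if style != "unquoted":
        return text            # "~", "12" etc. stay plain strings
    for pattern, convert in handlers:
        if pattern.match(text):
            return convert(text)
    return text
```

Because the style decides whether the handlers run at all, a quoted `~` needs no rewrite to survive as a string.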
> - The application works with the native data model.
>
> - A middle layer, under application control, converts
> this into events. !, & and ~ forced to be quoted (see,
> no re-write necessary!). unquoted ! & ~ are inserted
> to reflect classes, anchors and nulls.
>
> Call this the "serializer" level.
>
> - The low level events, including explicit style
> instructions, are used to generate an output file.
>
> Call this the "printer" level.
>
> If that's correct, I'm with you. I especially like it
> because it allows us to define a really low-level lexer
> level API which we can start working with, without having
> to enter into the details of an incremental parser-level
> API (which is a very big deal).
>
> The only down side: anchors and classes can't be given
> to lists. And attaching them to scalars is problematic.
> They can only be easily given to maps (using ! and &
> keys). Of course, one can always wrap a list or a scalar
> in a map which will be stripped by the parser...
>
> At any rate, are we in agreement here?
Yes. Definitely. In fact, I think I may be able to use this to define a
complete round-tripping implementation for Perl that could still pass through
Python. Data::Denter::YAML is (maybe) back! :)
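For the scalar part of round-tripping, the "serializer" level could look roughly like this. It is only a sketch: `serialize_scalar`, the `NEEDS_QUOTES` pattern and the set of leading indicator characters are assumptions for illustration:

```python
import re

# Strings that would be misread by the parser level if left unquoted:
# a leading indicator character, or something that looks like a number.
NEEDS_QUOTES = re.compile(r"^[~!&*#@%|]|^[-+]?\d+(\.\d+)?$")

def serialize_scalar(value):
    """Turn a native value into a (style, text) pair for the printer."""
    if value is None:
        return ("unquoted", "~")            # null is written as a bare ~
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return ("unquoted", repr(value))    # numbers are written bare
    text = str(value)
    if text == "" or NEEDS_QUOTES.match(text):
        return ("quoted", text)             # force quotes, no rewrite needed
    return ("unquoted", text)
```

The string "12" comes out quoted and the integer 12 comes out bare, so a parser that only type-converts unquoted scalars gets back what went in.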
> > We will have to bite the lookahead bullet, yes. But as I've
> > always said, "don't burden the user just to save a little
> > complexity from the implementation side". Once we write the
> > code, we'll get to use an excellent syntax. We won't even
> > remember whether or not it was hard to do.
>
> What bothers me isn't the complexity. It is that if we do want
> to have a streaming YAML and handle large scalar values, we'll
> have an efficiency problem. Of course, an implementation may
> simply choose to limit the lookahead. If a value is larger than
> BUFSIZE it will treat it as a value and hope for the best. This
> BUFSIZE must not be too small - say we limit it to more than 512
> characters? And one can always use % to force the issue.
I think we are prematurely optimizing. Actually, I'm stealing the words
of my Swedish friend Artur. That's what he thinks after being told of the
situation. Anyway, I agree that we can use these tricks to "force the
issue" if need be. Artur's wisdom suggests doing many different
implementations so benchmarking can tell the true tale.
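To make the bounded-lookahead idea concrete, here is a small sketch. It assumes a seekable text stream and that "is this a key?" boils down to finding a colon on the first line within the window; `looks_like_key` and the default of 512 are illustrative only:

```python
import io

BUFSIZE = 512  # assumed lower bound taken from the discussion above

def looks_like_key(stream, bufsize=BUFSIZE):
    """Peek at most bufsize characters to guess whether a key starts here.

    If no colon shows up on the first line within the window, the text is
    treated as a plain value ("hope for the best").
    """
    start = stream.tell()
    window = stream.read(bufsize)
    stream.seek(start)                 # leave the stream where we found it
    first_line = window.split("\n", 1)[0]
    return ":" in first_line

# Usage: a short key is detected, an oversized one is not.
short = io.StringIO("key: value\n")
huge = io.StringIO("x" * 600 + ": value\n")
print(looks_like_key(short), looks_like_key(huge))
```

Benchmarking several implementations would show whether the limit is ever worth having.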
> > > key: ''This value 'starts \and/ ends' with a single ''
> >
> > Yes. This is a good thing. Data::Denter has actually always
> > worked that way. It was just a momentary lapse :)
>
> OK :-) Note that this means there's no good reason to have
> both ' and " to quote strings-with-escapes. " is enough.
> Using ' would be confusing to the Perl people (they'd expect
> quoting not to be done).
>
> Instead, let's allow ' to be optionally allowed around simple
> scalars. This way, " means escapes are done, ' means they
> aren't. And ' is optional, to be used only when necessary
> (first character is an indicator, etc.)
Yes. That's exactly what I want.
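A sketch of the two quoting rules, assuming the outer quote pair is simply the first and last character and that double quotes support only a handful of backslash escapes. The `ESCAPES` table is an assumption; we haven't listed the real escapes here:

```python
# Assumed escape set for double-quoted scalars.
ESCAPES = {"n": "\n", "t": "\t", "\\": "\\", '"': '"'}

def unquote(token):
    """Strip one pair of outer quotes; only double quotes process escapes."""
    if len(token) >= 2 and token[0] == token[-1] == "'":
        return token[1:-1]                 # single quotes: taken literally
    if len(token) >= 2 and token[0] == token[-1] == '"':
        out = []
        chars = iter(token[1:-1])
        for ch in chars:
            if ch == "\\":
                nxt = next(chars, "")
                # Unknown escapes are kept verbatim in this sketch.
                out.append(ESCAPES.get(nxt, "\\" + nxt))
            else:
                out.append(ch)
        return "".join(out)
    return token                           # unquoted scalar: unchanged
```

With this, the earlier example value wrapped in two ticks at each end comes back starting and ending with a single tick, and `\and/` is left untouched.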
> > I really don't care what the sigil is. I just won't allow swapping the
> > '-'ish thing for a trailing indicator. It gives me the feeling that the
> > ending quote is part of the data. And what if the data does end with a
> > tick?
>
> Not a problem:
>
> Block: '
> This block ends with a single ''
Or does it end with two ticks and a newline? :)
> Just like:
> "This string ends with a single ""
>
> But I'm starting to warm to your POV...
Cool. :)
> > What I really don't like is the quote starting on a different
> > line than the data. That will just confuse the heck out of
> > people. That is never done in programming. Why do it in YAML?
>
> Hmmm. That is a good point. OK, I'm convinced (see below
> for a proposal).
Rock on!
> > > - In your proposal, it is impossible to write a multi-line
> > > unescaped simple scalar that starts at the next line. In mine
> > > it is possible. Point for me?
> >
> > Not true AFAIK.
> >
> > key:
> > simple
> > scalar
> >
> > As long as there is no colon.
>
> Ah, I see. No colon in the first line only, because we
> don't allow multi-line simple-scalar keys, right?
That's fine by me. But I had actually thought that it *was* allowed.
> key:
> this is not
> a key : value
I had thought that this *was* a key/value, as opposed to:
key:
'this is not
a key : value'
I'll be happy going either way. Double quotes should be fine for multi-line
keys.
> OK, here's a compromise. It is 99% your syntax, 1% mine:
>
> We use a lexer/parser/application layering along
> the lines I described above.
Definitely.
> At the lexer level:
>
> - |, _- : block, as today's blocks.
Is this a typo? Or is '_-' what you want? As I said before, I really don't
care what the indicators are, as long as the semantics remain.
Other options:
== =-
== =~
<< <-
|+ |-
=+ =-
> - "quoted text" : as today's quoted string.
Yes.
> - 'quoted text' : as today's quoted string, minus the escaping.
Is that the same as today's simple scalar?
> - unquoted text : is the default. Is the same as 'quoted text',
> but is subject to type processing by the parser.
Simple Scalar. Right? I guess we'll have to get our terminology straight.
> - @ is optional. Only required if the list is empty.
> - % is optional. Only required if the map is empty,
Yes. yes.
> or if the first key in the map is "too large" (limit
> to be no smaller than 512 characters).
Maybe premature at this point, but OK.
> - A scalar value starting at the next line must be quoted
> if it contains : anywhere in the first line.
OK. So no unquoted multi-lines as keys. Right?
> - *Every* scalar is a string.
Yes!
> - There are no shorthands.
True.
> - The information model is map/list/string.
>
> At the parser level:
>
> Mapping from this model to the native data structures
> of a given language is application-specific. However,
> the following data structures are common to a wide range
> of applications and languages and their use is recommended
> (*not* required):
>
> - unquoted ~ becoming a null.
> - unquoted (int-pattern) used for specifying an integer.
> - unquoted (float-pattern) used for specifying a float.
> - unquoted *(int-pattern) used for specifying a reference.
> - unquoted & key used for specifying an integer anchor.
> - unquoted ! key used for specifying a class.
> - unquoted # key used for specifying a comment.
> The parser may be told to strip away comments;
> useful for an application which isn't going to
> write the file back, and wants "just the data",
> e.g. an HTTP server reading its configuration file.
Excellent!
Let's reserve all funny characters. Sound OK?
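As a sketch of how the parser level might treat the special keys, assume a map arrives as a list of (key_style, key_text, value) triples from the lexer. The function name and the `meta`/`data` split are made up for illustration:

```python
# Unquoted keys with a recommended meaning at the parser level.
SPECIAL_KEYS = {"!": "class", "&": "anchor", "#": "comment"}

def split_special_keys(pairs, keep_comments=True):
    """Separate class/anchor/comment keys from ordinary map entries.

    pairs: list of (key_style, key_text, value) triples.
    Returns (meta, data); quoted "!" etc. end up in data as normal keys.
    """
    meta = {}
    data = {}
    for style, key, value in pairs:
        if style == "unquoted" and key in SPECIAL_KEYS:
            if key == "#" and not keep_comments:
                continue               # "just the data" mode strips comments
            meta[SPECIAL_KEYS[key]] = value
        else:
            data[key] = value
    return meta, data
```

So `!: classname` lands in `meta` while `"!": just another key` stays in the data, with no `__!__` rewrite involved.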
> Additional keys and value patterns may be used in an
> application-specific way. For example, enhanced reference
> mechanisms (path-based, URL-based); additional value types
> (date, currency, big-int, binary, etc.);
>
> If use of any such enhancement becomes widespread enough it
> may be included in an updated version of the "recommended"
> list above.
>
> Agreed?
I'm almost positive the answer is "Yes!". Please clarify my points above.
The only thing we have left to do is to figure out the top level productions.
My suggestion is:
"a list of zero or more maps and lists, separated by groups of two or
more contiguous EOL characters".
- Leading EOL chars are ignored.
- Top level maps and lists cannot be empty.
- An empty file (or one containing just EOL chars) is a valid list of zero
YAML top-level nodes.
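In code, the splitting rule might look like this. It is a sketch that assumes EOL means "\n" and ignores the question of blank lines inside block scalars; `split_top_level` is a made-up name:

```python
import re

def split_top_level(text):
    """Split a file into the source text of its top-level nodes."""
    body = text.strip("\n")        # leading (and trailing) EOLs are ignored
    if not body:
        return []                  # empty file: zero top-level nodes
    # Two or more contiguous EOLs separate top-level nodes.
    return re.split(r"\n{2,}", body)

# Usage: two top-level maps separated by a blank line.
print(split_top_level("\na: 1\nb: 2\n\nc: 3\n"))
```

Each returned chunk would then go through the lexer on its own.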
Looking forward to your reply.
Cheers, Brian