From: Oren Ben-K. <or...@ri...> - 2001-08-08 14:08:33
Brian Ingerson [mailto:in...@tt...] wrote:

> I thought we had agreed that:
>
>     key:
>         !: classname
>         "!": just another key
>
> would be parsed into an applications map as (possibly):
>
>     key:
>         __!__: classname
>         "!": just another key
>
> You just can't use "__!__" for a key, or whatever the
> implementor decides.

Well, we never agreed on such a mapping - it was raised as a possible
mapping for '!class'. As far as I recall it falls under Clark's "no
shorthands!" approach... On the other hand, I think that the re-write
isn't really necessary, assuming the low-level API knows whether it has
seen "!" as opposed to !. See below...

> > > I disagree. If the application can control how the text is
> > > emitted (simple vs. quoted) and can also tell which form the
> > > parser encountered, then the heuristics for numbers could be
> > > put at the application level. There would be a suggested
> > > implementation (like yours below) but it would not be mandatory.
> > > The default would always be string. That means that some data
> > > might not round trip between Python and Foo. But how could it
> > > anyway? Foo does not support integers. :)
> >
> > I'm unaware of any language which doesn't have the concept
> > of integers and floats :-)
>
> A) I think that there *are* languages that only have strings.
> What about shell scripts? I think Tcl at least used to be all
> strings. There may be others in the future.

Shell doesn't have references either. Or nulls. So you'd want to make
them part of this optional layer as well. Well, using "everything is a
string" is my favorite approach (even if referencing/typing a scalar
becomes a bit more awkward). Are you willing to go that far? If so, see
below...

> > How would the application tell what form the parser has
> > encountered, if the information model only allows it to return
> > one of int/float/string/null?
> > Surely not by some meta-information
> > attached to the data - that would bring us right back to the
> > 'class' magical attribute. Same problem exists for quoting "!" as
> > a key, etc.
>
> No, the low level API needs to return what type of quoting
> was used. The mid level API will decide what data type to
> encode it in for that particular language. We will of course
> have guidelines that should be followed, similar to the patterns
> you posted. The end user API will not care. It will just
> expect that the correct thing was done if the languages are
> homogeneous, and that the best possible thing was done if the
> languages are heterogeneous.

It seems we are saying the same thing, then. What you call "the low
level API" is what I called "the application being able to tell the
parser what to do". What you call "the middle level API" is what I
called "the YAML parser API", and of course we both agree that the high
level API is just the native data structures of the languages (with
some simple load/save operations).

So, one way to do YAML is:

- In the low level API, everything is a string. Including '~', '*12',
  etc. Scalar style is reported. Call this the "lexer" level. A token
  would be something like:

      (indent-level, style, text)

  where "indicator" would be one of the "styles". So:

      key: value
      another:
          key: |
              code

  Would become:

      (0, unquoted, key)
      (0, indicator, :)
      (0, unquoted, value)
      (0, unquoted, another)
      (0, indicator, :)
      (1, unquoted, key)
      (1, indicator, :)
      (1, indicator, |)
      (2, block, code)

  ... or something like that, anyway - a good, solid basis to work
  with, no matter how you intend to process the data. I like it!

- A middle layer, under application control, converts this into a
  native data model. It handles ints, floats, nulls, references and
  classes. "!" is left unchanged, ! is used to look for a class name.
  "&" is left unchanged, & is used to define an anchor. "~" is left
  unchanged, ~ becomes a null. No rewrite necessary! Call this the
  "parser" level.
  Easy to write and to extend (a generic implementation could allow an
  application to attach "handlers" for certain key or value patterns).

- The application works with the native data model.

- A middle layer, under application control, converts this into events.
  Literal !, & and ~ are forced to be quoted (see, no re-write
  necessary!). Unquoted !, & and ~ are inserted to reflect classes,
  anchors and nulls. Call this the "serializer" level.

- The low level events, including explicit style instructions, are used
  to generate an output file. Call this the "printer" level.

If that is correct, I'm with you. I especially like it because it
allows us to define a really low-level lexer level API which we can
start working with, without having to enter into the details of an
incremental parser-level API (which is a very big deal).

The only down side: anchors and classes can't be given to lists. And
attaching them to scalars is problematic. They can only be easily given
to maps (using ! and & keys). Of course, one can always wrap a list or
a scalar in a map which will be stripped by the parser...

At any rate, are we in agreement here?

> We will have to bite the lookahead bullet, yes. But as I've
> always said, "don't burden the user just to save a little
> complexity from the implementation side". Once we write the
> code, we'll get to use an excellent syntax. We won't even
> remember whether or not it was hard to do.

What bothers me isn't the complexity. It is that if we do want to have
a streaming YAML and handle large scalar values, we'll have an
efficiency problem. Of course, an implementation may simply choose to
limit the lookahead. If a value is larger than BUFSIZE it will treat it
as a value and hope for the best. This BUFSIZE must not be too small -
say we require it to be no smaller than 512 characters? And one can
always use % to force the issue.

> > key: ''This value 'starts \and/ ends' with a single ''
>
> Yes. This is a good thing. Data::Denter has actually always
> worked that way.
> It was just a momentary lapse :)

OK :-) Note that this means there's no good reason to have both ' and "
to quote strings-with-escapes. " is enough. Using ' would be confusing
to the Perl people (they'd expect quoting not to be done). Instead,
let's allow ' to be used optionally around simple scalars. This way, "
means escapes are done, ' means they aren't. And ' is optional, to be
used only when necessary (first character is an indicator, etc.)

> I really don't care what the sigil is. I just won't allow swapping the
> '-'ish thing for a trailing indicator. It gives me the feeling that the
> ending quote is part of the data. And what if the data does end with a
> tick?

Not a problem:

    Block: '
        This block ends with a single ''

Just like:

    "This string ends with a single ""

But I'm starting to warm to your POV...

> What I really don't like is the quote starting on a different
> line than the data. That will just confuse the heck out of
> people. That is never done in programming. Why do it in YAML?

Hmmm. That is a good point. OK, I'm convinced (see below for a
proposal).

> > - In your proposal, it is impossible to write a multi-line
> >   unescaped simple scalar that starts at the next line. In mine
> >   it is possible. Point for me?
>
> Not true AFAIK.
>
>     key:
>         simple
>         scalar
>
> As long as there is no colon.

Ah, I see. No colon in the first line only, because we don't allow
multi-line simple-scalar keys, right?

    key:
        this is not
        a key : value

    key:
        this is a key : value

Hmmm.

> Sorry if I seem one-sided. I just want to get coding. I (and
> others) need YAML for real life projects three weeks ago. I
> think the 3 quotes idea is going in the wrong direction. Please
> consider my proposals, and offer me a counter. I'm going out for
> lunch for 2-3 hours. I'll be back after that.
>
> Let's get this thing locked :)

OK, here's a compromise. It is 99% your syntax, 1% mine:

We use a lexer/parser/application layering along the lines I described
above.
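
To make the lexer level concrete, here is a minimal Python sketch of the
(indent-level, style, text) token stream described earlier. It is only
an illustration under loud assumptions: the function names `lex` and
`classify`, the style names "double-quoted" and "single-quoted", and the
fixed four-space indent are all hypothetical, and colons inside quoted
text, lists and escapes are not handled.

```python
from typing import List, Tuple

Token = Tuple[int, str, str]  # (indent-level, style, text)


def classify(text: str) -> Tuple[str, str]:
    """Report the scalar style and strip the quotes; style names are assumed."""
    if len(text) >= 2 and text[0] == text[-1] == '"':
        return "double-quoted", text[1:-1]
    if len(text) >= 2 and text[0] == text[-1] == "'":
        return "single-quoted", text[1:-1]
    return "unquoted", text


def lex(source: str, indent_width: int = 4) -> List[Token]:
    """Turn a tiny 'key: value' subset into tokens; every scalar stays a string."""
    tokens: List[Token] = []
    block_level = None  # level of the line that opened a '|' block, if any
    for raw in source.splitlines():
        if not raw.strip():
            continue
        level = (len(raw) - len(raw.lstrip(" "))) // indent_width
        text = raw.strip()
        # Lines indented deeper than the '|' line belong to the block scalar.
        if block_level is not None and level > block_level:
            tokens.append((level, "block", text))
            continue
        block_level = None
        key, sep, rest = text.partition(":")
        tokens.append((level, *classify(key.strip())))
        if not sep:
            continue
        tokens.append((level, "indicator", ":"))
        rest = rest.strip()
        if rest == "|":
            tokens.append((level, "indicator", "|"))
            block_level = level
        elif rest:
            tokens.append((level, *classify(rest)))
    return tokens


SAMPLE = "key: value\nanother:\n    key: |\n        code\n"

for token in lex(SAMPLE):
    print(token)
```

For the sample above this yields the same nine tokens as the listing
given earlier, and a quoted "!" key is reported with a quoted style, so
a higher layer can tell it apart from an unquoted !.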
At the lexer level:

- |, _- : block, as today's blocks.
- "quoted text" : as today's quoted string.
- 'quoted text' : as today's quoted string, minus the escaping.
- unquoted text : is the default. Is the same as 'quoted text', but is
  subject to type processing by the parser.
- @ is optional. Only required if the list is empty.
- % is optional. Only required if the map is empty, or if the first key
  in the map is "too large" (limit to be no smaller than 512
  characters).
- A scalar value starting at the next line must be quoted if it
  contains : anywhere in the first line.
- *Every* scalar is a string.
- There are no shorthands.
- The information model is map/list/string.

At the parser level:

Mapping from this model to the native data structures of a given
language is application-specific. However, the following data
structures are common to a wide range of applications and languages and
their use is recommended (*not* required):

- unquoted ~ becoming a null.
- unquoted (int-pattern) used for specifying an integer.
- unquoted (float-pattern) used for specifying a float.
- unquoted *(int-pattern) used for specifying a reference.
- unquoted & key used for specifying an integer anchor.
- unquoted ! key used for specifying a class.
- unquoted # key used for specifying a comment. The parser may be told
  to strip away comments; useful for an application which isn't going
  to write the file back, and wants "just the data", e.g. an HTTP
  server reading its configuration file.

Additional keys and value patterns may be used in an
application-specific way. For example, enhanced reference mechanisms
(path-based, URL-based); additional value types (date, currency,
big-int, binary, etc.). If use of any such enhancement becomes
widespread enough it may be included in an updated version of the
"recommended" list above.

Agreed?

Have fun,

    Oren Ben-Kiki
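
As an illustration of the recommended (not required) parser-level
mapping for scalar values, here is a minimal Python sketch. The names
`to_native` and `Reference` are hypothetical, and the int and float
patterns are my own assumptions, since this message does not spell them
out. The & ! and # keys are left out; they would be handled when
building maps.

```python
import re
from dataclasses import dataclass

# Assumed patterns; the actual int/float patterns are not given here.
INT_PATTERN = re.compile(r"[-+]?[0-9]+")
FLOAT_PATTERN = re.compile(r"[-+]?[0-9]+\.[0-9]+")
REF_PATTERN = re.compile(r"\*([0-9]+)")


@dataclass(frozen=True)
class Reference:
    """Hypothetical placeholder for '*(int-pattern)', resolved later."""
    anchor: int


def to_native(style: str, text: str):
    """Map one lexer-level scalar (style, text) to a native Python value."""
    # Only unquoted scalars are subject to type processing;
    # any quoted form is "just another string".
    if style != "unquoted":
        return text
    if text == "~":
        return None
    if INT_PATTERN.fullmatch(text):
        return int(text)
    if FLOAT_PATTERN.fullmatch(text):
        return float(text)
    ref = REF_PATTERN.fullmatch(text)
    if ref:
        return Reference(int(ref.group(1)))
    return text


for scalar in [("unquoted", "~"), ("double-quoted", "~"),
               ("unquoted", "12"), ("unquoted", "*12")]:
    print(scalar, "->", repr(to_native(*scalar)))
```

The point of the design is that the style reported by the lexer is the
only thing deciding whether a pattern is applied, so no key or value
ever needs to be rewritten.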