From: Oren Ben-K. <or...@ri...> - 2002-09-11 12:16:43
It is difficult to plunge into such a discussion and answer each
previously raised point individually. I had a night to sleep on this and
had many ideas, most of which you guys have already raised.

I started thinking that forcing the progression from syntax to native
model into a linear path just won't do. This has been raised in several
posts (e.g., the "cube" notion). Clark's representation is almost
exactly what I had in mind:

>        SYNTAX
>          |
>        EVENTS
>        /    \
>    TYPED    UNTYPED
>    EVENTS   GRAPH
>        \    /
>        TYPED
>        GRAPH

I presume from the above that the native data structures of an
application are considered to be a "typed graph". Correct?

Then, thinking about it some more, I had some doubts. These focus on
TYPED-EVENTS. It isn't clear to me that this is (1) useful in any way,
or (2) logically consistent.

Usefulness: Off the top of my head, I can't think of any code that I'd
want to run at that level. Can anyone come up with an example that
would?

Consistency: If I have a collection that contains an alias to a node,
how can I realistically "type" it? Certainly I can type _scalars_... But
collections seem to require tracking all contained alias nodes to their
referents - that is, using a graph.

Hence the above tangle is "inevitably" converted to the neat
progression:

    SYNTAX
      |
    EVENTS
      |
    UNTYPED GRAPH
      |
    TYPED GRAPH

Many of our problems about YPATH etc. were caused by the lack of an
UNTYPED-GRAPH model in the current spec. The current spec looks like:

    SYNTAX
      |
    EVENTS
      |
    (TYPED) GRAPH
      |
    NATIVE

Now, NATIVE is just another way to represent a typed graph (as has been
pointed out in some posts) and it isn't really a separate model (which
was another problem with the current spec). If you look at our
discussions from a certain angle, they are almost all about whether the
graph is typed or not; in fact the core of my "unknown types" proposal
was that it is _not_ typed.
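To make the consistency point concrete, here is a minimal sketch (in
Python) of why aliases push us toward a graph: before a collection can
be typed, each alias inside it has to be replaced by the node it refers
to. The class names (Scalar, Sequence, Alias) and the to_graph function
are hypothetical, made up for this illustration only; mappings are left
out for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Union


@dataclass
class Scalar:
    text: str
    anchor: Optional[str] = None


@dataclass
class Sequence:
    items: List["Serial"] = field(default_factory=list)
    anchor: Optional[str] = None


@dataclass
class Alias:
    anchor: str  # must match an anchor seen earlier in the stream


Serial = Union[Scalar, Sequence, Alias]


def to_graph(node: Serial, anchors: Optional[Dict[str, object]] = None):
    """Copy a serial tree into a graph, replacing each alias by its referent."""
    if anchors is None:
        anchors = {}
    if isinstance(node, Alias):
        # Raises KeyError if no earlier node carried this anchor.
        return anchors[node.anchor]
    result = Scalar(node.text) if isinstance(node, Scalar) else Sequence()
    if node.anchor is not None:
        # A later node with the same anchor overrides this entry, so an
        # alias always resolves to the most recent anchored node.
        anchors[node.anchor] = result
    if isinstance(node, Sequence):
        for item in node.items:
            result.items.append(to_graph(item, anchors))
    return result


doc = Sequence([Scalar("a", anchor="A"), Alias("A")])
graph = to_graph(doc)
# Both entries are now the very same object, so the sequence can be
# typed by looking at its real contents.
shared = graph.items[0] is graph.items[1]
```

Only after this step does the sequence "know" what it contains, which
is the sense in which typing collections requires the graph.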
Perhaps we can continue to use the names SYNTAX, SERIAL, and (untyped)
GRAPH, and rename NATIVE to TYPED:

    SYNTAX -(parser)-> SERIAL -(deserializer)-> GRAPH -(loader)-> TYPED
    TYPED -(dumper)-> GRAPH -(serializer)-> SERIAL -(emitter)-> SYNTAX

I think that nicely captures the differences in semantics of each model.
I also think that it is consistent with everyone's proposed models
(Tom's included :-).

It is also accepted by all that any specific piece of code may use
information from any level it wishes to. It is OK to write a YPretty
that provides some type-specific processing on the one hand (e.g.,
reformats integers from decimal to hex), while at the same time
explicitly preserving key order and throwaway comments.

So far I believe I have expressed a "consensus" reached on the list. Now
all that's left is the "small" issue of the definitions of the separate
models. The problem is that these "small" differences have "big"
consequences later on.

I thought that to discuss this in any concrete way we need to do so in
the context of applications that make use of particular models. This
turns out to be very hard to do, because our two canonical examples
straddle models.

YPretty works mostly at the SYNTAX level; it is aware, for example, of
comments, how long lines were wrapped, how characters were escaped and
so on. On the other hand, YPretty may reach as far as the TYPED model,
e.g. to convert the format of integers or of Booleans.

YGrep works mostly at the TYPED level. It needs to be able to do
type-specific comparisons. However, when a type is not "known", it must
resort to using the GRAPH model. Also, YGrep may not be above reaching
down as far as the SERIAL model (e.g., it is a nice property of a YGrep
to emit matched keys in the same order as they were given in the input
text file).

In fact, the only piece of code that we can say anything clear about is
the "application itself" that works in the TYPED model and _only_ in the
TYPED model.
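The YGrep behavior described above (typed comparison when the type is
known, GRAPH-level comparison otherwise) could look roughly like the
following Python sketch. Everything in it is an assumption made for
illustration: the ScalarNode class, the KNOWN table, and the family
names "!int", "!str" and "!ip" are not taken from any spec or library.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class ScalarNode:
    family: str  # type family, e.g. "!int", or one this tool does not know
    text: str    # the scalar's characters as seen in the GRAPH model


# Hypothetical table: known type family -> converter to a comparable value.
KNOWN: Dict[str, Callable[[str], object]] = {
    "!int": lambda s: int(s, 0),  # base 0 accepts decimal and 0x.. hex
    "!str": str,
}


def matches(node: ScalarNode, query: ScalarNode) -> bool:
    """Compare at the TYPED level if possible, else fall back to GRAPH."""
    if node.family != query.family:
        return False
    convert = KNOWN.get(node.family)
    if convert is None:
        # Unknown type family: all we can compare is the raw text.
        return node.text == query.text
    try:
        return convert(node.text) == convert(query.text)
    except ValueError:
        # Text that is not valid for its family never matches.
        return False


hex_hit = matches(ScalarNode("!int", "0x10"), ScalarNode("!int", "16"))
ip_hit = matches(ScalarNode("!ip", "10.0.0.1"), ScalarNode("!ip", "10.0.0.1"))
ip_vs_int = matches(ScalarNode("!ip", "16"), ScalarNode("!int", "16"))
```

Here a search for the integer 16 finds "0x10" but is not confused by a
node of an unknown family that happens to contain the same characters.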
As a result we have two anchor points - the SYNTAX and the TYPED model -
and anything else is subject to debate because YPretty and YGrep can do
their work "anyway".

Another way to try to clarify what is where is to focus on the modules
that convert between the models. That turns out not to be a big help
either; there's nothing "obviously" wrong with either the parser or the
loader doing typing according to regexps (to give an extreme, but very
relevant, example).

A weaker way to resolve issues is to consider any syntactical effects
they may have, and decide according to our goal priorities: readability
first; interaction with scripting languages second; using native data
structures third; and so on.

I think we all mostly agree on readability; the syntax we have is pretty
stable, so that minor changes to it are easily labeled as more or less
readable (e.g., () around integers are obviously less "readable"). The
second and third goals require that we determine the target scripting
languages. We've focused on Perl and Python, with me watching out for
Java, and we now have Ruby and PHP folk as well, which is a pretty good
cross-section of languages, I think.

So. Enough chatter. I want to respond to Clark's 4-model proposal in the
context of all the above, especially the latter point.

> [SYNTAX] -> [EVENTS] -> [GRAPH] -> [NATIVE]
>      (parser)   (loader)   (binder)
>
> [SYNTAX] <- [EVENTS] <- [GRAPH] <- [NATIVE]
>      (emitter)  (dumper)  (un-binder?)

I suggested above a rename as follows:

  [SYNTAX] -> [SERIAL] -> [GRAPH] -> [TYPED]
       parser   deserializer   loader

  [SYNTAX] <- [SERIAL] <- [GRAPH] <- [TYPED]
       emitter   serializer    dumper

Since "native" is one way to implement "TYPED", this keeps the loader
and dumper where we all expected them to be.

> [SYNTAX]
>   The syntax stage represents a YAML text file or a
>   data from a socket. It must follow the productions
>   as given below.
>   Included in this information is:
>   - declarations (#YAML:1.0 for example)
>   - comments (# something or other)
>   - encoding (which encoding was used)
>   - styles (which syntax forms were used)

Syntax also includes where long lines were broken, what indentation was
used, which escape sequences were used, and so on. Think of syntax as
the result of running a lexer on the text file. The next step is
obviously:

> (parser)
>   The parser reads information from the SYNTAX stage
>   and produces EVENTS, which is a discrete serial
>   representation of the information.

I think "SERIAL model" is a better term. "Events" smacks of a particular
implementation technique, and not the preferred one at that; I'd much
rather use a pull parser than a push one.

> [EVENTS]
>   Each YAML document is reported as a Node.
>   A node is either a scalar, sequence, pairing, or alias.
>   A scalar is a series of zero or more unicode characters ("string")
>   A sequence is a series of zero or more nodes
>   A pairing is a series of zero or more pairs, where a
>   pair is exactly two nodes, a key and a value.
>
>   Each scalar, sequence or mapping node also has a transfer string.
>   For those nodes without an explicit transfer in the SYNTAX
>   representation, the following rules are used:
>   - If the node is a sequence, its transfer is !sequence
>   - If the node is a pairing, its transfer is !pairing
>   - If the node is an unquoted non-block scalar, its
>     transfer is !unknown
>   - Else, if the node is quoted or block, its transfer is !string

Here I disagree (one of these "small" issues :-). I know I agreed with
the above in the past, but after some reflection I think that a better
way is:

- If a scalar node has no explicit format, the format is "implicit".
- Collection nodes have no format.
- If a node has an explicit type family, that's it.
  Otherwise:
  - If the node is quoted or block, its type family is !string
  - If the node is a pairing, its type family is !implicit-pairing
  - If the node is a sequence, its type family is !implicit-sequence
  - If the node is a scalar, its type family is !implicit-scalar

This is a big deal, really, because it means that just as the loader can
use string-regexps to decide how to type implicit scalars, it may use
ypath-exps to decide how to type collections. I see no big difference
between saying that \d+ is an implicit integer and saying that a map
that contains only x and y keys is a point. This is a marked increase in
readability, since it greatly reduces the need to provide an explicit
transfer. On the other hand, it places a greater burden on the
typing-schema (see below).

>   Each scalar, sequence, or pairing may have an anchor string
>
>   An alias has an anchor string, such that at least one node
>   occurring prior in the stream has an equal anchor.
>
>   Each node has exactly one parent with which it participates.
>
>   Notes:
>   - The pairing preserves key order
>   - The pairing allows for duplicates
>   - These are the minimal elements, more
>     informational elements from the SYNTAX
>     model, such as style may be included

Well, yes; _all_ models define the "minimal information" and libraries
are welcome to provide more.

>   - You can think of this as similar to XML's SAX

Exactly.

> (loader)

Rename: de-serializer.

>   The loader's responsibility is to convert the EVENTS
>   into an in-memory GRAPH. It does this by replacing
>   anchors with the most recently referenced node with
>   the same anchor.
>
> [GRAPH]
>
>   This stage is similar to the EVENTS stage only that:
>   - the alias nodes are not present
>   - the anchor property need not be present
>   - the parent property is not present
>
>   Notes:
>   - each node is allowed to participate in more
>     than one pairing or sequence
>   - You can think of this as similar to XML's DOM
>
>
> ===> Here down is only of interest to those doing
>      native serialization and transformation
>      applications with YAML.

That's a shade too restrictive, unless one views YGrep as a
transformation. Hmmm. I guess it is at that.

> (binder)

Rename: loader.

>   The binder takes information from the GRAPH and
>   resolves all types. The result is a graph using
>   data types NATIVE to the given system.

Rename: TYPED. OK, this is subtle. "By definition" all the data returned
from any YAML library at any level is native (you can't write a C
library that doesn't return a C data type, even if it is "void*"). In
that sense the above is an empty statement.

The idea is that the representation _may_ be just the normal native data
structures used by some application. It may also be some sort of
YamlNode. It may be a combination (e.g., when a mixture of known and
unknown types is used). Hence there's no need for a separate
"functional" vs. "native" model; the current draft has this distinction
(generic vs. native) and it serves no useful purpose.

>   - If a given node's transfer is !unknown then
>     a warning is issued to the user, and it is
>     treated as if it were a !string. This
>     warning can be suppressible, and need not
>     be duplicated for each node.

    !implicit-scalar   => !str
    !implicit-pairing  => !map
    !implicit-sequence => !seq

Hmmm. We have 'pairing' vs. 'mapping', but use 'sequence' across the
levels; we should either find two names for sequences as well, or use a
single name for mappings. I'm not certain that using two names doesn't
confuse the issue. We've had a phase where we had a separate name for
each level and it didn't work too well. I'd stick with 'mapping' all the
way.
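As a rough illustration of the implicit typing discussed above, here is
a Python sketch of a loader step that resolves the three implicit type
families. Scalars are matched by regexp; collections are matched by a
plain predicate function that stands in for a y-exp (whose real syntax
is not defined here). The rule tables, the resolve function and the
"!int" and "!point" families are hypothetical, and a pairing is
simplified to a dict, which drops key order and duplicate keys.

```python
import re
from typing import Any

# Hypothetical scalar rules: (regexp over the whole text, type family).
SCALAR_RULES = [(re.compile(r"[-+]?\d+"), "!int")]


def is_point(content: Any) -> bool:
    """Stand-in for a y-exp: a map that contains only x and y keys."""
    return isinstance(content, dict) and set(content) == {"x", "y"}


# Hypothetical collection rules: (predicate over the content, type family).
COLLECTION_RULES = [(is_point, "!point")]

# Fallbacks when no rule matches, as listed above.
DEFAULTS = {
    "!implicit-scalar": "!str",
    "!implicit-pairing": "!map",
    "!implicit-sequence": "!seq",
}


def resolve(family: str, content: Any) -> str:
    """Return the type family for a node, looking only at its content."""
    if family not in DEFAULTS:
        return family  # explicit type family: that's it
    if family == "!implicit-scalar":
        for pattern, target in SCALAR_RULES:
            if pattern.fullmatch(content):
                return target
    else:
        for predicate, target in COLLECTION_RULES:
            if predicate(content):
                return target
    return DEFAULTS[family]


point_family = resolve("!implicit-pairing", {"x": "1", "y": "2"})
int_family = resolve("!implicit-scalar", "42")
quoted_family = resolve("!string", "42")
```

Note that resolve() never sees the path to the node, only its content,
so this sketch stays within the stricter reading where typing depends
on the node alone.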
Another small point: if the format of a scalar node is 'implicit' it is
converted to 'canonical'.

>   - If a given node's transfer is not known by
>     the binder (the binder is unable to create
>     a native representation); then an error
>     condition occurs and the given text does
>     not have a NATIVE representation.
>
>   - For each pairing, if the set of key nodes
>     contains a duplicate; then an error condition
>     occurs and the given text does not have a
>     NATIVE representation.

The following is a bit of a leap. It refers to alternative ways to
handle !implicit-<kind>. So first we need to say that a loader is
allowed to convert implicit type family and format to explicit type
family and format. The following refers to constraints on how it does
that:

>   - When constructing a NATIVE binding for a scalar
>     node, the conversion must be only a function of
>     the node's character content and transfer.

That is, deciding on the type family and format of scalars can only be
driven by regexps, _not_ by the path to the node. The equivalent for
collection nodes is that deciding on the type family may only depend on
the content of the node, not its path.

Hmmm. I see where this is coming from; it makes implicit typing more
manageable. The problem is that I can't say that, e.g., the 'zip-code'
field is a string no matter what; if integers are used anywhere in the
document, I'd have to quote the field. I'm neutral here right now; I
think we need to discuss this a bit, though.

>   - When constructing a NATIVE binding for a pairing
>     node, the relative order of the pairs may not
>     be taken into account, nor any other information
>     other than the nodes which are paired.
>
>   Notes:
>
>   - This is the built-in binder for a YAML
>     system and is optimized for "object serialization"
>     of common languages like Perl and Python; it is
>     an 80% solution.
For the PHP people: my proposal above that you can implicitly type
collections means that you can de-serialize any node that is a sequence
of one-key pairings into a PHP thingie (what's its name, anyway?). No
!omap required. Isn't YAML great? And it doesn't harm interoperability.
Perl will load it into a list of hashes, but what else can it do? Unless
you add an implicit rule there to load it into some specialized class as
well.

>   - The user may provide their own binder, to
>     transfer into an application-specific structure;
>     for example, into an XML DOM or native C structs.

The output of the loader isn't specified anyway (other than "some native
data types"). The one constraint is that it should be possible to
extract the type family and the same canonical value from the resulting
native object. That's all.

Given the restrictions you stated above, you could just say that the
loader may be given a list of regexps (and y-exps) to drive the implicit
type family and format resolutions. That's all the customization you
really allow.

>   - The loader and binder can be combined into
>     a single step as an optimization to skip the
>     creation of a generic graph intermediate.

Yes.

> [NATIVE]

Rename: TYPED (see above).

>   The native representation for YAML is a graph of
>   native objects. Objects are either Scalar or
>   Collections. Collections are functional, "mappings".

The thing is, it is a requirement that the type family (and the
canonical value) should be implicit in the object; that is, that it
should be possible to "get" them somehow. Each type family implies a
single kind (mapping/sequence/scalar). Lump collections together as much
as you want, I'd _still_ be able to implement an isMapping and an
isSequence function. So it isn't that I disagree so much as that it is
simply impossible given the previous models. But you knew that already
:-)

Perhaps this is a good point to say a word about type families.
Basically, they are the same as in the current spec; each implies a
kind, and scalars have formats (canonical, implicit and explicit). A new
feature I suggested above is that collection type families have a single
y-exp that may be used for implicit typing. Since this is new and has
implications, we'll need to discuss this a bit; I think it is required
for symmetry and that it solves many problems (e.g., the PHP one). A
note: much depends on what this y-exp can check; I expect it at minimum
to be able to specify the kind and/or type family of the sub nodes, as
well as key values for mappings.

> Note: Many YAML tools such as YPATH, YSCHEMA, YTRANSFORM,
>       will work on the Native binding of YAML data.
>       The strict functional requirement will be
>       assumed by these tools.

Work on the TYPED model - for sure. I don't think they'll all be strict
about it, however. I gave the example of a YGrep for an integer in the
presence of an IP address in the file; I think that should work fine.
Obviously validation would be strict, but in general such tools will be
able to employ the GRAPH model when types are unknown.

OK, I have to rush to a meeting. Back in an hour or so.

Have fun,

    Oren Ben-Kiki