From: Oren Ben-K. <or...@be...> - 2004-09-10 23:54:00
Clark made the following observation:

- Given that we allow typing a node by its context (== path), typing it
  by its content alone becomes rare. Practically nonexistent, in fact.

This makes it possible to reconsider the usefulness of distinguishing
between "23" and 23. Take the following example:

    --- !scatter-graph
    - x : 1         # looks like an int
      y: 4.5
      label: Hi
      point-size-in-pixels: 2
    - x : 12.4
      y: 17.0
      label: 8.5    # looks like a float
      point-size-in-pixels: 4
    ...

The schema (== application == intent of the author == etc.) says that
'x' and 'y' are floats, a label is a string, and point-size-in-pixels
is an integer.

Note this solves the "looks like an int/float" problem. The schema says
that 'x' is always a float, so it doesn't care if it matches the int
regexp (and likewise for the label). *This is the typical case*.

Sure, there may be a case where someone will write a mixed list where
entries may be ints, floats or strings, without any way of knowing
which is which without looking inside each entry - but this is SO rare:

    --- !mixed-list
    - 1             # An int
    - !!float 2     # Looks like an int...
    - 2.0           # OK, a float
    - 2.5           # A float
    - Hi            # A string
    - "8.5"         # Looks like a float,
                    # but quoting saves the day
    ...

With this in mind, suppose we remove all distinction between plain and
non-plain scalars. That's certainly as consistent as it gets. What are
the implications?

The first example stays the same! The schema uses the path _anyway_. It
_has_ to, in order to allow 'x : 1' to be valid. In this schema, the
regexp for each type is used for _validation_ (if at all) rather than
detection... in practical terms, it might be just calling atof(). So
the regexp for string matches all floats, and the regexp for floats
matches all ints, and *all is well* because it's the _path_ that is
used to determine the type.

The mixed list, however, suffers a bit. Surprisingly little,
actually... In this case, regexps _are_ used for detection, so any case
of "wrong auto-detection" needs to be tagged.
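To make the two cases concrete, here is a rough Python sketch of typing by path versus typing by content. It is only an illustration: the schema table, the function names and the simplified regexps are assumptions made up for this example, not anything from the spec or from an existing parser.

```python
import re

# Hypothetical schema for the !scatter-graph example: the key (the last
# step of the path) fixes the type, so content is never used for detection.
SCATTER_SCHEMA = {
    "x": float,
    "y": float,
    "label": str,
    "point-size-in-pixels": int,
}

# Simplified detection patterns (assumptions, not the spec's regexps).
INT_RE = re.compile(r"[-+]?[0-9]+")
FLOAT_RE = re.compile(r"[-+]?[0-9]+\.[0-9]*")


def type_by_path(key, text):
    """Typical case: the path picks the type; conversion only validates."""
    return SCATTER_SCHEMA[key](text)  # raises ValueError on bad content


def type_by_content(text, tag=None):
    """Mixed-list case: an explicit tag wins, otherwise detect by regexp.

    Quoting is deliberately not an input here, since the idea is to drop
    the plain/non-plain distinction.
    """
    if tag == "!!str":
        return text
    if tag == "!!float":
        return float(text)
    if INT_RE.fullmatch(text):
        return int(text)
    if FLOAT_RE.fullmatch(text):
        return float(text)
    return text
```

With this sketch, type_by_path("x", "1") gives a float and type_by_path("label", "8.5") gives a string, while type_by_content("8.5") gives a float unless the node is tagged !!str.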
And under proposal #4, quoting _doesn't_ save the day:

    --- !mixed-list
    - 1             # An int
    - !!float 2     # Oops, looks like an int.
    - 2.0           # OK, a float
    - 2.5           # Also a float
    - Hi            # A string
    - !!str "8.5"   # Oops, looks like a float,
                    # and quoting doesn't help...
    ...

My additional observation (leading to proposal #4) is that if '?str'
and '?var' are unified as above, it is no longer useful to have three
different unspecified tags. The main point of having '?' tags was
distinguishing between '?str' and '?var'. After all, the parser reports
the kind of the node _anyway_ (it has to), so it is quite sufficient to
have just one "I have no tag" indication.

What did we get?

- If a node has no tag, the parser reports it has no tag.

- Each node's style, indentation, comments, etc. MAY be discarded by
  the parser. Or they might not. However...

- The native data type for nodes with a tag is determined by the tag
  and only by the tag. This is a restriction on "what is a YAML
  schema". An application may cheat, but then it wouldn't be using a
  YAML schema.

- The native data type for nodes without a tag is determined by the
  node's context (path to the node) and its content, and only by the
  context and content. Again, this is a restriction on "what is a YAML
  schema". It is expected that the vast majority of schemas will use
  only the context ({ x : ... => float, y : ... => float,
  label : ... => string }). "Mixed" data structures are very rare. It
  is OK to use "content" in such cases (e.g., a list of shapes where
  { x : ..., y : ..., r : ... } => circle,
  { x : ..., y : ..., w : ..., h : ... } => rectangle).

- Every native data type must have a tag associated with it. Another
  restriction on "what is a YAML schema".

- Hence, there MUST be a (possibly implicit) step that (possibly
  implicitly) provides a tag for each node having no tag. Let's call
  this step "(automatic) tag completion".

- The '!' and '!!' tag-specifiers are now valid.
  There's no longer a restriction that the "suffix" of a %TAG be
  non-empty.

- !!str, !!map and !!seq move out of the spec.

What did we gain (compared to #3++)?

- No special cases. Hence:

  - No pesky issues about why style bleeds into the info model. It
    doesn't.

  - No pesky issues about how %TAG relates to a simple ! and !!. They
    are treated as usual.

  - No pesky issues about transformations. There aren't any. Why?
    Because automatically computing a value that wasn't explicitly
    specified is completely different from taking a value that _was_
    explicitly given and _replacing_ it by another value. Sure, some
    people might say it is still equivalent at some deep philosophical
    level - let them. I think the above is a good enough answer, and if
    they don't buy it, well, tough, I don't have another answer for
    them :-)

What did we lose (compared to #3++)?

- The ability to use quoting as a way of forcing a scalar to be
  interpreted as a string in a mixed list. Which NEVER happens in any
  config file or data serialization file I can think of (off the top of
  my head). I guess there's some guy in China that actually uses a
  mixed list this way... fine, he'll have to use !!str.

All in all, I think this is more than a fair trade. We trade all the
downsides that were so deeply discussed for just one small one that is
never (well, hardly ever :-) seen in practice.

Thoughts?

Have fun,

    Oren Ben-Kiki