[Yaml-core] Unspecified tags and implicits - take #4 (the return of the NULL)

Clark made the following observation:

- Given that we allow typing a node by its context (== path), typing it 
by its content alone becomes rare. Practically nonexistent, in fact.

This makes it possible to reconsider the usefulness of distinguishing 
between "23" and 23. Take the following example:

	--- !scatter-graph
	- x : 1 # looks like an int
	  y: 4.5
	  label: Hi
	  point-size-in-pixels: 2
	- x : 12.4
	  y: 17.0
	  label: 8.5 # looks like a float
	  point-size-in-pixels: 4
	...

The schema (== application == intent of the author == etc.) says that 
'x' and 'y' are floats, a label is a string, and point-size-in-pixels 
is an integer. Note this solves the "looks like an int/float" problem. 
The schema says that 'x' is always a float, so it doesn't care if it 
matches the int regexp (and likewise for the label).

*This is the typical case*.

Sure, there may be a case where someone will write a mixed list where 
entries may be ints, floats or strings, without any way of knowing 
which is which without looking inside each entry - but this is SO rare:

	--- !mixed-list
	- 1		# An int
	- !!float 2	# Looks like an int...
	- 2.0		# OK, a float
	- 2.5		# A float
	- Hi		# A string
	- "8.5"	# Looks like a float,
			# but quoting saves the day
	...

With this in mind, suppose we remove all distinction between plain and 
non-plain scalars. That's certainly as consistent as it gets. What are 
the implications?

The first example stays the same! The schema uses the path _anyway_. It 
_has_ to, in order to allow 'x : 1' to be valid. In this schema, the 
regexp for each type is used for _validation_ (if at all) rather than 
detection... in practical terms, it might be just calling atof(). So 
the regexp for string matches all floats, and the regexp for floats 
matches all ints, and *all is well* because it's the _path_ that is used 
to determine the type.
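
To make the path-based case concrete, here is a minimal Python sketch. It is not taken from any YAML library; SCATTER_SCHEMA and load_point are names made up for illustration, and the parser output is assumed to be plain mappings of untyped scalar text. The point is only that the key picks the type and the conversion call (Python's float(), standing in for atof()) does the validation:

```python
# Hypothetical schema for the !scatter-graph example: key -> converter.
# The key (i.e. the path) alone decides the native type.
SCATTER_SCHEMA = {
    "x": float,
    "y": float,
    "label": str,
    "point-size-in-pixels": int,
}


def load_point(raw_point):
    """Convert one mapping of untyped scalar text using only its keys."""
    typed = {}
    for key, text in raw_point.items():
        convert = SCATTER_SCHEMA[key]
        # The conversion doubles as validation: float("Hi") raises ValueError.
        typed[key] = convert(text)
    return typed


# What a parser might hand over for the two points above: text only,
# with no record of whether a scalar was plain or quoted.
raw = [
    {"x": "1", "y": "4.5", "label": "Hi", "point-size-in-pixels": "2"},
    {"x": "12.4", "y": "17.0", "label": "8.5", "point-size-in-pixels": "4"},
]
points = [load_point(p) for p in raw]
```

Here 'x : 1' ends up as a float and 'label: 8.5' stays a string, without any regexp-based detection being involved.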

The mixed list, however, suffers a bit. Surprisingly little, actually... 
In this case, regexps _are_ used for detection, so any case of "wrong 
auto-detection" needs to be tagged. And under proposal #4, quoting 
_doesn't_ save the day:

	--- !mixed-list
	- 1		# An int
	- !!float 2	# Oops, looks like an int.
	- 2.0		# OK, a float
	- 2.5		# Also a float
	- Hi		# A string
	- !!str "8.5" # Oops, looks like a float,
			# and quoting doesn't help...
	...
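
A rough Python sketch of what content-based detection for such a mixed list could look like under proposal #4. The regexps are deliberately simplified and resolve_scalar is a hypothetical name, not something from the spec or an existing implementation. Each entry is (text, tag); the style is intentionally absent, since the parser is assumed not to report it:

```python
import re

# Simplified detection regexps, for illustration only.
INT_RE = re.compile(r"[-+]?[0-9]+")
FLOAT_RE = re.compile(r"[-+]?(?:[0-9]+\.[0-9]*|\.[0-9]+)")


def resolve_scalar(text, tag=None):
    """Return the native value for one scalar of a mixed list."""
    # An explicit tag always wins; detection is never consulted.
    if tag == "!!int":
        return int(text)
    if tag == "!!float":
        return float(text)
    if tag == "!!str":
        return text
    if tag is not None:
        raise ValueError("unknown tag: " + tag)
    # No tag: fall back to the content. Quoted or not makes no difference,
    # because the style never reaches this function.
    if INT_RE.fullmatch(text):
        return int(text)
    if FLOAT_RE.fullmatch(text):
        return float(text)
    return text


mixed = [
    ("1", None),
    ("2", "!!float"),
    ("2.0", None),
    ("2.5", None),
    ("Hi", None),
    ("8.5", "!!str"),
]
values = [resolve_scalar(text, tag) for text, tag in mixed]
```

Without the !!str tag, the last entry would be detected as a float whether or not it was quoted in the source.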

My additional observation (leading to proposal #4) is that if '?str' and 
'?var' are unified as above, it is no longer useful to have three 
different unspecified tags. The main sense of having '?' tags was 
distinguishing between '?str' and '?var'. After all, the parser reports 
the kind of the node _anyway_ (it has to), so it is quite sufficient to 
have just one "I have no tag" indication.

What did we get?

- If a node has no tag, the parser reports it has no tag.

- Each node's style, indentation, comments, etc. MAY be discarded by the 
parser. Or they might not. However...

- The native data type for nodes with a tag is determined by the tag and 
only by the tag. This is a restriction on "what is a YAML schema". An 
application may cheat, but then it wouldn't be using a YAML schema.

- The native data type for nodes without a tag is determined by the 
node's context (path to the node) and its content, and only by the 
context and content. Again, this is a restriction on "what is a YAML 
schema".

It is expected that the vast majority of the schemas will use only the 
context ( { x : ... => float, y : ... => float, label : ... => string } ). 
"Mixed" data structures are very rare. It is OK to use "content" in 
such cases (e.g., a list of shapes where { x : ..., y : ..., r : ... } 
=> circle, { x : ..., y : ..., w : ..., h : ... } => rectangle).

- Every native data type must have a tag associated with it. Another 
restriction on "what is a YAML schema".

- Hence, there MUST be a (possibly implicit) step that (possibly 
implicitly) provides a tag for each node having no tag. Let's call this 
step "(automatic) tag completion".

- The '!' and '!!' tag-specifiers are now valid. There's no longer a 
restriction that the "suffix" of a %TAG be non-empty.

- !!str, !!map and !!seq move out of the spec.
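
As a sketch of the "(automatic) tag completion" step described in the list above, here is one way it might look in Python. Everything in it is an assumption made for illustration: the Node class, the rule tables, and all tag names (including the '!str', '!seq' and '!map' defaults) are hypothetical, and a single key stands in for the full path to keep it short. A node that already has a tag is left alone; a node without one gets a tag from its context (the key) or, for mappings, from its content (the set of keys present):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    kind: str                  # "scalar", "seq" or "map", as reported by the parser
    value: object              # str, list of Node, or dict of str -> Node
    tag: Optional[str] = None  # None means "the parser reported no tag"


# Hypothetical schema rules. Scalars are typed by context (the key)...
TAG_BY_KEY = {k: "!float" for k in ("x", "y", "r", "w", "h")}
# ...and mappings by content (which keys are present).
TAG_BY_KEYSET = {
    frozenset(("x", "y", "r")): "!circle",
    frozenset(("x", "y", "w", "h")): "!rectangle",
}


def complete_tags(node, key=None):
    """Fill in a tag for every untagged node; explicit tags are never replaced."""
    if node.tag is None:
        if node.kind == "map":
            node.tag = TAG_BY_KEYSET.get(frozenset(node.value), "!map")
        elif node.kind == "seq":
            node.tag = "!seq"
        else:
            node.tag = TAG_BY_KEY.get(key, "!str")
    if node.kind == "map":
        for child_key, child in node.value.items():
            complete_tags(child, child_key)
    elif node.kind == "seq":
        for child in node.value:
            complete_tags(child, key)
    return node


def scalar(text):
    return Node("scalar", text)


shapes = complete_tags(Node("seq", [
    Node("map", {"x": scalar("1"), "y": scalar("2"), "r": scalar("3")}),
    Node("map", {"x": scalar("0"), "y": scalar("0"),
                 "w": scalar("4"), "h": scalar("5")}),
    Node("map", {"x": scalar("7"), "y": scalar("8")}, tag="!point"),
]))
```

Note that the step only computes tags that were missing; the explicitly tagged '!point' node keeps its tag untouched.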

What did we gain (compared to #3++)?

- No special cases. Hence:

- No pesky issues about why style bleeds into the info model. It 
doesn't.

- No pesky issues about how %TAG relates to a simple ! and !!. They are 
treated as usual.

- No pesky issues about transformations. There aren't any.

Why? Because automatically computing a value that wasn't explicitly 
specified is completely different from taking a value that _was_ 
explicitly given and _replacing_ it by another value.

Sure, some people might say it is still equivalent at some deep 
philosophical level - let them. I think the above is a good enough 
answer and if they don't buy it, well, tough, I don't have another 
answer for them :-)

What did we lose (compared to #3++)?

- The ability to use quoting as a way of forcing a scalar to be 
interpreted as a string in a mixed-list. Which NEVER happens in any 
config file or data serialization file I can think of (off the top of 
my head). I guess there's some guy in China that actually uses 
mixed-list this way... fine, he'll have to use !!str.

All in all, I think this is more than a fair trade. We trade all the 
downsides that were so deeply discussed with just one small one that is 
never (well, hardly ever :-) seen in practice.

Thoughts?

Have fun,

	Oren Ben-Kiki