Thread: [Yaml-core] Oren's Take


Oh boy. Tossing an idea to this list is like lighting a fuse :-)

OK (deep breath), let's see...

Brian's basic proposal:

> delineator ::= ':' descriptors? indicator lwsp+  
> descriptors ::= descriptor ( ':' descriptor )* 
> descriptor ::= class | anchor | index
> class ::= '!' classname
> anchor ::= '*' anchorname

This should be '&' anchorname - right?

> index ::= '#' indexnumber      # possible syntax for sparse lists
> indicator ::= '' |             # List or map or inline scalar

This would be an <eol> instead of '', I think? Of course that would mean
restructuring the syntax a bit...

>               ':' |            # quoted or unquoted nextline scalar 
>               '|' |            # block with trailing newline
>               '\' |            # chomped block (no trailing nl)
>               '~'              # undefined or null value
> lwsp ::= #x20 | #x09

I'm going to rephrase it as independent modifications to my proposal:

1. Use ':' instead of ' ' to separate "descriptors".

I must say I'm not enamored of this. I find:

	this:!class:&001 value

To be much less human-friendly than:

	this: !class &001 value

Humans really like white space as separators. What is the rationale for
using ':' instead of ' ' here? The change in syntax is trivial:

descriptors ::= ( lwsp+ descriptor )+
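
As a rough illustration of the white-space-separated form, here is a minimal Python sketch of how a reader might split such a line. It is not part of either proposal: the name `split_entry`, the restriction to '!' and '&' descriptors, and the assumption that the key contains no ':' are all made up for the example.

```python
import re

# One descriptor: white space, then '!' (class) or '&' (anchor) and a name.
# Assumption: a name is any run of characters other than space and tab.
DESCRIPTOR = re.compile(r"[ \t]+([!&][^ \t]+)")

def split_entry(line):
    """Split 'key: !class &anchor value' into (key, descriptors, value)."""
    key, sep, rest = line.partition(":")
    if not sep:
        raise ValueError("no ':' delineator in line")
    descriptors = []
    pos = 0
    while True:
        match = DESCRIPTOR.match(rest, pos)
        if match is None:
            break
        descriptors.append(match.group(1))
        pos = match.end()
    # Whatever follows the last descriptor is the inline value.
    return key, descriptors, rest[pos:].strip()

print(split_entry("this: !class &001 value"))
```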

2. Use ':' instead of '\' as the next-line scalar indicator. This allows '\'
to be used for chomped blocks instead of '|\'. Personally, I find '\' to be
more readable and less confusing, but I don't see this as a crucial point -
if both of you find:

this::
    value to be
    better than
this:\
    this value
or this: \
    value

Then I can live with it.

3. Making '~' a descriptor instead of an implicit type. I don't see why we
need to do that. Do you want to make references into descriptors as well?
Why?

4. In-line block values:

> >   |  This is a block scalar with a trailing new line
> >   \  This is a block scalar without a trailing new line

We decided against these a while back. Note that the block may not start
with white space (or can it?), and that it may continue in the next line.
I've no strong opinion here either way.

5. Using '-' instead of ':' for list entries.

As Brian pointed out, this is safe:

inline scalar: -1
next line scalar: \
    -1
list entry:
    - 1

The '-' is a tad more readable, but it means another special character. I've
no strong opinion about this. At a pinch I'd go for using '-'.
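
The reason this is safe can be sketched in a few lines of Python. The rule assumed here (a list entry is '-' followed by white space or end of line, anything else starting with '-' is scalar text) is my reading of the example above, and `classify` is a hypothetical helper, not something from the draft.

```python
def classify(line):
    """Classify one line from the example: list entry or scalar text."""
    text = line.strip()
    # '-' alone, or '-' followed by a space or tab, introduces a list entry.
    if text == "-" or text.startswith(("- ", "-\t")):
        return ("list entry", text[1:].strip())
    # Otherwise a leading '-' is just part of the scalar, as in "-1".
    return ("scalar", text)

print(classify("    - 1"))
print(classify("    -1"))
```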

6. Having explicit type names available for text as well as map and list.

Good idea.

7. Using the names '.<type>' instead of '<type>!' for the basic types.

I'm afraid we'd be using up a mechanism which may prove useful for other
purposes: e.g., "relative type names".

(Relative type names mean:

data-island: !org.someone.root-type
    data-element: !.another-type
    ...

They aren't explicitly allowed or disallowed by the spec, but I wouldn't
want to rule them out just yet. I don't want to start a long discussion
about whether they are "good", "bad" or "ugly" - I just want to leave them
as an option for the future). Besides, we may come up with some other great
use for ".<type>". Using "<type>!" is much safer.

8. '#<index>' descriptor for sparse lists.

I like the idea of sparse lists, but I don't feel strongly about the need to
support them. They can always be emulated by having explicit null entries. I
also dislike making '#' an indicator, because that would rule out using it
as a key for "comment" values. Likewise making '[' or '(' be indicators
isn't a good idea. How about '^' instead?

sparse-list:
    : ^12 Indexed entry.
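
To make the "emulated by explicit null entries" point concrete, here is a small Python sketch that expands indexed entries into a dense list. Zero-based indices and the helper name `expand_sparse` are assumptions for illustration; the proposal does not say where indexing starts.

```python
def expand_sparse(entries):
    """Turn {index: value} into a dense list, filling the gaps with None."""
    if any(index < 0 for index in entries):
        raise ValueError("negative index in sparse list")
    # The dense list must be long enough to hold the largest index.
    size = max(entries, default=-1) + 1
    dense = [None] * size
    for index, value in entries.items():
        dense[index] = value
    return dense

dense = expand_sparse({12: "Indexed entry."})
print(len(dense), dense[12])
```

The dense form needs twelve explicit nulls to say what the sparse form says in one line, which is the whole trade-off.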

Clark's comments:
> >   A.  Strengthen the packing requirement, making the
> >       indicator mandatory follow the colon.
> 
> Not sure I get you here. The indicator *is* mandatory. It 
> just happens that one possible indicator is ''.

I agree.

> >   B.  We strengthen the : indicator to specifically
> >       mean an un-quoted, un-escaped multi-line scalar.
> 
> What about it's next-line property? 
> I don't see the need for this.

I agree - and therefore I think '\' is better. Its intuitive meaning rules
out this kind of confusion.

> >   C.  For multi-line keys, we strictly use the 
> >       quoted form.  (for readability only)
> 
> OK by me. (I think :) Oren?

There's no technical reason for this restriction, but I can live with it.

> > If I use the same keys, they are meant as the proper
> > replacement.

So repeating a key in a map means overwriting the value? I'd rather not.
Some errors will go undetected, and round-tripping will lose data. Any good
use case for this?
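
For comparison, a loader that treats a repeated key as an error is only a few lines. This Python sketch uses a hypothetical `build_map` helper and plain (key, value) pairs standing in for parsed map entries.

```python
def build_map(pairs):
    """Build a dict from (key, value) pairs, rejecting repeated keys."""
    result = {}
    for key, value in pairs:
        if key in result:
            # Silently overwriting here is what would hide errors
            # and lose data on a round trip.
            raise ValueError(f"duplicate key: {key!r}")
        result[key] = value
    return result

print(build_map([("first", "Clark"), ("last", "Evans")]))
```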

Other points you raised:

> I still would like to be able to:
> nextline2::
>     "Multi-line quoted scalars are rather easy to do.  It \
>     is simply a quoted \"inline\" scalar which happens \
>     to extend multiple lines.  It needs no special treatment. \
>     And it removes the exceptional case from the multi-line \
>     unquoted, unescaped, : scalar.\n"

I agree.

> > | standalone reference:*001
> > | empty map:!.map
> > | empty list:!.list
> > | another null:!.null
> > 
> > thoughts::
> >     I'm uncomfortable with this since I'd like
> >     each node to have a kind and a type in the
> >     information model.  Kind = map, list, scalar
> >     And type is user defined.  More thoughts here.
> >     I have a hunch that merging them is dangerous.
> 
> I'm inclined to agree with you. I was just showing that if 
> you open this up for empty lists and maps, you really need
> to extend it for all types.
> 
> So if we take this away, how do we do empty lists and maps? Oren?

I agree with Clark that kind != type, and that the two should be kept
strictly independent. That is, !<type> says nothing about the kind of node,
only what it is de-serialized into. I also think that my original proposal
is consistent with both statements :-)

That's because talking about an empty map or empty list is talking about its
type, not its kind! The semantics of:

    empty map : !map!

In my proposal is:

- This is a *scalar node*.
- Its value is the empty string.
- Its explicit type is "map!".
- The only valid value for this type is the empty string.
- This value is deserialized into an empty map of the same native data type
as what untyped maps are deserialized into.

There's no magic at all involved, and I don't see that we need to add any. I
learned that lesson well from my reference syntax fiasco.
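
The semantics listed above can be sketched in Python as follows. Everything named here (`ScalarNode`, `CONSTRUCTORS`, `deserialize`) is hypothetical, and treating "list!" the same way as "map!" is an assumption by analogy. The point is only that the node stays a scalar and the type name alone picks the constructor.

```python
from dataclasses import dataclass

@dataclass
class ScalarNode:
    value: str           # the text of the scalar
    type_name: str = ""  # explicit type such as "map!"; "" means untyped

def _empty_only(factory):
    """Constructor for a type whose only valid value is the empty string."""
    def construct(text):
        if text != "":
            raise ValueError("only the empty string is valid for this type")
        return factory()
    return construct

# Type name -> constructor taking the scalar's text.
CONSTRUCTORS = {
    "map!": _empty_only(dict),
    "list!": _empty_only(list),
    "text!": str,
}

def deserialize(node):
    # Untyped or unknown scalars fall back to plain text in this sketch.
    construct = CONSTRUCTORS.get(node.type_name, str)
    return construct(node.value)

print(deserialize(ScalarNode("", "map!")))
```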

That said, I agree with Brian that we should have a "text!" type along with
"map!", "list!" and "ref!". I disagree that null, binary, block and (gasp)
chomped-block belong to that set (I think Clark shares my view in this).

Therefore:

> > | adjustments from other emails:
> > |     :
> > |         existing:!com.clarkevans.Person
> > |             first: Clark
> > |             last: Evans
> > |         missing:!com.clarkevans.Person
> > |     :
> > |         That's it actually :)
> > 
> > notes:: 
> >     This is still problematic -- as missing is ambiguous.
> >     It could be a map, list, or single-line scalar.
> >     More thought on this is required.
>
> It looks like we need something like '%' and '@' to mean 
> (*only*) empty map or list.

Why? 'missing' is clearly a:
- Scalar node.
- With value: "".
- Type: com.clarkevans.Person.

Now, a well behaved "com.clarkevans.Person" de-serialization mechanism could
treat it as an error. That's perfectly OK. Or it could accept the empty
string as a way to indicate "run the default constructor". Or perhaps there
is a constructor taking a string value and the above is perfectly legal
already. Either way, I don't see the problem.

I don't see the need for:

> missing map based object:!com.clarkevans.Person%

What does "map based object" mean? It can't mean "deserialized into an empty
map" because it should be deserialized into a "com.clarkevans.Person"
object.

Clark also wrote:
> > For implicit types ::
> > 
> >     My initial thoughts is that this only works
> >     with one-line, non-quoted scalars.  There should
> >     be a registry with the REGEX for the implicit
> >     type and if it matches, the type is set.  
> > 
> >     The open issue is if this REGEX list is a singleton
> >     (on the yaml.org site) or if it can be local.  If
> >     it is local, then portability suffers as two vendors
> >     may have a two different types for the same regex;
> >     or worse, overlapping regex.  In this case, yaml
> >     fragments from both vendors can't be mixed.  This
> >     would greatly hurt YAML.  Thus, IMHO, the REGEX
> >     list should be a singelton and published in a spec.
> 
> I agree with all of this. Oren?

I don't see how having incompatible implicit types is any better or worse
than having incompatible explicit types. I'm all for a central registry for
"public" implicit types. If you want to use an unregistered type, feel free
to do so and face the consequences. That's the exact equivalent of using
"foo" as an explicit type, knowing it may clash with my "foo" type.

I say it is the application developer's call. Either that or restrict all
explicit types to be DNS or IANA based (let's not go there, please). I don't
see any sense in having a "double standard" here.
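
One way an optional central registry could coexist with local additions is sketched below in Python. The patterns, the names `CENTRAL` and `implicit_type`, and the rule that central entries are tried first are all assumptions for illustration, not anything the draft specifies.

```python
import re

# Hypothetical "public" registry; the patterns are illustrative only.
CENTRAL = [
    ("int", re.compile(r"[-+]?[0-9]+")),
    ("null", re.compile(r"~")),
]

def implicit_type(text, local=()):
    """Return the name of the first pattern matching all of `text`.

    Central entries are tried before local ones, so a local pattern can
    never change the meaning of a value the central registry recognizes.
    """
    for name, pattern in list(CENTRAL) + list(local):
        if pattern.fullmatch(text):
            return name
    return None  # no implicit type: an ordinary string

local = [("hex", re.compile(r"0x[0-9a-f]+"))]
print(implicit_type("42"), implicit_type("0x1f", local))
```

Two applications with different local lists can still disagree about "0x1f", which is exactly the risk the application developer accepts by using an unregistered type.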

Brian Ingerson [mailto:in...@tt...] also said:
> > There are still other things I want to talk about like:
> > 
> > things:
> >     : implicit types
> 
> I want them.
> 
> These should be the same (on systems that support integers :-)
> 
> implicit integer: 42
> explicit integer:!.int 42

Yes, specifying the explicit name for an implicit type doesn't change
anything (that's already true in today's draft). I don't know if we should
call it '.int', though - I see little sense in wasting a short, well known
name on something so rarely used.

> >     : special keys
> 
> These should be the same.
> 
> object1:!myclass
>     foo: bar
> object2:
>     !: myclass
>     foo: bar

Note that
	!: myclass

Is only valid *serialized* YAML if you allow single-character indicator
keys. There's really very little sense in allowing this, and it causes
ambiguities in the parser. Why not always *serialize* it in the proper
(object1) way?

> >     : starting productions

Here there are actually two issues:

- Using a sequence of blank lines to separate top-level entries.

This does save you from using the '-' prefixes. Is it that important? It
seems to me that typically you'll know that a document will contain multiple
entries and you can therefore have a list top-level production. I can live
with it either way, though.

BTW, how much of this is a problem with sequentially reading the entries, even
though you are using a "read ALL into memory" API? Couldn't most of this be
resolved by having an option to your parser to read just the next top-level
list entry/key instead?

- Handling the top-level scalar production.

The problem is that a top-level quoted scalar can't be distinguished from a
top-level map whose first key is a quoted scalar (not without lookahead,
anyway, and we just got rid of that). Perhaps we'll have to require that
such a scalar always be "next line":

\
"top level
multi-line
scalar"

vs.

"top level
multi-line
key" : value

There's no problem with unquoted scalars and block scalars.

> >     : indentation characters (tabs anybody?)
> 
> Use one tab, instead of four spaces.
> 
> Someone else suggested this and it makes a lot of sense in many ways.
> Why not just have one tab for each indentation level?

That was me, and Clark killed it by quoting the guy who wrote Make as saying
"this was the biggest mistake he made" or some such. Perhaps we should
revisit this - as you say, a single tab character for indentation has a lot
of advantages.

> BTW, I am still very much against allowing tabs *or* spaces for
> indentation. We need to pick one. I just happen to be in favor of tabs
> at the moment. 

Agreed.

> >     : round-tripping comments
> 
> Something like:
> 
> this:
>     #: a comment
> 
> same as:
> 
> this:# a comment
>
> Thoughts? I probably care least about this point, but it 
> still is something we should agree on.

I'd rather not have an in-line form of keys in a map... Typically a comment
would be multi-line anyway.

> >     : throw-away comments 
> 
> Unquoted lines beginning with '#' in column 1 should be ignored by the
> parser. This will be so helpful, I consider it a slam-dunk. 
> (People are already requesting this)
> 
> Could be implemented as a standard filter.

I guess we can't avoid this. The problem is that it ruins YAML-pretty-print
as a standard YAML application... Sigh. Perhaps it could be written using
the standard streaming API...

> >     : DWIM modes
> 
> Maybe we don't need them with all this great new stuff.

Good. :-)

> >     : preprocessors/filters
> 
> : space to tab indentation fixer
> : comment stripper
> : validator
> : blank line ignorer

Ideally, I'd consider these to be out of scope for the YAML draft itself.
Otherwise, there would be no guarantee that my YAML processor will read your
output YAML file, because I don't use the right filter.

Which doesn't mean to say that your HTTP server isn't allowed to strip
comments before reading the file. It should just be made clear that this
isn't a YAML file until the comments are stripped, that's all.

Perhaps the best way to handle all of these is as "transfer encodings". For
example, a "transfer encoding" stripping away comments is a very useful
thing, not only for YAML.
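
A comment-stripping "transfer encoding" of this kind could be as small as the Python sketch below. The name `strip_comments` is hypothetical, and the sketch deliberately ignores quoting: a line of a multi-line quoted scalar that happens to start with '#' would be dropped too, so a real filter would need to know about quotes.

```python
def strip_comments(lines):
    """Yield the input lines, dropping those with '#' in column 1."""
    for line in lines:
        if not line.startswith("#"):
            yield line

source = ["# throw-away comment\n", "key: value\n", "    #: a comment\n"]
print("".join(strip_comments(source)), end="")
```

Because the filter runs before the parser sees anything, its output is the YAML file and its input is not.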

> >     : YAML standard library (MIME etc)
> 
> I guess this is a library of preprocessors, filters and 
> postprocessors that
> come standard. This let's you have a bunch of standard usage 
> options that
> aren't in the core info model, but that people like/need nonetheless.

Again, I see a danger in me not being able to read your file because I don't
have the right filter.

To summarize:
0. Let's get rid of '@' and '%' as in my original proposal.
1. Please don't use ':' instead of ' ' to separate "descriptors".
2. We can use ':' instead of '\' as the next-line scalar indicator (though
I'd rather keep '\').
3. Let's keep '~' as an implicit type - not a descriptor. Likewise for
'*<anchor>'.
4. We can add in-line block values (I've no strong opinion).
5. Let's use '-' instead of ':' for list entries (unless someone feels
strongly against it).
6. Let's have explicit type names available for text as well as map and
list.
7. Let's keep the special type names as '<type>!' instead of '.<type>'.
8. We can add a '^<index>' descriptor for sparse lists (I've no strong
opinion).
9. We can restrict unquoted keys to one line (I've no strong opinion).
10. There's no need for a magical empty map/list descriptor beyond the
explicit map/list type name.
11. Let's have an *optional* central registry for implicit types.
12. There's no need to be able to serialize the ! and & keys.
13. We can have multiple line breaks indicate a top-level list (though I'd
rather use a list syntax).
14. Let's reconsider using tabs for indentation.
15. Let's not have a special in-line key/value pair syntax for '#'.
16. Let's consider having throw-away comments and other filters as "transfer
encodings".

Whew. That was a long one...

Have fun,

	Oren Ben-Kiki
