Thread: [Yaml-core] Re: New Draft

The site is updated.

On Tue, Jun 12, 2001 at 11:12:37AM +0200, Oren Ben-Kiki wrote:
| Here's the new draft. I hope you like it.

Clark C . Evans [mailto:cc...@cl...] wrote:
> | address: @=
> |    First address
> |    Additional address
> 
> This as opposed to?
> 
>   address: @
>      First address
>      Additional address
> 
> I am curious what the difference between
> these would be?

At the parser API level, the 'asScalar' method will return null for '@' and
the first element for '@='.

At the native-language data structures level, '@' is de-serialized into a
native list. This is an important use case, both Brian and I agree that it
is vital to be able to do that. '@=' is de-serialized to some data structure
wrapping/extending a native list, which simulates a scalar value. Thus for
the '@=' case it would be possible, in Perl, to write both:

  print $YAML_DOC->{address}; # Prints 'First address'
  print $YAML_DOC->{address}->[0]; # Likewise.

The point is that doing this has a non-trivial cost (performance,
complexity, etc.). There's no point in *requiring* that cost for *each and
every list*. For maps we are OK - the extra cost only occurs if one
explicitly asks for it (by providing a '=' key or by coloring a scalar
value). For lists we need some other way of "explicitly asking for it"...
hence the '@=' syntax.
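
A minimal Python sketch of the '@' versus '@=' distinction at the native level. The class name DefaultList and its as_scalar method are made up for illustration and are not part of any YAML binding. The point is only that the '@=' wrapper carries extra machinery that a plain list does not need.

```python
class DefaultList(list):
    """Hypothetical stand-in for an '@=' list: the first entry is its scalar value."""

    def as_scalar(self):
        # The first entry is the default value; an empty list has none.
        return self[0] if self else None

    def __str__(self):
        # Printing or formatting the list yields the default value.
        value = self.as_scalar()
        return "" if value is None else str(value)


plain = ["First address", "Additional address"]                  # '@'  -> native list
special = DefaultList(["First address", "Additional address"])   # '@=' -> wrapper

print(special)      # First address
print(special[0])   # First address
print(plain[0])     # First address (but print(plain) shows the whole list)
```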

> Hmm.  This is interesting. I think I'll need
> some "implementation experience" before we 
> should move in this direction (and even 
> implementing asScalar... )  Thoughts?

At the parser API level, this is a trivial change. At the native data
structures level, Brian thought the basic concept was acceptable for maps, I
don't see why it can't be applied to lists as well. Brian?

I'd rather push on and finalize the spec; given the level of implementer
interest we've gathered, if we don't do that, we'll end up with "legacy"
YAML subsets we'll have to deal with... Speaking of implementations, I'll
have more free time available as of July 8th, and I promise to spend some of
it on the Java implementation.

Another implementation question: we've added the distinction between a
'binary blob' and 'Unicode text' to the data model - can this be supported
in some way when mapping YAML to native data structures of Perl/Python? (it
does work for Java).

Have fun,

    Oren Ben-Kiki

On Thu, Jun 14, 2001 at 04:15:55PM +0200, Oren Ben-Kiki wrote:
| At the parser API level, the 'asScalar' method will return null
| for '@' and the first element for '@='.

I must say... I don't think I like the proposal 
all that much.  At the parser level, there 
is nothing wrong with "asScalar" being called
on a list or a map and have it return the
first or default value, as appropriate.

| At the native-language data structures level, '@' is 
| de-serialized into a native list. This is an important 
| use case, both Brian and I agree that it is vital to 
| be able to do that.

Agreed.  And for this case, "asScalar" could be
implemented as an external, friend function.  Thus,
those applications which would like to enable the
forward-compatibility would have to call asScalar
on every object that they *assume* is a scalar.

Thus, for the native-language data structure level,
asScalar() is an optional level of complexity
(which may not be possible in some bindings).

| '@=' is de-serialized to some data structure wrapping/extending 
| a native list, which simulates a scalar value. Thus for
| the '@=' case it would be possible, in Perl, to write both:
| 
|   print $YAML_DOC->{address}; # Prints 'First address'
|   print $YAML_DOC->{address}->[0]; # Likewise.

Ok.  As I see this, with the asScalar() proposal, the
first line would be...

    print asScalar($YAML_DOC->{address});  # Prints 'First address'

This requires "extra" effort for those applications
which want to enable the forward-compatible feature, 
but does not burden other applications.

| The point is that doing this has a non-trivial cost (performance,
| complexity, etc.). There's no point in *requiring* that cost for 
| *each and every list*.

I'd rather put the complexity in the applications
which need to be "forward compatible" rather than
in the data...

| For maps we are OK - the extra cost only 
| occurs if one explicitly asks for it (by providing 
| a '=' key or by coloring a scalar value).

I don't get this... how are maps any different
than lists in this respect?

| I'd rather push on and finalize the spec; given the level of 
| implementer interest we've gathered, if we don't do that, 
| we'll end up with "legacy" YAML subsets we'll have to deal with...

Right.  I'd like to hear your opinion on the above.

| Speaking of implementations, I'll have more free time available 
| as of July 8th, and I promise to spend some of
| it on the Java implementation.

Great.

| Another implementation question: we've added the distinction
| between a 'binary blob' and 'Unicode text' to the data model - can 
| this be supported in some way when mapping YAML to native data
| structures of Perl/Python? (it does work for Java).

In python there is not a "byte" data type.  There is, however,
a distinction between unicode and non-unicode strings.  I 
believe that perl may be in the same boat.

Best,

Clark

Clark C . Evans wrote:
> | At the parser API level, the 'asScalar' method will return null
> | for '@' and the first element for '@='.
> 
> I must say... I don't think I like the proposal 
> all that much.  At the parser level, there 
> is nothing wrong with "asScalar" being called
> on a list or a map and have it return the
> first or default value, as appropriate.

Well, agreed. We could easily do that. I think it would be inconsistent with
the rest of the spec, though. There is a difference between "just a list"
and "a list which has a default value", from a modeling point of view. This
difference is apparent at least in one API (the "native data structures").
For consistency, I think this difference should be apparent in all APIs...

> | '@=' is de-serialized to some data structure wrapping/extending 
> | a native list, which simulates a scalar value. Thus for
> | the '@=' case it would be possible, in Perl, to write both:
> | 
> |   print $YAML_DOC->{address}; # Prints 'First address'
> |   print $YAML_DOC->{address}->[0]; # Likewise.
> 
> Ok.  As I see this, with the asScalar() proposal, the
> first line would be...
> 
>     print asScalar($YAML_DOC->{address});  # Prints 'First address'
> 
> This requires "extra" effort for those applications
> which want to enable the forward-compatible feature, 
> but does not burden other applications.

Ah, but the whole point was not to have to do anything special to get this
forward compatibility! If you could predict in advance which pieces of your
schema will evolve in what direction, you could have simply allowed for it
in the first place, or write your own logic to handle the multiple forms.

> I'd rather put the complexity in the applications
> which need to be "forward compatible" rather than
> in the data...

This, IMVHO, would bury the forward compatibility concept. People won't
litter their code with 'asScalar' all over the place, which means that the
moment you do evolve your schema, the code will break. Remember that this
may be something as trivial as adding a comment or a class to a scalar
value...

And anyway, the complexity belongs exactly in the data. There is a
difference between, say:

    address: @=
       First (default) address
       Alternate1 address
       Alternate2 address

And:

    installed applications: @
        Microsoft Word
        WinZip
        Perl

In the first case, the first entry is special. In the second case, it isn't.
This difference should, I think, be reflected in the data model. Using '@='
allows us to model this difference, API issues aside.

> | For maps we are OK - the extra cost only 
> | occurs if one explicitly asks for it (by providing 
> | a '=' key or by coloring a scalar value).
> 
> I don't get this... how are maps any different
> than lists in this respect?

Simple. For a map we already have a syntax to distinguish between a map such
as:

    installed applications: %
        Microsoft Word: 2000
        WinZip: 8.0
        Perl: 5

And:

    price: %
        =: 15
        valid-until: 01-JAN-2002

In the first case, all entries are equal. In the second case, the '=' entry
is special. It provides "the value of the map", just like in the addresses
list the first one provided "the value of the list". The difference is that
for the map case, all the special syntax we need is the special '=' key. For
the list case, we can't do that.

> | I'd rather push on and finalize the spec; given the level of 
> | implementer interest we've gathered, if we don't do that, 
> | we'll end up with "legacy" YAML subsets we'll have to deal with...
> 
> Right.  I'd like to hear your opinion on the above.

I hope the above answers it. It is more than an API issue, it is a modeling
issue (inspired by the color idiom).

> | Another implementation question: we've added the distinction
> | between a 'binary blob' and 'Unicode text' to the data model - can 
> | this be supported in some way when mapping YAML to native data
> | structures of Perl/Python? (it does work for Java).
> 
> In python there is not a "byte" data type.  There is, however,
> a distinction between unicode and non-unicode strings.  I 
> believe that perl may be in the same boat.

A distinction between unicode and non-unicode strings would be enough. Brian
could clear up whether Perl supports that (at least in the later versions).

I'll try to advance the draft this weekend. Please go over the list of
issues I gave at the end of the current one and send me your thoughts, so I
will be able to settle as many of them as possible...

Have fun,

    Oren Ben-Kiki

On Thu, Jun 14, 2001 at 06:49:47PM +0200, Oren Ben-Kiki wrote:
| For consistency, I think this difference should be apparent 
| in all APIs...

Well, for the "pull" based API, if you ask for a node
using the "next" method, it will return the next node
which contains, among other things, the node type.
At this point, the "read" function can be used to read
the "scalar value".  This read function should be
useable regardless of the node type.  If it is 
a scalar, then read works as expected.  If the node
is a map or list, then read returns the default or
first scalar's value as appropriate.

For the "push" interface, if the visitor has a null
begin/end function pair, then a scalar value should
be pushed to the visitor.   Thus, both push/pull 
sequential interfaces have this capability built-in 
without any additional functions.
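
A rough Python sketch of the pull-style behaviour described above. Node, PullReader and the "kind" strings are hypothetical names, and the reader walks an already-parsed sequence rather than a real YAML stream. The map default is taken from the '=' key, as in the draft under discussion.

```python
from dataclasses import dataclass
from typing import Any, Iterator, Optional


@dataclass
class Node:
    kind: str    # "scalar", "list", or "map"
    value: Any


class PullReader:
    """Hypothetical pull-style reader over already-parsed nodes."""

    def __init__(self, nodes):
        self._nodes: Iterator[Node] = iter(nodes)
        self._current: Optional[Node] = None

    def next(self) -> Optional[Node]:
        # Advance to the next node; None signals the end of input.
        self._current = next(self._nodes, None)
        return self._current

    def read(self):
        # Usable for any node type: scalar value, first list entry, or '=' entry.
        node = self._current
        if node is None:
            return None
        if node.kind == "list":
            return node.value[0] if node.value else None
        if node.kind == "map":
            return node.value.get("=")
        return node.value


reader = PullReader([
    Node("scalar", "15"),
    Node("list", ["First address", "Additional address"]),
    Node("map", {"=": "15", "valid-until": "01-JAN-2002"}),
])
values = []
while reader.next() is not None:
    values.append(reader.read())
print(values)   # ['15', 'First address', '15']
```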

| > Ok.  As I see this, with the asScalar() proposal, the
| > first line would be...
| > 
| >     print asScalar($YAML_DOC->{address});  # Prints 'First address'
| > 
| > This requires "extra" effort for those applications
| > which want to enable the forward-compatible feature, 
| > but does not burden other applications.
| 
| Ah, but the whole point was not to have to do anything special to get this
| forward compatibility! If you could predict in advance which pieces of your
| schema will evolve in what direction, you could have simply allowed for it
| in the first place, or write your own logic to handle the multiple forms.

Hmm... I'd write code like this, think of asScalar 
as a cast operator.  If it already is a scalar, then
it returns itself, otherwise it fetches the default
value (for map) or the first value (for list).

If you want the "forward compatibility" and you
want to use the native data structures... then 
you must use the cast.  This isn't so bad, is it?
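
The cast can be sketched in a few lines of Python. The function name as_scalar and the use of plain dict/list values as the native structures are assumptions for illustration. The map default comes from the '=' key and the list default is the first entry, as described above.

```python
def as_scalar(node):
    """Hypothetical cast: a scalar returns itself, a map or list returns its default."""
    if isinstance(node, dict):
        # Map: the value stored under the reserved '=' key, if present.
        return node.get("=")
    if isinstance(node, (list, tuple)):
        # List: the first entry, if there is one.
        return node[0] if node else None
    return node   # already a scalar


doc = {"address": "First address"}
evolved = {
    "address": ["First address", "Additional address"],
    "price": {"=": "15", "valid-until": "01-JAN-2002"},
}

print(as_scalar(doc["address"]))       # First address
print(as_scalar(evolved["address"]))   # First address
print(as_scalar(evolved["price"]))     # 15
```

Code that calls the cast keeps working when a scalar evolves into a list or map. Code that reads the value directly does not.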

| > I'd rather put the complexity in the applications
| > which need to be "forward compatible" rather than
| > in the data...
| 
| This, IMVHO, would bury the forward compatibility concept. People won't
| litter their code with 'asScalar' all over the place, which means that the
| moment you do evolve your schema, the code will break. Remember that this
| may be something as trivial as adding a comment or a class to a scalar
| value...

Ok. So this "hybrid" class would "inherit" from 
both scalar and map/list?  I'm not sure this is
workable:

For python, a string and a list are both sequences 
and share many of the same functions. Thus, you really 
can't have a hybrid class that is both a list and a 
string at the same time because the shared sequence 
interface will cause a problem.
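
A short Python illustration of the clash, using only built-in types. The class name Hybrid is made up, and the behaviour shown is that of CPython 3.

```python
text = "First address"
items = ["First address", "Additional address"]

# Both types answer len() and [0], but with different meanings,
# so a single object cannot satisfy both readings at once.
print(len(text), len(items))   # 13 2
print(text[0], items[0])       # F First address

# CPython also rejects a class that inherits from both built-ins.
try:
    class Hybrid(str, list):
        pass
    hybrid_allowed = True
except TypeError:
    hybrid_allowed = False

print(hybrid_allowed)          # False
```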

For Java, in particular, do the String class and the 
Map or List class have disjoint sets of member functions?  
If not, then I don't see how it will work.  

I don't know enough about Perl, but if the map/list
methods aren't completely disjoint from the scalar
methods then this hybrid approach won't work.

Or... am I missing something?

|     price: %
|         =: 15
|         valid-until: 01-JAN-2002

Ok.  So whenever a map has an = key, it will use
the MapScalar class (a special YAML class) rather
than the native Map?  

Not that I don't like the idea -- it just is that
the sequential interface doesn't need this distinction
and the implementation with a native interface isn't
all that clear, and may not even be implementable.

Not really (unfortunately), since regular Python strings 
will then be encoded using Base64 if we key off this 
distinction.  Yikes.  The entire issue YAR had transfers
upwards into the language itself since it does not have
a byte/character distinction.   Not to be an irritant, but 
perhaps we should re-think this a bit....

Perhaps we need to go back to the unicode/non-unicode
distinction?  Using " " for all unicode values, 
where simple and block scalars are for 7-bit ASCII?

| I'll try to advance the draft this weekend. Please go over the list of
| issues I gave at the end of the current one and send me your thoughts, so I
| will be able to settle as many of them as possible...

Will do.

Clark

On Thu, Jun 14, 2001 at 06:49:47PM +0200, Oren Ben-Kiki wrote:
| Please go over the list of issues I gave at the end of the 
| current one and send me your thoughts...

- Relationship with MIME 

  Other than using Base64, there is no longer any 
  relationship with MIME.

- Character Escapes (within a quoted scalar)

  Ok.  Let us take the more complicated route.  Once a lookup
  table is created, it is a trivial amount of code to add
  the remaining entries.

  \space     Single space ( )
  \\         Backslash (\) 
  \"         Double quote (") 
  \a         ASCII Bell (BEL) 
  \b         ASCII Backspace (BS) 
  \f         ASCII Formfeed (FF) 
  \n         ASCII Linefeed (LF) 
  \r         ASCII Carriage Return (CR) 
  \t         ASCII Horizontal Tab (TAB) 
  \v         ASCII Vertical Tab (VT)

  \uxxxx     Character with 16-bit hex value xxxx (Unicode only) 
  \Uxxxxxxxx Character with 32-bit hex value xxxxxxxx (Unicode only) 
  \xhh       ASCII character with hex value hh 
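
  The lookup-table approach can be sketched in Python as follows. The function name unescape and the error handling are assumptions for illustration. The table entries mirror the list above, with "\space" read as a backslash followed by a literal space.

```python
import string

SIMPLE_ESCAPES = {
    " ": " ", "\\": "\\", '"': '"',
    "a": "\a", "b": "\b", "f": "\f", "n": "\n",
    "r": "\r", "t": "\t", "v": "\v",
}
HEX_ESCAPES = {"x": 2, "u": 4, "U": 8}   # escape letter -> number of hex digits


def unescape(text):
    """Expand the backslash escapes of a quoted scalar using the two tables."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        if i + 1 >= len(text):
            raise ValueError("dangling backslash at end of scalar")
        code = text[i + 1]
        if code in SIMPLE_ESCAPES:
            out.append(SIMPLE_ESCAPES[code])
            i += 2
        elif code in HEX_ESCAPES:
            width = HEX_ESCAPES[code]
            digits = text[i + 2:i + 2 + width]
            if len(digits) != width or any(c not in string.hexdigits for c in digits):
                raise ValueError(f"malformed \\{code} escape")
            out.append(chr(int(digits, 16)))
            i += 2 + width
        else:
            raise ValueError(f"unknown escape \\{code}")
    return "".join(out)


print(unescape(r"\x41\u0042\U00000043"))   # ABC
```

  Adding another single-character escape is one more entry in SIMPLE_ESCAPES, which is the "trivial amount of code" argument.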

- List Scalar Prefixes 

  You seem to really want this one... it's ok if it is optional.

- Anchor Semantics 

  Hmm.  I have not thought of leading zeros being
  automagically trimmed before comparison.  This
  sounds fair enough.

  ...

  On a similar line of thought, I was thinking about 
  "external anchors" which can refer to other YAML texts 
  via http, https, and ftp protocols.  This need not be 
  implemented right away, but this is more or less what 
  I was thinking...  *(http://www.clarkevans.com/text.yaml)
  I'd rather not have this be an "arbitrary URI", but have
  it a URL that actually can be resolved to a given YAML
  text.  i.e., it should be concrete and expected that the
  parser can go out and fetch the entity if it so chooses.
  For now... let's leave this out and add it only if
  it is needed.

- API details 

  | How the API handles the sequence of maps; push/pull issues; 
  | mapping to scripting languages of interest (Python, Perl); etc. 

  Let's work on our first implementations before this
  gets pinned down further.

- Color Idiom

  | It is possible to allow for schema evolution, attachment 
  | of comments, handling unknown classes and many other use 
  | cases by supporting the color idiom. 

  Right.  This is the current thread.  I think a full section
  describing this (regardless of how it is implemented) would
  be very useful.

- Comment Attribute

  | A textual comment may be attached to each node, similar to the class
  | attribute. It is probably best to do this as part of the color idiom,
  | but it could conceivably be added on its own. 

  I like using the color idiom.  Perhaps "reserving" particular
  keys for use...  hmm.

Best,

Clark

Hi.

> And anyway, the complexity belongs exactly in the data. There is a
> difference between, say:
>
>     address: @=
>        First (default) address
>        Alternate1 address
>        Alternate2 address
>
> And:
>
>     installed applications: @
>         Microsoft Word
>         WinZip
>         Perl
>
> In the first case, the first entry is special. In the second
> case, it isn't.
> This difference should, I think, be reflected in the data model.
> Using '@='
> allows us to model this difference, API issues aside.

FWIW, I think this is a much more consistent approach and would like to
extend it to maps as I'll describe below.

> Simple. For a map we already have a syntax to distinguish between
> a map such
> as:
>
>     installed applications: %
>         Microsoft Word: 2000
>         WinZip: 8.0
>         Perl: 5
>
> And:
>
>     price: %
>         =: 15
>         valid-until: 01-JAN-2002
>
> In the first case, all entries are equal. In the second case, the
> '=' entry
> is special. It provides "the value of the map", just like in the addresses
> list the first one provided "the value of the list". The
> difference is that
> for the map case, all the special syntax we need is the special
> '=' key. For
> the list case, we can't do that.

I would suggest that if the '=' key is special that it not be treated like
the non-special keys. I like Oren's @= proposal but would prefer to see maps
follow suit with a %= construct.

My reasoning for this is that I don't believe that the default value should
be considered just another pair in the map (as Oren said above, it's
special). If we're iterating over the pairs in a deserialized YAML map, I
would be surprised to see a pair with a key called "=" if my application
didn't put it there.

So my first thought was to follow the list's pattern and indicate the
default value using a non-pair as the first child like this:

price: %=
  15
  valid-until: 01-JAN-2002

Here the first child of the map-with-default-value isn't a pair but rather
just a node (of any type) much like the list-with-default-value.

I don't like this because it requires that we repeat the default value if it
was already a member of the map. The same problem also occurs with the
current default value mechanism, though. Consider a person map:

person: %
  name: Jason Diamond
  email: ja...@in...

If I wanted to make the value of the "name" pair be the default value, I'd
have to repeat it like this:

person: %
  =: Jason Diamond
  name: Jason Diamond
  email: ja...@in...

or like this using my proposal from above:

person: %=
  Jason Diamond
  name: Jason Diamond
  email: ja...@in...

But I'd rather just do this:

person: %=
  name: Jason Diamond
  email: ja...@in...

Now the first child is the default value (just like lists). I know that maps
are unordered (conceptually) but serialization requires that we order them
and we can actually take advantage of that. When deserializing, all we need
to do is keep track of the key of the first child pair and then get the
value for that key whenever the default value is requested. When
serializing, if there's a default value, we output that first and then the
rest (in alphabetical order minus the default value pair if we need to).
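
Here is a rough Python sketch of that bookkeeping. The function names are hypothetical, the parser is assumed to hand over the pairs in file order, and the email value is a placeholder rather than a real address.

```python
def load_default_map(pairs):
    """Build a map from ordered (key, value) pairs and remember the first key."""
    pairs = list(pairs)
    data = dict(pairs)
    default_key = pairs[0][0] if pairs else None
    return data, default_key


def default_value(data, default_key):
    # The default is looked up by key, so it is never stored twice.
    return data[default_key] if default_key is not None else None


def dump_default_map(data, default_key):
    """Emit the default pair first, then the other pairs in alphabetical key order."""
    rest = sorted(k for k in data if k != default_key)
    keys = ([default_key] if default_key is not None else []) + rest
    return [(k, data[k]) for k in keys]


person, key = load_default_map([
    ("name", "Jason Diamond"),
    ("email", "someone@example.org"),   # placeholder value
])
print(default_value(person, key))       # Jason Diamond
print(dump_default_map(person, key))
```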

The harmony between this and Oren's new list-with-default-value construct
really appeals to me. Not repeating the data and allowing the default value
to use an application-specific key is also a huge advantage, in my opinion.

Jason.

On Thu, Jun 14, 2001 at 11:21:03AM -0700, Jason Diamond wrote:
| But I'd rather just do this:
| 
| person: %=
|   name: Jason Diamond
|   email: ja...@in...
| 
| Now the first child is the default value (just like lists).

Ok.  I actually like having the first entry in the 
map be the default value just like a list.  Nice.
This is much better since it doesn't require the 
"special" map key, equal(=), which someone may 
actually use one day as part of real content...
Having the first entry be the default, just like
lists, is very appealing.

| The harmony between this and Oren's new list-with-default-value
| construct really appeals to me. Not repeating the data and 
| allowing the default value to use an application-specific key 
| is also a huge advantage, in my opinion.

Ok. I do like the "first entry" for both maps and lists.  Sold.

...

However, I still have the problem with the %= vs % and
@= vs @ distinction.  How can this distinction be 
supported in the API?  I don't think it can very 
easily without losing transparency (via asScalar)
and given that fact, having the two variants isn't
all that helpful, is it?

Maybe it is a useful distinction.  Maybe it says that 
it is "ok" to use the first value of the sequence
(map or list) if a scalar is expected.  Hmmm.

Best,

Clark

Clark C . Evans wrote:
> | For consistency, I think this difference should be apparent 
> | in all APIs...
> 
> Well, for the "pull" based API, ...
> For the "push" interface, ...

I'm not arguing you *can't* hide the difference, I'm questioning whether you
*should* hide it, from a purely *modeling* point of view. That is, do we or
don't we want to make explicit the fact that for some lists/maps there
really is a default value, and for others there simply is no such thing?

BTW, I like Jason's idea of "first key" and '%=' to indicate that
difference.

> | Ah, but the whole point was not to have to do
> | anything special to get this
> | forward compatibility! If you could predict in
> | advance which pieces of your
> | schema will evolve in what direction, you could
> | have simply allowed for it
> | in the first place, or write your own logic to
> | handle the multiple forms.
>
> Hmm... I'd write code like this, think of asScalar 
> as a cast operator.  If it already is a scalar, then
> it returns itself, otherwise it fetches the default
> value (for map) or the first value (for list).

You might :-) Most people won't.

> If you want the "forward compatibility" and you
> want to use the native data structures... then 
> you must use the cast.  This isn't so bad, is it?

As a last ditch solution, yes. But I'd much rather have it be completely
transparent.

> Ok. So this "hybrid" class would "inherit" from 
> both scalar and map/list?

Sort of. In Perl at least, I expected that it would simply have some way of
detecting whether it is used in a scalar or list context, and act
accordingly.

> I'm not sure this is workable:

I agree it is shaky.

> For python, a string and a list are both sequences 
> and share many of the same functions. Thus, you really 
> can't have a hybrid class that is both a list and a 
> string at the same time because the shared sequence 
> interface will cause a problem.

And there's no "scalar context" to help you out. Hmmm. Good point.

> For Java, in particular, do the String class and the 
> Map or List class have disjoint sets of member functions?  

In Java it would be easier. 'asScalar' would be the standard way of getting
a value out of a YamlNode, period. In Perl/Python, we really want to be able
to access the value directly, however...

> Or... am I missing something?

Probably I am. This needs more thought. I'll brood on it some more...

Have fun,

    Oren Ben-Kiki

I wrote:
> > Ok. So this "hybrid" class would "inherit" from 
> > both scalar and map/list?
> ...
> > I'm not sure this is workable:
> 
> I agree it is shaky.
> ... I'll brood on it some more...

OK, here's an idea.

Instead of having a class which is both a string and a map/list, which is
impractical, we'll have a class which is a string with one additional
method/member - let's call it '_' (reason below). So:

Original YAML file: %
   address: Default address
   size: 12.5
Perl code: @
    print $doc->{address}; # Print 'Default address'.
    print $doc->{size}; # Print '12.5'.

Evolved YAML file: %
    address: @=
        Default address
        Additional address
    size: %=
        =: 12.5
        accuracy: 0.5
Perl code: @
    print $doc->{address}; # Print 'Default address'.
    print $doc->{address}->{_}->[0]; # Likewise
    print $doc->{address}->{_}->[1]; # Print 'Additional address'
    print $doc->{size}; # Print '12.5'.
    print $doc->{size}->{_}->{'='}; # Likewise
    print $doc->{size}->{_}->{accuracy}; # Print '0.5'

Incompatible YAML file: %
    address: @
        Default address
        Additional address
    size: %
        =: 12.5
        accuracy: 0.5
Perl code: @
    print $doc->{address}->[0]; # Print 'Default address'
    print $doc->{address}->[1]; # Print 'Additional address'
    print $doc->{size}->{'='}; # Print '12.5'
    print $doc->{size}->{accuracy}; # Print '0.5'

(Isn't it lovely that the above is valid YAML? :-) At any rate, as a schema
owner, when you evolve the schema, you have two choices.

- If you care about existing code and data, you do it via '@=' and '%='.
Existing code goes on working, new code is slightly more cumbersome (an
additional '->{_}' for Perl, or '._' for Python). Old data need not be
modified.

- If you don't care about existing code and data, you do it via '@' and '%'.
Existing code is modified to access the proper key/index for the value (but
there's no '->{_}' or '._'). Data is modified to convert the scalar into a
list or a map.

In either case, if someone edits a file and adds some color you don't know
and don't want to know about, your code will go on working. For example,
suppose you have the configuration file:

   my-machine: 192.168.1.17

And someone goes into the file to change it to:

   my-machine: %=
      =: 192.168.1.17
      #: DON'T CHANGE THIS! several clients
         access this address directly - talk
         with the Network Administrator first.

Then you don't have to worry about the code reading this file. It will just
go on working.

So, the rule is - for maps/lists with '%='/'@=', there is a "default value"
(first key/entry); the native data structure is the same as the default
value's data structure (typically a string), with the addition of a
'->{_}'/'._' syntax for accessing the "color" - the full map/list.

Note the above rule works in case the default value is a non-scalar (a list
or a map):

Old file:
    text: string
    map: %
       key: value
    list:
       entry

New file:
    text: #"With color comment" string
    map: %=
       main: %
          key: value
       color:
          some: thing
    list: @=
       @
          entry
       color

$old_doc->{text} eq 'string';
$old_doc->{map}->{key} eq 'value';
$old_doc->{list}->[0] eq 'entry';

$new_doc->{text} eq 'string';
$new_doc->{text}->{_}->{'='} eq 'string';
$new_doc->{text}->{_}->{'#'} eq 'With color comment';
$new_doc->{map}->{key} eq 'value';
$new_doc->{map}->{_}->{main}->{key} eq 'value';
$new_doc->{map}->{_}->{color}->{some} eq 'thing';
$new_doc->{list}->[0] eq 'entry';
$new_doc->{list}->{_}->[0]->[0] eq 'entry';
$new_doc->{list}->{_}->[1] eq 'color';

That's why I chose '_' - we can say it is reserved so it won't collide with
any normal map keys, and it can still be used as an identifier, so in
Python/JavaScript one could write the above code as:

old_doc.text == 'string';
old_doc.map.key == 'value';
old_doc.list[0] == 'entry';

new_doc.text == 'string';
new_doc.text._["="] == 'string';
new_doc.text._["#"] == 'With color comment';
new_doc.map.key == 'value';
new_doc.map._.main.key == 'value';
new_doc.map._.color.some == 'thing';
new_doc.list[0] == 'entry';
new_doc.list._[0][0] == 'entry';
new_doc.list._[1] == 'color';

That's about as clean and powerful as you can get, I think. How about it?
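
For the common case where the default value is a string, the '_' rule can be sketched in Python as a str subclass. The class name Colored is made up for illustration, and this sketch does not cover defaults that are themselves lists or maps.

```python
class Colored(str):
    """Hypothetical string that also carries the full '%=' / '@=' node under '_'."""

    def __new__(cls, default, full):
        obj = super().__new__(cls, default)
        obj._ = full   # the whole map or list, i.e. the "color"
        return obj


address = Colored("Default address", ["Default address", "Additional address"])
size = Colored("12.5", {"=": "12.5", "accuracy": "0.5"})

print(address)               # Default address
print(address._[1])          # Additional address
print(size == "12.5")        # True
print(size._["accuracy"])    # 0.5
```

Old code that treats the value as a plain string keeps working, while new code reaches the extra entries through '_'.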

Have fun,

    Oren Ben-Kiki

Oren's brain trust at work... *smile*

| Original YAML file: %
|    address: Default address
|    size: 12.5
| Perl code: @
|     print $doc->{address}; # Print 'Default address'.
|     print $doc->{size}; # Print '12.5'.

Ok.  Here is same code using asScalar(): @
   print asScalar($doc->{address});
   print asScalar($doc->{size});

| Evolved YAML file: %
|     address: @=
|         Default address
|         Additional address
|     size: %=
|         =: 12.5
|         accuracy: 0.5
| Perl code: @
|     print $doc->{address}; # Print 'Default address'.
|     print $doc->{address}->{_}->[0]; # Likewise
|     print $doc->{address}->{_}->[1]; # Print 'Additional address'
|     print $doc->{size}; # Print '12.5'.
|     print $doc->{size}->{_}->{'='}; # Likewise
|     print $doc->{size}->{_}->{accuracy}; # Print '0.5'

Ok.  Although I must say... this could get
ugly especially as the depth increases.  It 
seems that this proposal moves the notion of
"forward-compatible" into a more elaborate
"backward-compatible".  Thus the burden is
placed on newer code... hmm.

| Incompatible YAML file: %
|     address: @
|         Default address
|         Additional address
|     size: %
|         =: 12.5
|         accuracy: 0.5
| Perl code: @
|     print $doc->{address}->[0]; # Print 'Default address'
|     print $doc->{address}->[1]; # Print 'Additional address'
|     print $doc->{size}->{'='}; # Print '12.5'
|     print $doc->{size}->{accuracy}; # Print '0.5'

Here is both of the above using the asScalar(): @
  print asScalar($doc->{address});
  print asScalar($doc->{size});
  print asScalar($doc->{address}->[0]); # Print 'Default address'
  print asScalar($doc->{address}->[1]); # Print 'Additional address'
  print asScalar($doc->{size}->{'='}); # Print '12.5'
  print asScalar($doc->{size}->{accuracy}); # Print '0.5'

| (Isn't it lovely that the above is valid YAML? :-)

Oh yea... it is.  Fancy that.  I didn't even notice!

...

| At any rate, as a schema owner, when you evolve the 
| schema, you have two choices.
| 
| - If you care about existing code and data, you do it via '@=' and '%='.
| Existing code goes on working, new code is slightly more cumbersome (an
| additional '->{_}' for Perl, or '._' for Python). Old data need not be
| modified.
| 
| - If you don't care about existing code and data, you do it via '@' and '%'.
| Existing code is modified to access the proper key/index for the value (but
| there's no '->{_}' or '._'). Data is modified to convert the scalar into a
| list or a map.

Ok.  Two approaches:

  A. asScalar()

     + External function; you don't get the complexity
       unless you ask for it.
     + No changes to YAML needed.
     + Technique works with non-yaml maps/scalars,
       (ones that have not been saved yet).
     + Explicit cast...

     - Requires foresight for code to have the
       substitutability property.
     - Extra asScalar() function littered 
       throughout code.
     - Requires a specific "key" value, =, to
       be reserved.

  B. %= @= approach

     + Does not require foresight, thus substitutability
       is guaranteed.
     + Does not require a specific key to be reserved.

     - Changes to YAML needed
     - Special object needed to support construct

I like the former, but I guess you all like the latter?

| In either case, if someone edits a file and adds some color you don't know
| and don't want to know about, your code will go on working. For example,
| suppose you have the configuration file:
| 
|    my-machine: 192.168.1.17
| 
| And someone goes into the file to change it to:
| 
|    my-machine: %=
|       =: 192.168.1.17
|       #: DON'T CHANGE THIS! several clients
|          access this address directly - talk
|          with the Network Administrator first.
| 
| Then you don't have to worry about the code reading this file. It will just
| go on working.
| 
| So, the rule is - for maps/lists with '%='/'@=', there is a "default value"
| (first key/entry); the native data structure is the same as the default
| value's data structure (typically a string), with the addition of a
| '->{_}'/'._' syntax for accessing the "color" - the full map/list.
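
The rule above could look roughly like this in Python. This is only a sketch: the class name `ColoredStr` is hypothetical, and it assumes the default value is a string taken from the first entry of the map or list.

```python
class ColoredStr(str):
    """A string that behaves as the default value and keeps its "color" in `._`."""

    def __new__(cls, color):
        # Assumption: the default value is the first entry of the map or list.
        if isinstance(color, dict):
            default = next(iter(color.values()))
        else:
            default = color[0]
        obj = super().__new__(cls, default)
        obj._ = color  # the full map/list, reachable as value._
        return obj


machine = ColoredStr({
    "=": "192.168.1.17",
    "#": "DON'T CHANGE THIS! Talk with the Network Administrator first.",
})

print("Connecting to " + machine)  # old code keeps treating it as a string
print(machine._["#"])              # new code can reach the comment
```

Code written against the plain string keeps working, which is the substitutability property being argued for; the price is the special wrapper class in each binding.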

Yep. Hmm.

Best,

Clark

Ok.  I still don't like option B because it introduces 
two new types of nodes and special support classes 
required for each binding.   A isn't all that great
because it requires "asScalar()" calls everywhere, 
just-in-case.

Let us add another option, C to the list.  This 
introduces a YamlNode which is similar to a DOM
node.  The user would then have the choice of 
binding, native (where substitutability doesn't hold)
or using our object model.

In option C, a YamlNode is introduced... this
has characteristics of a map, list, and scalar.
For this case, the =% and =@ are not needed.
At parse time, the user would request if they
want YamlNodes or native constructs.
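
As a rough Python illustration of option C (the attribute and method names here are hypothetical, not a proposed API), a single node class can carry a scalar value, map entries and list items at once:

```python
class YamlNode:
    """One node type with scalar, map and list characteristics (sketch only)."""

    def __init__(self, scalar=None, children=None, items=None):
        self.scalar = scalar                  # scalar view, or None
        self.children = dict(children or {})  # map view: key -> YamlNode
        self.items = list(items or [])        # list view: index -> YamlNode

    def __str__(self):
        # Scalar access: an absent scalar prints as the empty string.
        return "" if self.scalar is None else str(self.scalar)

    def __getitem__(self, key):
        # Simplification: integer keys address the list view,
        # anything else addresses the map view.
        if isinstance(key, int):
            return self.items[key]
        return self.children[key]


size = YamlNode(scalar="12.5", children={"accuracy": YamlNode(scalar="0.5")})
address = YamlNode(
    scalar="First address",
    items=[YamlNode(scalar="First address"), YamlNode(scalar="Additional address")],
)

print(size)              # 12.5
print(size["accuracy"])  # 0.5
print(address[1])        # Additional address
```

A native binding would hand back plain strings, dicts and lists instead, without the substitutability property.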

...

In any case, I'm going to start ploughing ahead with
the C implementation... otherwise, I feel we won't
go anywhere.

P.S.  I was wondering why the "stab" production got
added and what it purports to solve.  

Best,

Clark

On Fri, Jun 15, 2001 at 12:35:36PM -0400, Clark C . Evans wrote:
| Ok.  Two approaches:
| 
|   A. asScalar()
| 
|      + External function, don't get complexity
|        unless ask for it.
|      + No changes to YAML needed.
|      + Technique works with non-yaml maps/scalars,
|        (ones that have not been saved yet).
|      + Explicit cast...
| 
|      - Requires foresight for code to have the
|        substitutability property.
|      - Extra asScalar() function littered
|        throughout code.
|      - Requires a specific "key" value, =, to
|        be reserved.
| 
|   B. =% =@ approach
| 
|      + Does not require foresight, thus substitutability
|        is guaranteed.
|      + Does not require a specific key to be reserved.
|      
|      - Changes to YAML needed
|      - Special object needed to support construct
| 

Clark C . Evans [mailto:cc...@cl...] wrote:
> Thank you.  This is nice. Updated both SF and YAML.ORG

Thanks. BTW, the 'latest.html' file needs to be replaced by a copy of the
new draft... It now contains the 12 October version (ancient history :-)

> A few nitpicks...

I marked them all on my hard-copy, together with some other nit-picks I
found (such as a minor production bug). I'll put them in the next draft.

> 3. Throwaway comments... remove "sole purpose is to 
>    communicate among human maintainers".  As I'm sure
>    it will be (ab)used for many purposes, including
>    the unix #!/my/func trick.

How about I replace "sole purpose" with "usual purpose"? I want to really
stress the fact that YAML-wise these comments don't carry any data to the
application.

> General comments...
> 
> A. I like that the "indicators" follow the "properties",
>    it gives consistency between the quoted scalar 
>    and the block scalar -- the indicator ("|) follow the
>    descriptor (&!)

So do I, but Brian still feels uneasy about it. Brian, can you live with it?

> B. I'd still like to de-couple the "implicit syntax"
>    with the type system.  In particular, I'd like to
>    be able to specify...
> 
>    picture: !gif [base64 encoded image]
>    empty-map: !map ~
>    float: !float 0
> 
>    To enable this, we must de-couple the type system
>    from the implicit syntax mechanism.  Changes:

No. These are perfectly valid today. Well, except for the map case, and
that's because I made the decision that:

empty map: !map
looks better than: !map ~

The latter is confusing, it looks as if you are trying to create a map-typed
null, which is something which doesn't exist in any computer language I'm
aware of. An empty map, however, is a normal construct. At any rate it is a
matter of taste to some degree.

Back to the real issue, let me re-cast the current state of affairs in a
semi-formal way:

- There are three scalar styles.
- The YAML parser knows how to extract Unicode text from each, regardless
of type.
- The type property (implicit or explicit) specifies which "de-serializer"
to apply to the extracted text.
- The "de-serializer" is a function taking a text string and returning an
in-memory data object of an arbitrary in-memory data type. For all I care, a
single de-serializer could return objects of different classes, if that
makes sense.
- The "de-serializer" is free to interpret the text string as it pleases. In
particular, it can auto-detect several formats (dates are a good example).
It can perform base64 expansion (the "gif" type is a good example). And so
on.
- Implicit typing means choosing a de-serializer according to the value
matching the de-serializer's regexp pattern. Explicit typing means choosing
a de-serializer according to the de-serializer's name. Otherwise there's
absolutely no difference between them.
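
The model above can be sketched in Python as follows. The registry layout, the type names and the patterns are assumptions chosen for illustration; they are not taken from the draft.

```python
import re
from datetime import date

# Each entry: type name -> (implicit regexp pattern, de-serializer function).
# Order matters for implicit typing: the first matching pattern wins.
DESERIALIZERS = {
    "int": (re.compile(r"[-+]?[0-9]+"), int),
    "float": (re.compile(r"[-+]?[0-9]+\.[0-9]+"), float),
    "date": (re.compile(r"\d{4}-\d{2}-\d{2}"), date.fromisoformat),
    "str": (re.compile(r".*", re.DOTALL), str),
}


def deserialize(text, type_name=None):
    """Turn already-extracted text into an in-memory object."""
    if type_name is not None:
        # Explicit typing: choose the de-serializer by name.
        return DESERIALIZERS[type_name][1](text)
    # Implicit typing: choose the de-serializer whose pattern matches.
    for pattern, func in DESERIALIZERS.values():
        if pattern.fullmatch(text):
            return func(text)
    raise ValueError("no de-serializer matches %r" % text)


print(deserialize("12"))           # implicit int
print(deserialize("0", "float"))   # explicit, as in "float: !float 0"
print(deserialize("2001-11-05"))   # implicit date
```

Note that `deserialize` only ever sees text the parser has already extracted, so it has no say in line folding or escaping.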

Consequences:

- A de-serializer can't change the way text is extracted from the file. That
is, one can't create an implicit type behaving like a block.
- Therefore it is possible to safely line-wrap, pretty-print etc. any
unquoted scalar, without worrying about types.
- Also, if a certain type looks best using a block style (e.g., a code
type), then it must be an explicit type:

code: !!Perl-script |
	#!/bin/perl
	print "Hello, world!\n"

- One can't do it with an implicit type with the pattern '#!/bin/perl.*':

code: \
	#!/bin/perl
	print "Hello, world!\n"

So, one down side (no block-like implicit types) and one up-side (one
trivially knows when it is safe to line-wrap values, regardless of types).
Not a bad compromise overall, I think.

Other up-sides: we can define the unquoted type to be slightly magical so it
is safe in keys (disallowing ':' there) while it is still general (allowing
':' in values), etc.

Now, your alternative (if I understand it correctly):

- The parser does not know how to extract text from scalars. A syntax module
does it for him.
- Every syntax module has an independent line-folding, escaping etc. policy.
- Every syntax module has a default de-serializer (e.g. a base64 syntax
module has a default byte-array de-serializer).
- A de-serializer works much as above (turns a text string into an in-memory
data type).
- A de-serializer still needs to handle multiple formats (e.g., dates).
- A syntax module is chosen only according to the regexp pattern the scalar
matches.
- Implicit typing means using the syntax module's default de-serializer.
- Explicit typing means using the named de-serializer.

Correct?

Consequences: The opposite of the above. Since one can create syntax modules
which behave any way they please, then one must be very careful applying
line-wrapping etc. to scalar values. Generic YAML editors, pretty-printers
etc. will have to be syntax-module-aware to work safely. On the other hand,
this would allow us to define stronger implicit types:

html: <html><body><pre>
	Note that YAML will preserve
	line breaks and spaces here
	</pre></body></html>
but:	Not
	here!

A different tradeoff, certainly.

Other down-side: Defining unquoted scalars in a way that makes them safe in
keys becomes a challenge. It is not enough to specify for each syntax module
whether it is safe. Take the following one:

regexp: <alpha> .*
policy: line-folded, no escaping.
safe-in-keys: ???

It seems each syntax module may be invoked in two "modes", one for map keys
and one for values. It is the module's responsibility to detect the end of
the key in the first case, the parser's job to inform it of the end of the
value in the second.

Other consequence: This would make it possible to create whole new syntax
forms and embed them inside a YAML file. Take tables for example:

table: =table ('=table.*' is the regexp pattern for tables)
	A1 | A2 (line-break ends row)
	B1 | B2
	...

I'm not certain we want to encourage that. Such complex types are best
represented as YAML structures (to allow accessing sub-parts using YAML-path
etc.):

table:
	-
		- A1
		- A2
	-
		- B1
		- B2

Otherwise you'd quickly get thin YAML wrappers for very complex
application-specific types, and lose most of the advantages YAML gives you
in terms of interoperability, generic addressing of data items, etc.

To summarize, I see some merit in the concept, but I think the tradeoff is
against it. I consider type-independent line folding a big advantage of YAML
and I wouldn't want to give it up unless there's a significant gain, and I
don't see that gain yet.

Can you give some examples of how the extra power provided by this approach
may be put to good use?

Additional issues:

- I'd like to introduce for implicit types the concept of a "recognition
pattern" which is separate from the full pattern. For example, for binary,
the recognition pattern is '^\[=' while the full pattern is '\[=(base64
ending with =)\]'. This way the parser won't need unbounded lookahead to
handle "streaming" implicit types like "binary" (or "vector", or "matrix",
or whatever).
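
A small Python sketch of the split between recognition and full validation. The function names are hypothetical, and the base64 alphabet used in the full pattern is an approximation of "(base64 ending with =)".

```python
import base64
import re

# Recognition pattern: needs only the first two characters of the value.
RECOGNITION = re.compile(r"\[=")
# Full pattern (assumed approximation): applied once the whole value is read.
FULL = re.compile(r"\[=([A-Za-z0-9+/]*={0,2})\]")


def is_binary(prefix):
    """Decide the implicit type from the start of the value, no lookahead."""
    return RECOGNITION.match(prefix) is not None


def parse_binary(text):
    """Validate the complete value and decode it."""
    found = FULL.fullmatch(text)
    if found is None:
        raise ValueError("recognized as binary, but not a valid binary value")
    return base64.b64decode(found.group(1))


print(is_binary("[=aGVs"))          # True, decided before the value ends
print(parse_binary("[=aGVsbG8=]"))  # b'hello'
```

Here recognition and validation are two separate steps: a value can be recognized as binary from its prefix and still fail validation later.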

Have fun,

	Oren Ben-Kiki

On Mon, Nov 05, 2001 at 09:45:23AM +0200, Oren Ben-Kiki wrote:
| Thanks. BTW, the 'latest.html' file needs to be replaced
| by a copy of the new draft...

Done.

| How about I replace "sole purpose" with "usual purpose"? 

Great.

Brian?

| > B. I'd still like to de-couple the "implicit syntax"
| >    with the type system. 

Really?  This only works if the !gif class
was some-how associated to use the [base64
decoder mechanism.  Let's focus on this
use case -- how do we represent a base64 
encoded object with an explicit class?

...

Nice summary below Oren.  Let me chew on this one for
a few days and get my thoughts together.  I have the
idea of splitting the "style" or "encoding" of a scalar
from its "type".   Note that we can 'fix' the number
of "styles" available within the specification.  I think 
the bulk of my concern is more or less over the 
productions and normalization.

| Back to the real issue, let me re-cast the current state 
| of affairs in a semi-formal way:
| 
| - There are three scalar styles.
|
| - The YAML parser knows how to extract Unicode text from 
|   each, regardless of type.
|
| - The type property (implicit or explicit) specifies which
|   "de-serializer" to apply to the extracted text.
|
| - The "de-serializer" is a function taking a text string
|   and returning an in-memory data object of an arbitrary 
|   in-memory data type. For all I care, a single de-serializer 
|   could return objects of different classes, if that makes sense.
|
| - The "de-serializer" is free to interpret the text
|   string as it pleases. In particular, it can auto-detect 
|   several formats (dates are a good example). It can perform 
|   base64 expansion (the "gif" type is a good example). And so on.
|
| - Implicit typing means choosing a de-serializer according
|   to the value matching the de-serializer's regexp pattern. 
|   Explicit typing means choosing a de-serializer according 
|   to the de-serializer's name. Otherwise there's
|   absolutely no difference between them.

I think the last point is the key.

| Consequences:
| 
| - A de-serializer can't change the way text is extracted 
|   from the file. That is, one can't create an implicit 
|   type behaving like a block.
| - Therefore it is possible to safely line-wrap, pretty-print 
|   etc. any unquoted scalar, without worrying about types.
| - Also, if a certain type looks best using a block style
|   (e.g., a code type), then it must be an explicit type:
| 
| code: !!Perl-script |
| 	#!/bin/perl
| 	print "Hello, world!\n"
| 
| - One can't do it with an implicit type with the pattern '#!/bin/perl.*':
| 
| code: \
| 	#!/bin/perl
| 	print "Hello, world!\n"
| 
| So, one down side (no block-like implicit types) and one up-side (one
| trivially knows when it is safe to line-wrap values, regardless of types).
| Not a bad compromise overall, I think.
| 
| Other up-sides: we can define the unquoted type to be slightly magical so it
| is safe in keys (disallowing ':' there) while it is still general (allowing
| ':' in values), etc.

Nice summary.

| Now, your alternative (if I understand it correctly):
| 
| - The parser does not know how to extract text from
|   scalars. A syntax module does it for him.
|
| - Every syntax module has an independent line-folding,
|   escaping etc. policy.
|
| - Every syntax module has a default de-serializer (e.g. a 
|   base64 syntax module has a default byte-array de-serializer).
|
| - A de-serializer works much as above (turns a text string 
|   into an in-memory data type).
|
| - A de-serializer still needs to handle multiple formats (
|   e.g., dates).
|
| - A syntax module is chosen only according to the regexp 
|   pattern the scalar matches.
|
| - Implicit typing means using the syntax module's default
|   de-serializer.
|
| - Explicit typing means using the named de-serializer.

This is close (my ideas aren't solid though).  I'm trying 
to separate syntax module ("transfer encoding") from 
the type ("serializer/de-serializer pair")

Right.

Hmm.

| Additional issues:
| 
| - I'd like to introduce for implicit types the concept of a "recognition
| pattern" which is separate from the full pattern. For example, for binary,
| the recognition pattern is '^\[=' while the full pattern is '\[=(base64
| ending with =)\]'. This way the parser won't need unbounded lookahead to
| handle "streaming" implicit types like "binary" (or "vector", or "matrix",
| or whatever).

I think this was my ^ proposal.  Only that I was 
doing it as another indicator (which I think is 
a clean way to do it).  

Clark

Clark C . Evans [mailto:cc...@cl...] wrote:
> | >
> | >    picture: !gif [base64 encoded image]
> |
> | These are perfectly valid today.
> 
> Really?  This only works if the !gif class
> was some-how associated to use the [base64
> decoder mechanism.  Let's focus on this
> use case -- how do we represent a base64 
> encoded object with an explicit class?

It isn't the gif class, it is the gif de-serializer. Which is welcome to
invoke the base64 routines internally.

!<thing> doesn't specify the *value class*, it specifies the *de-serializer*
class; of course there's usually a 1-1 mapping between these two... We should
add a paragraph somewhere which clarifies this distinction - the gif base64 image
is a good example case.

> ...
> 
> Nice summary below Oren.  Let me chew on this one for
> a few days and get my thoughts together.  I have the
> idea of splitting the "style" or "encoding" of a scalar
> from its "type".   Note that we can 'fix' the number
> of "styles" available within the specification.  I think 
> the bulk of my concern is more or less over the 
> productions and normalization.

Some more thoughts, and to get the terminology straight:

- There's the "class" of the final in-memory value;
- There's the "type" string following the '!';
- There's the "de-serializer" mapping between them;
- There's the node "kind" (map/list/scalar);
- There's the scalar "style" (unquoted, quoted, block);
- There's the value "format" (the way it is represented as Unicode text).

The current minimal process is, today:

(data) scalar (kind)
->
(process) parser (according to style)
->
(data) unicode text
->
(process) deserializer (chosen implicitly or by explicit type; auto-detects
the format)
->
(data) object of right class.

The '!type1!type2' proposal allows one to chain de-serializers. The
'^<thing>' proposal means explicitly specifying the format so the
de-serializer doesn't have to auto-detect it. Both can achieve the same
thing:

!gif!base64 (base64 is a de-serializer taking Unicode text and giving byte
array; gif is a de-serializer taking byte array giving gif image). 

!gif ^base64 (gif is a de-serializer taking Unicode text giving gif image;
base64 is a possible format this de-serializer can handle).

The difference is subtle for the above case; the !<type1>!<type2> proposal
allows for arbitrary multi-stage processing (e.g., !ocr!gif!base64, as a
really wasteful way to transfer text), while the '^<format>' is restricted
to only two levels. So *if* we decide this is important, I'd rather go for
the pipeline proposal.
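
The pipeline reading of '!gif!base64' can be sketched in Python like this. The stage registry, the `GifImage` class and the GIF header check are assumptions for illustration only.

```python
import base64
from dataclasses import dataclass


@dataclass
class GifImage:
    data: bytes


def from_base64(text):
    # Stage taking Unicode text and giving a byte array.
    return base64.b64decode(text)


def to_gif(data):
    # Stage taking a byte array and giving a gif image.
    if data[:6] not in (b"GIF87a", b"GIF89a"):
        raise ValueError("not a GIF byte stream")
    return GifImage(data)


STAGES = {"base64": from_base64, "gif": to_gif}


def run_pipeline(type_string, text):
    """Apply the stages of '!type1!type2' right to left."""
    names = [name for name in type_string.split("!") if name]
    value = text
    for name in reversed(names):
        value = STAGES[name](value)
    return value


payload = base64.b64encode(b"GIF89a\x00\x00").decode("ascii")
image = run_pipeline("!gif!base64", payload)
print(type(image).__name__, len(image.data))
```

Adding a third stage, as in '!ocr!gif!base64', would only need another registry entry, which is the flexibility the '^<format>' form lacks.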

> | Back to the real issue, let me re-cast the current state 
> | of affairs in a semi-formal way:
> ...  I'm trying 
> to separate syntax module ("transfer encoding") from 
> the type ("serializer/de-serializer pair")

I think that the best way to do that is using pipelined '!<type1>!<type2>'
types. The way it works is as follows:

!gif!base64 [base64 encoded image, gif de-serializer is given the binary
data from the base64 one]

!gif! (implicitly typed value, resulting type must be acceptable to the gif
de-serializer - e.g., an implicit base64 value will result in the same
behavior as above)

That gives you everything you need. However, I don't think we should add it
in YAML 1.0 (no proven need for this). Instead let's just allow for it in
the future. The current wording reserves all names not starting with a DNS
entry; let's make it stronger and reserve everything not looking like a DNS
entry "Regexp: ( (alnum or '-')+ '.' )+ ( alnum or '-' )+". This means all
names containing '!' or other magical characters would be reserved.
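
For concreteness, the reservation rule could be checked like this in Python. The sketch reads "alnum" as ASCII letters and digits and applies the regexp to the whole name; both are assumptions.

```python
import re

# ( (alnum or '-')+ '.' )+ ( alnum or '-' )+
DNS_LIKE = re.compile(r"(?:[A-Za-z0-9-]+\.)+[A-Za-z0-9-]+")


def is_reserved(name):
    """A name is reserved unless it looks like a DNS entry."""
    return DNS_LIKE.fullmatch(name) is None


for name in ("yaml.org", "gif!base64", "gif"):
    print(name, is_reserved(name))
```

Read literally, the regexp needs at least one '.', so a bare name such as 'gif' is also reserved under this sketch.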

This way we could add pipelined types later on if they are useful. Or any
other scheme we feel is necessary. Note this would be backward compatible. A
YAML 1.0 application would just have to be given the enumeration of all
possible combinations instead of creating a pipeline on the fly like a YAML
1.1 application would.

> | Can you give some examples of how the extra power provided 
> | by this approach may be put to good use?
> 
> Hmm.

I think this is the core of the issue. That's why I want to leave the door
open for it, but not add it into the current spec. Reserving all type names
with "magical characters" seems to do this neatly.

> | Additional issues:
> | 
> | - I'd like to introduce for implicit types the concept of a 
> | "recognition
> | pattern"...
> 
> I think this was my ^ proposal.

Yes; ^format removes the need for implicit detection. But it is useful to be
able to write

int: 12

So a "recognition pattern" is still needed. The question is whether
recognition == validation; I suggest it doesn't have to.

Have fun,

	Oren Ben-Kiki

Oren,

Thanks for working through this with me.  I like
the pipeline proposal, and agree we can put it
off to YAML 1.1  if we find use cases which
support the added complexity.

Clark

...

My proposal was trying to separate this distinction
and make ^ for the "serializer" class and ! for the
value class.  Perhaps this distinction isn't good
at all... since a serializer will create an object
of a given value class.

| !gif!base64 (base64 is a de-serializer taking Unicode
| text and giving byte array; gif is a de-serializer taking
| byte array giving gif image). 
| 
| !gif ^base64 (gif is a de-serializer taking Unicode text 
| giving gif image; base64 is a possible format this 
| de-serializer can handle).
| 
| The difference is subtle for the above case; the 
| !<type1>!<type2> proposal allows for arbitrary multi-stage 
| processing (e.g., !ocr!gif!base64, as a really wasteful 
| way to transfer text), while the '^<format>' is restricted
| to only two levels. So *if* we decide this is important, 
| I'd rather go for the pipeline proposal.

Right.

| !gif! (implicitly typed value, resulting type must be
| acceptable to the gif de-serializer - e.g., an implicit
| base64 value will result in the same behavior as above)

Nice.

| That gives you everything you need. However, I don't think
| we should add it in YAML 1.0 (no proven need for this). Instead 
| let's just allow for it in the future. The current wording 
| reserves all names not starting with a DNS entry; let's make 
| it stronger and reserve everything not looking like a DNS
| entry "Regexp: ( (alnum or '-')+ '.' )+ ( alnum or '-' )+". 
| This means all names containing '!' or other magical characters
| would be reserved.

Ok.  The private area !!private should exclude "private" 
from containing a ! as well, to allow for chaining.

| This way we could add pipelined types later on if they are
| useful. Or any other scheme we feel is necessary. Note this 
| would be backward compatible. A YAML 1.0 application would 
| just have to be given the enumeration of all possible 
| combinations instead of creating a pipeline on the fly like
| a YAML 1.1 application would.

Sold.

Clark
