Clark C . Evans [mailto:cce@...] wrote:
> Thank you. This is nice. Updated both SF and YAML.ORG
Thanks. BTW, the 'latest.html' file needs to be replaced by a copy of the
new draft... It now contains the 12 October version (ancient history :-)
> A few nitpicks...
I marked them all on my hard-copy, together with some other nit-picks I
found (such as a minor production bug). I'll put them in the next draft.
> 3. Throwaway comments... remove "sole purpose is to
> communicate among human maintainers". As I'm sure
> it will be (ab)used for many purposes, including
> the unix #!/my/func trick.
How about I replace "sole purpose" with "usual purpose"? I want to really
stress the fact that YAML-wise these comments don't carry any data to the
> General comments...
> A. I like that the "indicators" follow the "properties",
> it gives consistency between the quoted scalar
> and the block scalar -- the indicator ("|) follow the
> descriptor (&!)
So do I, but Brian still feels uneasy about it. Brian, can you live with it?
> B. I'd still like to de-couple the "implicit syntax"
> with the type system. In particular, I'd like to
> be able to specify...
> picture: !gif [base64 encoded image]
> empty-map: !map ~
> float: !float 0
> To enable this, we must de-couple the type system
> from the implicit syntax mechanism. Changes:
No. These are perfectly valid today. Well, except for the map case, and
that's because I made the decision that:
empty map: !map
looks better than: !map ~
The latter is confusing, it looks as if you are trying to create a map-typed
null, which is something which doesn't exist in any computer language I'm
aware of. An empty map, however, is a normal construct. At any rate it is a
matter of taste to some degree.
Back to the real issue, let me re-cast the current state of affairs in a
- There are three scalar styles.
- The YAML parser knows how to extract Unicode text from each, regardless
- The type property (implicit or explicit) specifies which "de-serializer"
to apply to the extracted text.
- The "de-serializer" is a function taking a text string and returning an
in-memory data object of an arbitrary in-memory data type. For all I care, a
single de-serializer could return objects of different classes, if that
- The "de-serializer" is free to interpret the text string as it pleases. In
particular, it can auto-detect several formats (dates are a good example).
It can perform base64 expansion (the "gif" type is a good example). And so
- Implicit typing means choosing a de-serializer according to the value
matching the de-serializer's regexp pattern. Explicit typing means choosing
a de-serializer according to the de-serializer's name. Otherwise there's
absolutely no difference between them.
- A de-serializer can't change the way text is extracted from the file. That
is, one can't create an implicit type behaving like a block.
- Therefore it is possible to safely line-wrap, pretty-print etc. any
unquoted scalar, without worrying about types.
- Also, if a certain type looks best using a block style (e.g., a code
type), then it must be an explicit type:
code: !!Perl-script |
print "Hello, world!\n"
- One can't do it with an implicit type with the pattern '#!/bin/perl.*':
print "Hello, world!\n"
So, one down side (no block-like implicit types) and one up-side (one
trivially knows when it is safe to line-wrap values, regardless of types).
Not a bad compromise overall, I think.
Other up-sides: we can define the unquoted type to be slightly magical so it
is safe in keys (disallowing ':' there) while it is still general (allowing
':' in values), etc.
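
To make the current model concrete, here is a minimal Python sketch of it. Everything in it is an assumption made for illustration: the registry, the names `fold` and `load_scalar`, the regexp patterns, and the much-simplified folding rule are not taken from the draft. The point is only that text extraction is the same for every type, and that implicit and explicit typing differ solely in how the de-serializer is chosen.

```python
import re
from datetime import date

# Hypothetical registry: type name -> (implicit regexp pattern or None, de-serializer).
DESERIALIZERS = {
    "int": (re.compile(r"[-+]?\d+"), int),
    "float": (re.compile(r"[-+]?\d+\.\d*"), float),
    "date": (re.compile(r"\d{4}-\d{2}-\d{2}"), date.fromisoformat),
    "str": (None, str),  # no pattern, so it is only reachable explicitly
}


def fold(raw):
    """Type-independent extraction (simplified): fold line breaks and
    runs of white space into single spaces."""
    return " ".join(raw.split())


def load_scalar(raw, explicit_type=None):
    """Extract the text first, then pick a de-serializer by name (explicit
    typing) or by matching regexp pattern (implicit typing)."""
    text = fold(raw)
    if explicit_type is not None:
        _, deserialize = DESERIALIZERS[explicit_type]
        return deserialize(text)
    for pattern, deserialize in DESERIALIZERS.values():
        if pattern is not None and pattern.fullmatch(text):
            return deserialize(text)
    return text  # nothing matched: leave it as a plain string


print(load_scalar("0"))            # implicit: matches the int pattern
print(load_scalar("0", "float"))   # explicit: like "float: !float 0"
print(load_scalar("hello\n  world"))
```

Because `fold` runs before any de-serializer is consulted, a generic tool can re-wrap an unquoted value without knowing its type, which is the up-side described above.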
Now, your alternative (if I understand it correctly):
- The parser does not know how to extract text from scalars. A syntax module
does it for him.
- Every syntax module has an independent line-folding, escaping etc. policy.
- Every syntax module has a default de-serializer (e.g. a base64 syntax
module has a default byte-array de-serializer).
- A de-serializer works much as above (turns a text string into an in-memory
- A de-serializer still needs to handle multiple formats (e.g., dates).
- A syntax module is chosen only according to the regexp pattern the scalar
- Implicit typing means using the syntax module's default de-serializer.
- Explicit typing means using the named de-serializer.
Consequences: The opposite of the above. Since one can create syntax modules
which behave any way they please, then one must be very careful applying
line-wrapping etc. to scalar values. Generic YAML editors, pretty-printers
etc. will have to be syntax-module-aware to work safely. On the other hand,
this would allow us to define stronger implicit types:
Note that YAML will preserve
line breaks and spaces here
A different tradeoff, certainly.
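
For comparison, a rough Python sketch of the alternative as I understand it. Again, all names here (`SyntaxModule`, `MODULES`, `naive_rewrap`, the '#!' pattern) are hypothetical and only serve the illustration. Each syntax module carries its own extraction policy, so a generic re-wrapping tool that does not know the modules can silently change a value.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class SyntaxModule:
    pattern: re.Pattern                  # chosen by matching the start of the scalar
    extract: Callable[[str], str]        # module-specific folding/escaping policy
    default_deserializer: Callable[[str], object]


def fold(raw):
    """Line-folding policy: collapse line breaks and white space."""
    return " ".join(raw.split())


def keep_lines(raw):
    """Block-like policy: preserve line breaks and spaces."""
    return raw


# Hypothetical modules; the first matching pattern wins.
MODULES = [
    SyntaxModule(re.compile(r"#!"), keep_lines, str),  # block-like implicit type
    SyntaxModule(re.compile(r""), fold, str),          # catch-all plain text
]

# Explicit typing overrides the module's default de-serializer by name.
NAMED_DESERIALIZERS = {"str": str, "int": int}


def load_scalar(raw, explicit_type=None):
    module = next(m for m in MODULES if m.pattern.match(raw))
    text = module.extract(raw)
    if explicit_type is not None:
        return NAMED_DESERIALIZERS[explicit_type](text)
    return module.default_deserializer(text)


def naive_rewrap(raw):
    """What a module-unaware pretty-printer might do to any scalar."""
    return fold(raw)


script = '#!/bin/perl\nprint "Hello, world!\\n"'
plain = "some  plain\n text"
print(load_scalar(naive_rewrap(plain)) == load_scalar(plain))    # True
print(load_scalar(naive_rewrap(script)) == load_scalar(script))  # False
```

The last line is the consequence in question: re-wrapping is harmless for the folded module but destroys the line break in the '#!' value, so the tool would have to be syntax-module-aware.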
Other down-side: Defining unquoted scalars in a way that makes them safe in
keys becomes a challenge. It is not enough to specify for each syntax module
whether it is safe. Take the following one:
regexp: <alpha> .*
policy: line-folded, no escaping.
It seems each syntax module may be invoked in two "modes", one for map keys
and one for values. It is the module's responsibility to detect the end of
the key in the first case, the parser's job to inform it of the end of the
value in the second.
Other consequence: This would make it possible to create whole new syntax
forms and embed them inside a YAML file. Take tables for example:
table: =table ('=table.*' is the regexp pattern for tables)
A1 | A2 (line-break ends row)
B1 | B2
I'm not certain we want to encourage that. Such complex types are best
represented as YAML structures (to allow accessing sub-parts using YAML-path
Otherwise you'd quickly get thin YAML wrappers for very complex
application-specific types, and lose most of the advantages YAML gives you
in terms of interoperability, generic addressing of data items, etc.
To summarize, I see some merit in the concept, but I think the tradeoff is
against it. I consider type-independent line folding a big advantage of YAML
and I wouldn't want to give it up unless there's a significant gain, and I
don't see that gain yet.
Can you give some examples of how the extra power provided by this approach
may be put to good use?
- I'd like to introduce for implicit types the concept of a "recognition
pattern" which is separate from the full pattern. For example, for binary,
the recognition pattern is '^\[=' while the full pattern is '\[=(base64
ending with =)\]'. This way the parser won't need unbounded lookahead to
handle "streaming" implicit types like "binary" (or "vector", or "matrix",