Re: [Yaml-core] Trying to make sense of things...

On Mon, Sep 09, 2002 at 11:58:55AM +0300, Oren Ben-Kiki wrote:
| The time zone differences, combined with the fact that we are discussing so
| many issues at once, threaten to overflow my capacity. I'll try to summarize
| and please correct me if I'm wrong or I missed anything.
| 
| We agreed on the following points:
| 
| - #TAB goes.
| - Top level scalars should be indented by at least one space.

I guess so.  I don't think it is necessary, but I don't 
feel strongly about it.

| - Top level maps should not. I'm worried about Brian's example of:
| ---
| Key : value
| --- : value
| ...
| 
| We need to decide whether that's an error or a valid key. But that's really
| minor.

It should be an error.  One could quote the --- thingy or indent
all of the keys by one space.  Really, I don't see the need to indent
top level scalars unless one line of the scalar starts with ---,
and that's a rather rare case.  I think the quoting of --- will solve
the key issue.

| - No \ escaping in transfer methods.
| - Various wording fixes proposed by Brian.

And probably more to follow.  ;)

| - The viewer/graph/generic model issue. This seems to have been settled to
| the satisfaction of all parties (rename generic to graph, and use a
| generic/native distinction for its representation). I think this is a good
| clarification.

Great.

| We have the following open issues:
| 
| - Key order. I agree with Clark that this should not be a (mandatory) part
| of the graph model. The shorthand notation allows a clean and easy way to
| serialize ordered hash tables. Could someone please explain why using this
| syntax for PHP ordered maps etc. is not a satisfactory solution?
| 
| Since the ruby people came up with !ruby/pairs for what seems to be exactly
| the same reason, would you (PHP/ruby/etc. people) push for having a "!dict"
| type family that is a language independent ordered map (similar to a
| real-world dictionary) that uses this shorthand syntax? It doesn't even have
| to be in the core spec, we could add it later.

I think the requirement for "ordered keys" will dissipate once
we have a serious schema language.

| - Whether to use collections or map/seq in the graph model. I'd like to
| preserve the current state of affairs; it makes things much simpler when
| moving between models (otherwise we'd have to deal with the question of !seq
| {} and !map [] again). Using a "collection" model would cause a mismatch
| between the graph and serial models. This would in turn cause a problem when
| trying to define and implement tools that work on these models. I know Clark
| is having second thoughts here... But we had a long discussion about this
| already, and I think we made the right choice. Let's stick with it.

Ok.  I'll drop it.  ;)

| = The tutorial section. We had positive feedback on how it helps newbies to
| ease into the spec rather than bombarding them with definitions up front.
| I'd keep it.
| 
| = Separating the information model to a separate spec so we can freeze the
| syntax document. I might have considered agreeing except that our decisions
| about the model affect the syntax (e.g., the latest #IMPLICIT proposal). It
| so happens that I believe the current syntax is fine as it is, but I still
| have to convince people (Clark in particular) that it is the best model as
| of itself, not because it is too late to change the syntax.
| 
| = I agree with Clark that if the info model section is obscure we should
| strive to simplify it rather than omit it. This is hard; it seems both Clark
| and myself failed to write it in a clear/concise enough manner. Any
| proposals for improvements here are welcome.

Agreed.  I feel very strongly that the model should remain in the
specification, mostly because it provides (or should provide) a clear
vocabulary for talking about YAML and for making assertions about
implementation compliance as far as "intent" is concerned.

| - The implicit typing issue. I'll name the two competing proposals the DWIM
| proposal (optional transfer methods) and the #IMPLICIT proposal. I'm
| completely confused about people's opinions about this - for example, I
| could have sworn that at this point Clark is dead set against DWIM, except
| that he mentioned in some post that it may be best after all. At any rate...
| 
| My problem is that I don't understand most of Clark's objections to the DWIM
| proposal. For example, he made the point that optional transfer method is
| completely equivalent to "no transfer method". I have a hard time following
| this; how does the fact that the loader has some default way to transfer
| nodes imply that there should be no way to override this default?

I'm not sure what the DWIM proposal is.  I'll re-read the last post. 

| Clark has described the #IMPLICIT proposal in his latest IRC summary post. I
| think I have described the DWIM approach pretty well in the end of my last
| post, so I won't repeat it here. It has the advantages that:
| 
| - It is a very minor change to the info model section (making transfer
| method optional at the graph model)

I'm not certain that this is needed (or even good).

| - Its only syntax change is relaxing some restrictions (we can start
| strings with $ etc. and we can get rid of the () around the Boolean English
| words, increasing readability by being more DWIM).
| 
| By comparison, the #IMPLICIT proposal causes a change to the syntax (most
| existing YAML documents would become illegal), and makes it less readable
| (required DTD-like directives and surrounding integers by () - Ugh!).

From what I get from the DWIM proposal, it means that there is no
way (other than through external knowledge) to tell whether a given
node is supposed to be an integer or not.  This removes the
"self-describing" property of YAML and I think hinders the usefulness
of types.
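
A minimal Python sketch of that concern follows.  The DwimLoader class
and its register() method are hypothetical, not taken from any existing
YAML implementation; the only point is that the same text yields
different types depending on registrations that live outside the
document.

```python
import re


class DwimLoader:
    """Hypothetical loader: types come only from locally registered regexes."""

    def __init__(self):
        self._rules = []  # list of (compiled regex, constructor) pairs

    def register(self, pattern, constructor):
        self._rules.append((re.compile(pattern), constructor))

    def load_scalar(self, text):
        for regex, constructor in self._rules:
            if regex.fullmatch(text):
                return constructor(text)
        return text  # nothing registered for this text: it stays a string


app_a = DwimLoader()
app_a.register(r"[-+]?[0-9]+", int)  # application A treats digits as integers

app_b = DwimLoader()                 # application B registered nothing

value_a = app_a.load_scalar("42")    # an int in application A
value_b = app_b.load_scalar("42")    # a str in application B
```

Nothing in the document itself says which of the two readings was
intended.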

| So. Please show that this new syntax is absolutely necessary. By this I
| mean:
| 
| - Please describe a scenario (program A does this, tool B does that, program
| C reads the result, and so on) where the DWIM proposal leads to trouble and
| the #IMPLICIT proposal does not.

The #IMPLICIT proposal puts common type information into a YAML file
in a way that is independent of application semantics.  Thus, I could
pop the file into a YAML Query, give it a query and it'd know how to 
load the file and operate on it.  

With DWIM, the user would have to "register" or somehow tell the
toolset which nodes have which types; this will involve N mechanisms
and will probably have to turn into a schema to be really useful
(which is OK; I think a schema should be able to do this without
the #IMPLICIT thingy).  The problem is we don't have a schema yet
and probably won't for another 6 to 18 months.

Further, the default #IMPLICIT could be our current set of 
implicits... requiring #IMPLICIT:OFF to get to the new behavior
where everything is a string.
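
As a rough illustration of that default, here is a small Python sketch.
The two patterns are stand-ins I made up for "our current set of
implicits" (the real set differs), and the `implicit` parameter is just
a hypothetical way to model a document that declares #IMPLICIT:OFF.

```python
import re

# Hypothetical stand-in for the current implicit set; real patterns differ.
DEFAULT_IMPLICITS = [
    (re.compile(r"[-+]?[0-9]+"), int),
    (re.compile(r"[-+]?[0-9]+\.[0-9]+"), float),
]


def load_scalar(text, implicit=True):
    """Resolve a plain scalar; implicit=False models #IMPLICIT:OFF."""
    if implicit:
        for regex, constructor in DEFAULT_IMPLICITS:
            if regex.fullmatch(text):
                return constructor(text)
    return text  # with implicits off, everything is a string
```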

| - Alternatively, please show how this proposal limits the power of
| graph-level YAML tools - that is, what operation would be possible in
| YPATH/CYATL/etc. under the #IMPLICIT proposal that would not be possible
| under my DWIM proposal. I think that in both proposals the implementation
| must provide a way to specify at run-time regexps _and_ comparison
| operators; the only difference is whether the parser or the loader is the
| module that uses these specifications.

Right, and if the loader is responsible, then generic tools
written against the SERIAL model cannot take advantage of
type information.  IMHO, this kinda sucks.

Also, I'm not sure what impact the typed vs not-typed flag will
have in the graph model.  I'm ok with it in the syntax model.
Handling NULL cases complicates things, and I'm not sure that
the benefits outweigh the consequences.  For example, YPATH
would have to add a COALESCE (IF_NULL) function to allow types
to be compared.  I'm not saying that it's bad, just that I don't
know the consequences/impacts and haven't had time to study them.
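
For readers unfamiliar with COALESCE, here is a small Python sketch of
the kind of helper meant.  It borrows the SQL meaning (first non-NULL
argument); how YPATH would actually spell or define it is not decided
anywhere in this thread, and the dict-based nodes and the "str" default
are assumptions made for illustration.

```python
def coalesce(*values):
    """Return the first argument that is not None, as SQL COALESCE does."""
    for value in values:
        if value is not None:
            return value
    return None


def same_type(node_a, node_b, default="str"):
    # With optional types, a missing type has to be defaulted
    # before two nodes can be compared by type.
    type_a = coalesce(node_a.get("type"), default)
    type_b = coalesce(node_b.get("type"), default)
    return type_a == type_b
```

Every type comparison in a query would need this extra step once the
type is allowed to be NULL.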

| - YAML documents are very readable by humans.
| 
| The DWIM proposal is more readable than the #IMPLICIT one (no #IMPLICITs, no
| () around integers, dates, Booleans, etc.).

Well, the () and #IMPLICIT are quite complementary, you
probably don't need both in a single document.

| - YAML interacts well with scripting languages.
| 
| I think the #IMPLICIT proposal requires, or at least encourages, that
| YAML-specific types be used to represent implicit types while the DWIM
| proposal allows most cases to be represented as normal strings, which is the
| natural strategy of most scripting languages (Perl/TCL/Korn
| Shell/JavaScript/etc.). I could be wrong here.

How am I going to save/restore an integer without jumping
through a ton of hoops?  I'd have to learn how to register the
implicits with both the Perl and Python parsers... each perhaps
with a slightly different way to do it.  Further, if my app
is distributed, I'll have to somehow explain to people using
my data files that X is an integer, but Y isn't a such-and-such.
I think #IMPLICIT and () do this very nicely without having
to have a third document.

Overall, I think the DWIM mechanism, to be really usable in
anything other than a quick one-off, will have to migrate into
a full-blown schema.  I support this, but we still need something
for the short term.

| - YAML uses host languages' native data structures.
| 
| I don't see a difference between the two proposals here.

Well, in the DWIM case, it goes from regex->native; but in
the other case, it first goes through a type-uri.  I think
that the intermediate type-uri is a very valuable step in that it
provides a handle for us to talk about similar types across
language boundaries.  With DWIM, you don't need type-uri cuz
it's all registered directly.  This may seem simpler, but it
hinders the ability to formalize abstract types and provide
for consistent bindings across language boundaries.
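
The two-step path can be sketched in Python as below.  The tag URIs,
the patterns, and the table names are all illustrative assumptions, not
part of any proposal text; what matters is that step 1 needs no native
types at all, and step 2 is the only part that differs per language.

```python
import re

# Step 1: regex -> type-uri.  Language independent.  URIs are made up.
IMPLICIT_TYPES = [
    (re.compile(r"[-+]?[0-9]+"), "tag:example.org,2002:int"),
    (re.compile(r"true|false"), "tag:example.org,2002:bool"),
]
STRING_URI = "tag:example.org,2002:str"


def resolve(text):
    """Return the type-uri for a plain scalar, defaulting to string."""
    for regex, uri in IMPLICIT_TYPES:
        if regex.fullmatch(text):
            return uri
    return STRING_URI


# Step 2: type-uri -> native.  Each language supplies its own table.
PYTHON_BINDINGS = {
    "tag:example.org,2002:int": int,
    "tag:example.org,2002:bool": lambda s: s == "true",
    STRING_URI: str,
}


def construct(text):
    """Return (type-uri, native value) for a plain scalar."""
    uri = resolve(text)
    return uri, PYTHON_BINDINGS[uri](text)
```

A generic tool working on the serial model could stop after resolve()
and still know the type of every node; a Perl binding would replace
only the second table.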

| - YAML has a consistent information model.
| 
| I don't think the DWIM proposal has an inconsistent information model. I'll
| go so far as agreeing its information model is weaker (more relaxed). So
| while both proposals satisfy this goal, #IMPLICIT is more true to its
| spirit.
| 
| - YAML enables stream-based processing.
| 
| Same for both proposals.

Not really.  In the DWIM proposal, I can't have the string+type-uri
combination to process with.  And as such, a mode of processing
(which would primarily be stream-based) would be curtailed.

| - YAML is expressive and extensible.
| 
| The DWIM proposal is more expressive than the #IMPLICIT proposal (there are
| fewer restrictions on implicit types). Both are extensible, I guess.
| 
| - YAML is easy to implement.
| 
| I have no idea which is easier, I doubt there's a big difference.
| 
| To summarize, it seems that the main difference here is that DWIM is more
| readable, which is our #1 goal for YAML. In the secondary goals, there's a
| different tradeoff, with no clear-cut winner; goal #2 for DWIM(?), goal #4
| for #IMPLICIT, goal #6 for DWIM.

DWIM       Good for custom types which are application specific
#IMPLICIT  Good for standard types with well-known language bindings
Schema     Good for custom types (DWIM-ish) which are problem domain specific
()         Good for simple one-offs using common well-known types, 
           like (true) or (2002-01-01) without requiring #IMPLICIT header
           or DWIM registration.

In our conference we were OK with Steve's registration system; this
is similar to DWIM I think... right?   Only that the DWIM happens at
the loader level (with the quoted flag available), leaving the model
to still have a mandatory type... Steve uses String rather than having
a NULL type.  In short, by keeping the type mandatory, DWIM and #IMPLICIT
can co-exist (which is what we agreed to last night).
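
A loose Python sketch of loader-level registration with a mandatory
type is below.  This is not Steve's actual code; the REGISTRY contents
and the rule that a quoted scalar bypasses the registry are assumptions
about how the quoted flag would be used.

```python
import re

# Hypothetical loader-level registration: pattern -> type name.
REGISTRY = [(re.compile(r"[-+]?[0-9]+"), "int")]


def assign_type(text, quoted):
    """Give every node a type; quoted scalars skip the registry."""
    if not quoted:
        for regex, type_name in REGISTRY:
            if regex.fullmatch(text):
                return type_name
    return "str"  # default type, so the model never sees a NULL type
```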

Speaking of #IMPLICIT.  We could call this #SCHEMA, and until we have
a formalized schema, use the various IMPLICIT groups (generic, math,
business) as very, very simple schemas.  Then, when schema comes along,
these would be upgraded into full-blown schemas which only assign types
based on regex and don't limit structure.  I kinda like this approach
(it leaves us with one less mechanism in the long run).  So, you can
think of our initial set of allowable SCHEMA values as "bootstrap"
values with hard-coded meaning.  We can then fix this later in a
backward-compatible manner.

Best,

Clark