From: Oren Ben-K. <or...@ri...> - 2001-08-03 11:01:30
Hi Guys, First, the listmail server is going berserk. And the web server doesn't let me post for some reason. And this is my weekend, so I'm working from home, and my dialup office server has its own strange notions about this. So I'm going to be out of it for the next few days. (Brian, it seems us sharing views has caused us to share connectivity problems. A conspiracy by gremlins, no doubt :-) At any rate, I like your indicator-less syntax. I'll have to consider the details, though. Using '=' as an indicator for simple scalar values is nice. We'll have to require it for multi-line quoted strings as well, to allow a map whose first key is quoted. As to the '---' separator issue. I think that Clark got it right in suggesting the top level production would be either a simple map (just one) or a simple list (just one). Log files etc. will just use the list production: : % first map... : % second map... For that matter, we could also allow a scalar as the top level value, now that we have the indicator to disambiguate it. In fact, we could use the exact same auto-detection algorithm for the whole file as we'll use for a value within the file: File/value starts with '@' or ':' -> list. File/value starts with '=' or '|' -> scalar. File value starts with '%' or anything else -> map. (Note that a file/value starting with '"' would be taken to mean a map whose first key is quoted). Thus, Brian's and Clark's proposals are really just two sides of the same coin. Neat! As to the data model issue. Thanks Joe for writing it up. We'll have to modify the "info model" section of the spec on similar lines. As for the details... I'm going to play the "purist" here. What I want is for the *core* info model to be a *tree* of map, list, null and text scalar. And that's all. No integers, reals, types, classes, binaries, id-based-references, key-based-references, YPath-based-references, URL-based-references, or anything else, no matter how vital it seems at first glance. 
This info model is immensely useful (e.g., conf files, log files, database record dumps, simple messaging). Yes, it isn't enough for some applications (especially native data serialization). The answer is not to add a selection of the above to the core. Not even one. Especially not "types", because that would kill the 1-1 load/save API into native data structures.

Instead, we'll ensure the YAML APIs will allow for a *graph* whose nodes include "native data object" as a valid node. Note that this node may be a non-leaf node, references and cycles are allowed, etc.

How is it possible to reconcile these two opposing statements? The answer is (Star Wars music) "Trust the Color, Luke!". The color idiom is built on two concepts. One, the ability to put on "colored glasses" and see only keys of a certain "color". Two, the ability to easily and efficiently construct an application by layering different modules. These two reinforce each other. (Think of each "color" as a regexp pattern. If the key matches the pattern, it has this color. In XML, for example, a color would typically be 'some-namespace-prefix:.*' - let's not delve any deeper there :-).

The core YAML layer (parser/printer) is as trivial as you can get - just handling a tree of map/list/string.

A different layer would handle mapping from this core data model to native objects. This layer could use specially colored keys to control this mapping: which native type to use, how to parse the value, how to identify references, etc. It could also use patterns on the values ("this looks like an integer"). It could use schema information ('delivery' is expected to be of type 'date'). It could use some combination of the above.

Since this layer is not part of the core, I don't care much. So if Clark's application needs the concept of reference-by-key, Joe's needs reference-by-path, and Brian's needs reference-by-anchor, there's no problem involved. We can all do it without messing up our simple, clean YAML core.
In fact, it may be possible to mix and match their implementations in one application, as long as they are well behaved.

How would Brian write YAML.pm, then? He needs references and classes. Would his implementation be "Brian's YAML variant", then? Well, no. First, he'll have to separate it into a different layer somehow (at minimum make it optional). Second, to make it interoperable with other implementations, I suggest we define a dictionary of a language-neutral set of colored keys and values, one which would allow a simple way of interchanging most useful data types between multiple languages. This would include a basic way to handle references (what we have today), and a set of common basic types. Make this an accompanying spec to the core spec itself (YAML-DATA-1.0?). Or bundle it with the core spec (as an appendix?), as long as it is "sufficiently separate" so we can create a new version for it without touching the core itself. We will need these future versions! Surely we'll want to have interoperable reference-by-url one day, right?

Where does all this affect the YAML core itself? In three ways:

- The YAML APIs should be defined so that they will work for a graph containing native data structures. This means storing all the state of a visitor/iterator in the visiting/iteration context as opposed to within the nodes themselves. It should be possible to apply a YAML visitor/iterator to any native data structure. This is achievable in Perl, Python, JavaScript and even in Java (with more effort). In C++ etc. we'll have to require the native data structure to cooperate somewhat - implement some interface etc.

- We need to define the color pattern for the mapping keys. We can define just one such pattern (today: single character indicator keys), or we can delve into the issue of how to solve the general problem of managing globally unique application specific colors (or ids in general).
- We need to have a way to make YAML files readable even when colored keys are used. At minimum, we have to ensure they are readable when the mapping layer's colored keys are used. This is too basic a use case to leave it to the full map syntax. We may also choose to try to provide a general mechanism for more readable application specific colored keys.

The first issue (API) is technically difficult. However, it is certainly possible to solve. Also, it has no bearing whatsoever on the core YAML format spec (it would affect the core API spec).

My proposal: Let's table it for a while. This doesn't affect Brian or anyone working on the high-level load/save API.

The second issue (Color patterns scheme) is politically difficult - or at least the general problem is.

Historically, XML has chosen a horribly complex way to do it (document-defined mapping from prefixes to URIs). Java has chosen a simple but verbose way to do it (reverse DNS strings); a shorthand mechanism ('import ...') battles verbosity. IANA has suggested a simple, terse but centralized way to do it (central registry of universal color patterns).

My proposal: Let's keep the "single indicator character" pattern reserved for this. Define a set of such keys in DATA-1.0: '!' for type, '#' for comment, and '&' for anchor; reserve some for future use; and reserve some for application-specific use.

As for the general problem, let's (gulp) just ignore it. This issue has been thrashed to death in XML-DEV and SML-DEV. Clark and I have seen many ideas to solve this problem, which nobody was happy with. I think simply ignoring it is the 80/20 benefit/cost point. And I really don't want to reconstruct all the arguments. Just trust us on this :-)

The third issue (readability) is a matter of taste. That may be the most difficult issue of all :-) The problem here is the conflict between a special, terse syntax for more readable files and a verbose but simple syntax for more intuitive files.
My proposal: use the %(<key><value> ...) shorthand for the color keys. The '%' makes it clear there's a map involved (or, for a complete newbie, that something fishy is going on). It is short, extendible, but not *too* extendible.

As for the general problem: I don't think we should do anything other than reserve some special keys for application specific use. As Joe pointed out, the use of the shorthand weakens the simplicity of the files, so it had better be reserved for sufficiently painful cases. Restricting the set of special keys ensures it will be so.

So, I propose:

- We review the information model - make it more formal and support only map/list/string/null.

- We unify Clark and Brian's proposals for auto-detection of a value, including the top-level value.

- We put the %(...) shorthand format in the core spec, but without describing the semantics of any special keys.

- We get an extremely stable YAML-CORE-1.0 spec.

- We create a YAML-DATA-1.0 spec describing # & ! (comment, type and reference).

- We start working on YAML-API-1.0 with reference implementations.

Have fun,

Oren Ben-Kiki
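The "colored glasses" idea in this message can be illustrated with a few lines of Python. This is only a sketch under stated assumptions: a color is taken to be a regular expression over map keys, the reserved color is assumed to be exactly the single-character indicator keys '!', '#' and '&' named above, and the function name `split_by_color` is hypothetical (it is not part of any YAML API).

```python
import re

# Assumption: the reserved "color" is any key made of exactly one
# indicator character. Only '!', '#' and '&' are named in the message.
INDICATOR_COLOR = re.compile(r"[!#&]")


def split_by_color(mapping, color=INDICATOR_COLOR):
    """Split a map into (colored, plain) parts according to a key pattern."""
    colored = {k: v for k, v in mapping.items() if color.fullmatch(k)}
    plain = {k: v for k, v in mapping.items() if not color.fullmatch(k)}
    return colored, plain


# A base-layer map: every key and value is plain text.
node = {"!": "com.clarkevans.Person", "given": "Clark", "family": "Evans"}
colored, plain = split_by_color(node)
print(colored)  # the keys a mapping layer would inspect
print(plain)    # what an application that ignores this color works with
```

A mapping layer would read only the `colored` part to decide how to build a native object, while the core parser never needs to know that those keys are special.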
From: Clark C . E. <cc...@cl...> - 2001-08-03 21:50:43
On Fri, Aug 03, 2001 at 03:45:13PM -0400, Clark C . Evans wrote:
| I think we are awfully close. Now that you mention it,
| I *do* like the idea of two information models:
|
| YAML-BASE: Untyped tree information model consisting
| of Map, List, Scalar, and Null.
|
| YAML-CORE: Typed graph information model building on
| the base model.

sample: %
    1:%
        ! : com.clarkevans.Person
        given : Clark
        family : Evans
    2:~
    3: 2.34e0
    4: "2.34e0"
    5: !com.clarkevans.Complex 2+3i
    6: &0001 Referenced
    7: !com.clarkevans.Complex &0002 5.2+3.5i
    8: *0002
    9:@
        : Lists cannot be typed
        : Scalars are typed via recognition
        : Or scalars are typed via !
        : Maps are typed via the ! key
        : User data that uses the ! as a key will be mis-interpreted.

When viewed through YAML-BASE, what you see is what you get. No stripping, etc. When viewed through YAML-CORE, the type indicators are stripped, references are expanded, etc. Frankly, I don't see how you can layer and have consistency.

Conclusion: Layering is not the holy grail. I respectfully withdraw YAML-BASE.

Best,

Clark
From: Oren Ben-K. <or...@ri...> - 2001-08-04 07:52:21
On Fri, Aug 03, 2001 at 03:45:13PM -0400, Clark C . Evans wrote:
| I think we are awfully close. Now that you mention it,
| I *do* like the idea of two information models:
|
| YAML-BASE: Untyped tree information model consisting
| of Map, List, Scalar, and Null.
|
| YAML-CORE: Typed graph information model building on
| the base model.

Base and Core? Hmmm. I'd rather stick with "core" and "data" or "interop"; "core" and "base" make it confusing about which one is more basic.

At any rate, I agree that your regexp-based solution to layering one on top of another doesn't work:

- I gather that:

      3: 2.34e0

  Should be a real and:

      4: "2.34e0"

  Should be a string. Correct? This doesn't work. Because these two are completely identical in the more basic data model (both are "string"). So it isn't possible to use a layered approach to implement this distinction. The same applies to:

      6: &0001 Referenced

  Vs.:

      Just what it looks: "&0001 Not Referenced"

- Regexps aren't enough to distinguish between types. They are OK for providing defaults, but not enough by themselves. Your examples already started to suffer from a little ugliness. I'd expect to be able to write:

      4: 2.34

  For a real number. Also, is:

      delivery: 1/3/2001

  Jan 3rd or March 1st? How about:

      image: ....base64.....

  Is it a GIF or a Gzipped BMP?

- Working at the base level becomes terrible. Consider:

      distance: &0001 2.34e0

  An application working at the lower level has no easy way to access the distance value. It *must* apply a whole host of regexps to it to remove any "color" which may have been applied to it. Which may be almost bearable for reading the value - but imagine what it does to *writing* it. Ugh.

- Taking a scalar value and smashing it into parts, using regexps, in order to create a map, just isn't simple or intuitive. Or, as I've shown above, robust or general enough.

> Conclusion: Layering is not the holy grail.

At least, not when done using the regexp method :-)

> I respectfully withdraw YAML-BASE.
Well, I agree if you are referring to this regexp method. Back to my shorthand form, I don't see that it suffers from any of the issues I listed above.

> Frankly, I don't see
> how you can layer and have consistency.

Well, let's see. Since the shorthand form is already a map in the base layer, then:

- There's no problem for "2.34" being a real, text, BCD or what have you.

- It is possible to use regexps so that the default type would be 'real':

      real: 2.34
      BCD: %(!bcd) 2.34
      text: %(!text) 2.34

- Generic access to the value is possible using the v/w operations. This would work at both layers.

- As for "stripping" the color, anchors etc. I've seen this mentioned several times as a problem, and I'm baffled by it. What exactly is the problem?

- As for simplicity. I see the lowest level as being the simplest possible data model which has 1-1 mapping to a YAML file. On the other hand, the higher data model is hardly a YAML data model, since it allows any native data structure to be used. I don't think we even have a good definition of it. "Anything at all", perhaps... At any rate, it isn't simple.

I see a lot of sense in doing a split and using a layered approach.

The lowest "core" level is extremely stable. The design decisions made in it are independent of the application, programming language, character set etc. The format is very simple. This is a rock-solid foundation one can safely build upon.

The deserialization layer is volatile. To be fully specified, it needs dictionaries of recognized types, reference mechanisms etc. all of which are application and language dependent. There are environment issues. There are round trip issues. There are cross-language and cross-application interoperability issues. It is a grand mansion raised on top of the core foundation. Everyone will want one which is slightly different.
By keeping them separate, I seek to prevent the situation where we'll be changing the YAML spec every time someone comes up with a new must-have data type or reference mechanism for an important new class of applications (ISO dates? PNG image format? URL based references? MIME based? key based?). We'll be able to have all this fun in the deserialization spec without changing one bit in the core YAML spec. In fact, I'd rather make it clear from day 1 that all the above issues rightfully belong to the specific YAML document schema. Our higher-level YAML spec won't be "here is the way you must do references"; it would be "here are some ways to do references, you can use them in your schemas". Likewise for types: "here are some types you can use in your schemas". That's why I wanted to call it YAML-INTEROP. It would define a small set of types and reference mechanisms so that by sticking to this set you would ensure interoperability with every YAML implementation following this spec. Specifically, we should make this set the minimal set which would allow interoperation between Perl, Python, Java, JavaScript and C++ programs. That's not a large set. If someone wants to go beyond this set, that's OK. It is still YAML. If someone wants to use less than this set, that's also OK. It is still YAML. In contrast, one can't change one bit in the core spec and still be YAML. Let's keep them separate. Have fun, Oren Ben-Kiki |
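To make the layering argument concrete, here is a minimal sketch of a deserialization layer sitting on top of a text-only base tree. It rests on several assumptions that the thread does not settle: the %(!type) shorthand is taken to load as a base-layer map with a '!' key for the type and a '=' key for the value, the type names `bcd` and `text` come from the example in this message, and Python's `Decimal` merely stands in for a BCD type. The names `resolve` and `CONSTRUCTORS` are hypothetical.

```python
import re
from decimal import Decimal

# Default recognition: an untyped scalar that looks like a real becomes a float.
REAL = re.compile(r"[-+]?\d+\.\d+(?:[eE][-+]?\d+)?")

# Hypothetical type dictionary; a real YAML-DATA/INTEROP spec would define it.
CONSTRUCTORS = {"bcd": Decimal, "text": str}


def resolve(node):
    """Map one base-layer node (text or map) to a native value."""
    if isinstance(node, dict) and "!" in node:
        # Explicit type: the '!' key overrides any pattern-based default.
        return CONSTRUCTORS[node["!"]](node["="])
    if isinstance(node, str) and REAL.fullmatch(node):
        return float(node)
    return node


# What the core parser might hand over for the three example lines.
base = {
    "real": "2.34",
    "BCD": {"!": "bcd", "=": "2.34"},
    "text": {"!": "text", "=": "2.34"},
}
native = {key: resolve(value) for key, value in base.items()}
print(native)
```

The base tree stays a plain map/text structure throughout; swapping in a different `CONSTRUCTORS` table changes the deserialization layer without touching the core.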
From: Brian I. <in...@tt...> - 2001-08-06 12:44:58
NOTE: This is a repost from a few days ago. My mailer was having problems.

On 02/08/01 14:14 -0400, Clark C . Evans wrote:
> On Thu, Aug 02, 2001 at 05:19:06AM -0700, Brian Ingerson wrote:
> | 4) Our scalars must be able to represent integers and reals in a simple
> |    way. Python and others need this to round trip. I think the best
> |    solution is to use simple scalars to mean numeric data and quoted
> |    strings to encode strings that look like numbers.
> |
> |    integer : 42
> |    string : "42"
> |    real : -12.34
> |    string : "-12.34"
>
> This is a tough one, in essence I think you have a short-hand
> for indicating that a scalar node has one of three types:
>
> - Integer
> - Float
> - String
>
> This definitely adds type/class to the information model...

No. It just adds string, int, float to the info model. They are native in almost all YAML languages. And they must have a short-hand syntax to be usable.

>
> | 5) This is a big one, so take a seat.
>
> | proposal: @
> |     : Every data type has a sigil. (%@|-"!=)
> |     : =
> |         The sigil is used when the data begins on the next line
> |         (or is an empty list or map)
> |     : The sigil can be omitted if the context can be easily
> |         inferred.
>
> Ok. The primary impact of this seems to require that the : character
> be restricted for multi-line simple scalars, unless a new indicator
> is present. I don't like = for that indicator... since it means
> 'default node' how about ^?

I would word that differently. I would say the primary impact is that *every* data type has a unique sigil which is used after the ':' if the data doesn't start on the same line. It can be omitted if the context can be easily inferred from the following line.

I don't have a strong opinion on the character, since it can almost always be omitted. Perhaps '$' is the logical choice for both quoted and simple scalars.

>
> | 6) FYI, we don't need to use "----" to separate YAML documents. We can
> | still use a blank line.
That's because a blank line in a scalar would
> | not really be blank. It would contain indentation spaces. This is
> | true in every case. We could make the separator be two or more
> | newlines in a row in the stream. This might be cleaner.
>
> Hmm. I guess this would work...
>
> | 7) I just read Oren's comments on attributes and I whole-heartedly
> | agree. (I'm agreeing with Oren on almost everything these days. What
> | a difference :) YAML must be represented by only scalars, lists, and
> | maps at the core. Perl, Python, Tcl, PHP, etc must be able to dump
> | and slurp with minimum effort. Then YAML becomes a visual
> | representation of the in memory layout of these language's native
> | data structures. And interchange becomes automatic. This is the key
> | selling point of YAML.
>
> I'm suggesting that we modify the information model to include
> exactly two attributes with every node: (a) type, and (b) name.
> Where the name attribute only occurs in the sequential information
> model with reference nodes.

Definitely not. See last post.

>
> | 8) As far as special keys go, I prefer a hybrid solution. Use '=' and '!' in
> | the YAML document, but parse them into '__=__' and '__!__' in memory. I've
> | always thought that. I'm just now voicing it.
>
> Right. Then to get = as data, one uses "=" for the key. Ok.

Exactly.

>
> | PS YAML is being quite well received by the few people I've shared it with.
> | People are already wanting to use it for various projects. Let's keep moving
> | forward.
>
> Very cool.
>
> Best,
>
> Clark
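Brian's point 4 (unquoted simple scalars carry numeric data, quoted strings carry text that merely looks numeric) can be sketched as a single loader function. This is an illustration only: the name `load_scalar` and the two patterns are assumptions, and quoting is handled in the most naive way (no escape sequences, no multi-line values).

```python
import re

# Assumed patterns for the two numeric forms shown in the example.
INT = re.compile(r"[-+]?\d+")
REAL = re.compile(r"[-+]?\d+\.\d+")


def load_scalar(text):
    """Turn the text of a one-line scalar into a native Python value."""
    if len(text) >= 2 and text[0] == '"' and text[-1] == '"':
        # Quoted: always a string, even if the content looks like a number.
        return text[1:-1]
    if INT.fullmatch(text):
        return int(text)
    if REAL.fullmatch(text):
        return float(text)
    # Anything else stays an ordinary string.
    return text


for raw in ['42', '"42"', '-12.34', '"-12.34"']:
    value = load_scalar(raw)
    print(raw, "->", type(value).__name__, repr(value))
```

Dumping would apply the reverse rule: a native string whose content matches `INT` or `REAL` has to be written with quotes so that it round-trips as a string.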
From: Clark C . E. <cc...@cl...> - 2001-08-03 19:16:24
On Fri, Aug 03, 2001 at 02:02:15PM +0200, Oren Ben-Kiki wrote:
| I'm going to play the "purist" here. What I want is for the *core* info
| model to be a *tree* of map, list, null and text scalar.
...
| Instead, we'll ensure the YAML APIs will allow for a *graph* whose nodes
| include "native data object" as a valid node. Note that this node may be a
| non-leaf node, references and cycles are allowed, etc.

Ok. So we are going to make two information models then, one which is layered over the other?

My primary concern: I don't want a short-hand mechanism where what appears to be a scalar in the file is actually a map when loaded into memory. In other words, I don't want:

    tag: !class value
        ->
    tag: %
        !:class
        =:value

| How is it possible to reconcile these two opposing statements? The answer is
| (star wars music) "Trust the Color, Luke!". The color idiom is built on two
| concepts. One, the ability to put on "colored glasses" and see only keys of
| a certain "color". Two, the ability to easily and efficiently construct an
| application by layering different modules. These two reinforce each other.
| (Think of each "color" as a regexp pattern. If the key matches the pattern,
| it has this color. In XML, for example, a color would typically be
| 'some-namespace-prefix:.*' - let's not delve any deeper there :-).

Coloring?
    A layered mechanism for type determination of scalar values based on regular expression matching. This mechanism plus the base information model gives a "typed" model. Coloring only applies to scalars, and thus only a scalar can be typed with this mechanism?

Defaulting
    A mechanism where a map or list can be treated as a scalar. This provides for forward-compatible changes by allowing programs expecting a scalar to be given a map/list. Note that this mechanism depends upon the knowledge of what the application program is expecting, and thus it is a "pull" property and is not supported directly via language bindings.
| The core YAML layer (parser/printer) is as trivial as you can get - just
| handling a tree of map/list/string.

Ok.

| A different layer would handle mapping from this core data model to native
| objects. This layer could use specially colored keys to control this
| mapping: which native type to use, how to parse the value, how to identify
| references, etc. It could also use patterns on the values ("this looks like
| an integer"). It could use schema information ('delivery' is expected to be
| of type 'date'). It could use some combination of the above.
|
| Since this layer is not part of the core, I don't care much. So if Clark's
| application needs the concept of reference-by-key, Joe's needs
| reference-by-path, and Brian's needs reference-by-anchor, There's no problem
| involved. We can all do it without messing up our simple, clean YAML core.
| In fact, it may be possible to mix and match their implementations in one
| application, as long as they are well behaved.

Ok. I think I got it. We could have a type layer which parses the scalar

    key: !java.lang.Date 2001-3-23

as a Date object for 23 Mar 2001. To give a map a class name we could, at this layer, have the convention of using the "!" key. Therefore...

    key: %
        ! : com.clarkevans.DateRange
        from: 2001-3-23
        to : 2001-3-23

Which would be a map, having the given class. In both of these cases, the YAML is "stupid", but the higher layer that knows about objects can take hold. Similarly, at this layer, Brian's types can be handled...

    int: 34
    string: "34"
    float: 3.4

| - The YAML APIs should be defined so that they will work for a graph
| containing native data structures. This means storing all the state of a
| visitor/iterator in the visiting/iteration context as opposed to within the
| nodes themselves. It should be possible to apply a YAML visitor/iterator to
| any native data structure. This is achievable in Perl, Python, JavaScript
| and even in Java (with more effort). In C++ etc.
we'll have to require the
| native data structure to cooperate somewhat - implement some interface etc.

Ok. Thus... a coloring layer for references would intercept scalar values having & and * via regular expression matching and re-writing.

    val: &001 23.45
    ref: *001

would add the "anchor" to the mix, and produce a de-normalized view:

    val: 23.45
    ref: 23.45

Ok? But this, once again, is done in a higher layer. So the base YAML would see the above as it is... complete with the & and *.

| - We need to define the color pattern for the mapping keys. We can define
| just one such pattern (today: single character indicator keys), or we can
| delve into the issue of how to solve the general problem of managing
| globally unique application specific colors (or ids in general).
|
| - We need to have a way to make YAML files readable even when colored keys
| are used. At minimum, we have to ensure they are readable when the mapping
| layer's colored keys are used. This is too basic a use case to leave it to
| the full map syntax. We may also choose to try to provide a general
| mechanism for more readable application specific colored keys.

I'm not sure you can limit coloring to any particular notion of "keys" as if the colors may be stripped... I'm skeptical that this would work. Take the reference example above. Stripping the colors would leave "ref" as a blank string and this is clearly not acceptable.

| The first issue (API) is technically difficult. However it is certainly
| possible to solve. Also, it has no bearing whatsoever on the core YAML
| format spec (it would affect the core API spec).
|
| My proposal: Let's table it for a while. This doesn't affect Brian or anyone
| working on the high-level load/save API.

I'm not sure that we can just ignore this issue as I'm not convinced that layering can work without making N YAMLs (one for each possible combination of layers). And N YAMLs is bad.
| The second issue (Color patterns scheme) is politically difficult - or at | least the general problem is. | | Historically, XML has chosen a horribly complex way to do it | (document-defined mapping from prefixes to URIs). Java has chosen a simple | but verbose way to do it (reverse DNS strings); a shorthand mechanism | ('import ...') battles verbosity. IANA has suggested a simple, terse but | centralized way to do it (central registry of universal color patterns). Nod. | My proposal: Let's keep the "single indicator character" pattern reserved | for this. Define a set of such keys in DATA-1.0: '!' for type, '#' for | comment, and '&' for anchor; reserve some for future use; and reserve some | for application-specific use. Ok. I'll respond to the rest under a separate cover. Best, Clark |
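The reference-coloring layer Clark describes in this message (intercept scalar values carrying & and *, then rewrite them into a de-normalized view) can be sketched as follows. This is a rough illustration under assumptions: it handles only a flat map of text values, the two patterns are guesses at the intended syntax, and `expand_references` is a hypothetical name.

```python
import re

# "&001 23.45" defines anchor 001 with value "23.45"; "*001" refers to it.
ANCHOR = re.compile(r"&(\w+)\s+(.*)", re.DOTALL)
ALIAS = re.compile(r"\*(\w+)")


def expand_references(flat_map):
    """Return a de-normalized copy of a flat map of text scalars."""
    anchors = {}
    result = {}
    # First pass: strip anchors and remember the values they name.
    for key, value in flat_map.items():
        match = ANCHOR.fullmatch(value)
        if match:
            anchors[match.group(1)] = match.group(2)
            result[key] = match.group(2)
        else:
            result[key] = value
    # Second pass: replace every known alias by the anchored value.
    for key, value in list(result.items()):
        match = ALIAS.fullmatch(value)
        if match and match.group(1) in anchors:
            result[key] = anchors[match.group(1)]
    return result


base_view = {"val": "&001 23.45", "ref": "*001"}
print(expand_references(base_view))
```

The base layer still sees `base_view` exactly as written, complete with the & and *, which is the property debated in the rest of the thread: a literal string that happens to start with '&' cannot be told apart from an anchor at this level.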
From: Clark C . E. <cc...@cl...> - 2001-08-03 19:32:31
On Fri, Aug 03, 2001 at 02:02:15PM +0200, Oren Ben-Kiki wrote: | As to the '---' separator issue. I think that Clark got it right in | suggesting the top level production would be either a simple map (just one) | or a simple list (just one). Log files etc. will just use the list | production: | | : % | first map... | : % | second map... | | For that matter, we could also allow a scalar as the top level value, now | that we have the indicator to disambiguate it. In fact, we could use the | exact same auto-detection algorithm for the whole file as we'll use for a | value within the file: | | File/value starts with '@' or ':' -> list. | File/value starts with '=' or '|' -> scalar. | File value starts with '%' or anything else -> map. Nice, but "=" should not be the indicator for a scalar (since it represents the default value). Even $ is better, although I like ^ the best. Best, Clark |
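The auto-detection rule quoted in this message is small enough to write down directly. The sketch below is an assumption-laden illustration: `detect_kind` is a hypothetical name, leading whitespace is skipped before looking at the first character, empty input falls under "anything else", and the scalar indicator is a parameter so that Clark's preferred '^' can be tried in place of '='.

```python
def detect_kind(text, scalar_indicators="=|"):
    """Classify a file or value as 'list', 'scalar' or 'map' by its first character."""
    first = text.lstrip()[:1]
    if first in ("@", ":"):
        return "list"
    # Guard against the empty string, which is "in" every Python string.
    if first and first in scalar_indicators:
        return "scalar"
    # '%' or anything else, including a quoted first key, means a map.
    return "map"


print(detect_kind(": %"))            # list (log-file style top level)
print(detect_kind("= 42"))           # scalar
print(detect_kind('"quoted key": v'))  # map whose first key is quoted
print(detect_kind("^ 42", scalar_indicators="^|"))  # scalar under Clark's variant
```

Because the same function serves both the top-level production and nested values, the two proposals reduce to one rule, as Oren notes.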
From: Clark C . E. <cc...@cl...> - 2001-08-03 19:39:42
On Fri, Aug 03, 2001 at 02:02:15PM +0200, Oren Ben-Kiki wrote:
| My proposal: use the %(<key><value> ...) shorthand for the color keys. The
| '%' makes it clear there's a map involved (or, for a complete newbie, that
| something fishy is going on). It is short, extendible, but not *too*
| extendible.

I think we are awfully close. Now that you mention it, I *do* like the idea of two information models:

YAML-BASE: Untyped tree information model consisting of Map, List, Scalar, and Null.

YAML-CORE: Typed graph information model building on the base model.

I'm completely opposed to the %() thingy. If someone has a map, then they can use special __type__ keys to designate the type of the map, etc.

One moves from the YAML-BASE to YAML-CORE with several regular expressions which act *only* on Scalar values. There are two categories of regular expressions:

A. Those that recognize specific indicator keys. This category includes:

   1. A mechanism which uses &0001 and *0001 per our first specification to build a graph.

   2. A mechanism which uses !type to identify a particular class or data type.

B. Those that recognize specific data types for simple one line, unquoted scalar values:

   1. Integer detection (2345)
   2. Float detection (2.34e0)
   3. Rational detection (3.45 and 2 3/8)
   4. Date/Time detection (2001-03-23, 23 MAR 2001)
   5. Complex detection (3+8i)
   etc.

A binding to YAML then specifies through a mask which regular expressions are in effect as determined by what is supported by the given language. Thus, if date/time is supported by the language, then the date/time detection is enabled; otherwise, a string is the returned value.

To re-iterate however, I'm very much opposed to the idea of a "scalar" in the YAML document becoming a "map" in the information model. Bad idea. For those "special" attributes (such as type and reference to) it is better to have a layered information model.

| - We review the information model - make it more formal and support only
| map/list/string/null.
Ok, but only under the recognition that this is the BASE information model, and not the CORE or normal information model.

| - We unify Clark and Brian's proposals for auto-detection of a value,
| including the top-level value.

Ok.

| - We put the %(...) shorthand format in the core spec, but without
| describing the semantics of any special keys.

I'd rather do something like the above.
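The category B recognizers and the per-language "mask" described in this message might look roughly like the sketch below. Everything here is an assumption made for illustration: the patterns are simplified, the mask is modeled as a set of enabled recognizer names, Python's `Fraction` stands in for a rational type, and the mixed form "2 3/8", the "23 MAR 2001" date form and complex numbers are left out.

```python
import re
from datetime import date
from fractions import Fraction

# (name, pattern, converter), tried in order; patterns are simplified guesses.
RECOGNIZERS = [
    ("integer", re.compile(r"[-+]?\d+"), int),
    ("float", re.compile(r"[-+]?\d+\.\d+[eE][-+]?\d+"), float),
    ("rational", re.compile(r"[-+]?\d+\.\d+"), Fraction),
    ("date", re.compile(r"\d{4}-\d{2}-\d{2}"), date.fromisoformat),
]


def recognize(text, enabled):
    """Convert an unquoted one-line scalar using only the enabled recognizers."""
    for name, pattern, convert in RECOGNIZERS:
        if name in enabled and pattern.fullmatch(text):
            return convert(text)
    # No enabled recognizer matched: the value stays a string.
    return text


everything = {"integer", "float", "rational", "date"}
no_dates = {"integer", "float", "rational"}  # binding for a language without dates

print(recognize("2.34e0", everything))
print(recognize("2001-03-23", everything))
print(recognize("2001-03-23", no_dates))  # falls back to the plain string
```

The last call shows the behavior Clark describes: with date detection masked off, the same scalar is simply returned as a string.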