From: Oren Ben-K. <or...@ri...> - 2001-06-19 08:10:37
Yet another long post... But I think we are getting somewhere, so it may be worth reading anyway :-)

Clark C . Evans [mailto:cc...@cl...] wrote:

> I look at it this way:
>
> a) We eliminate the short-hand notation so that it is
>    not confusing what is a map and what is a scalar.

After much agonizing, I independently came to this conclusion. In fact, I'd like to take it a bit further.

Let's finalize the YAML 1.0 spec purely as a file format (plus requirements from the data model, of course). Drop all the API stuff, the discussions of color, etc. Give up on the !class shorthand, but we'll reserve all single-indicator-character keys for "special semantics" to be defined in separate API specs. Call this spec YAML-CORE. I could probably finalize it in a weekend (given we settle the binary/Unicode issue).

There are going to be two types of APIs. Let's call them "Native" and "Incremental" APIs. I suggest we spec them in a separate document from the core YAML spec. First, we'll close specs faster. Second, including just one API in the core spec would give it more importance than the other, which would be wrong. Each of them has its place.

The "Native" API is a simple load/save API using the language's native data structures. We'll spec this quickly (once per language?), and probably implement it as quickly. We'll end up with working, very useful YAML support for Perl (Data::Denter), Python, C#, Java, JavaScript, and possibly C++ (based on STL; a bit of a challenge, unless we use RTTI). I suggest that to make quick progress, we just include a simple definition of v/w in such specs (as a library function), but not waste too much time hammering out the details of color behavior. Call this the YAML-NATIVE-API spec.

We then move our focus to defining the "Incremental" API, which is a (mostly) language-independent push/pull API spec which fully supports color etc. This would be the YAML-INCREMENTAL-API spec. Much fun here...
(I admit to not catching up on the work you've done there yet).

In a fourth spec we'll go through the color idiom, what the standard colors are, what standard filters an implementation should/could provide, etc. Call this YAML-COLOR.

Quite a job, right? But it drives home the point that we simply can't try to do it all in one spec. We'll never finalize it; we'll be debating a YAML-INCREMENTAL-API issue, say, and in the meantime we are not releasing a YAML-CORE spec, because both are included in the same document.

> b) We state the obvious in the spec -- adding items
>    to a map will, in almost every case, be a safe
>    addition and unlikely to break programs depending
>    upon its structure.

Agreed. This is a YAML-COLOR issue. Keep it out of YAML-CORE, except maybe a passing statement.

> c) We state that promoting a scalar to a map or list
>    may only be supported by particular programs and that
>    this technique must be used with caution. We can
>    underscore those cases where a structural upgrade
>    will work, when the applications accessing the
>    information:

Agreed. Again, a YAML-COLOR spec issue.

>    1) are written to use the visitor/iterator
>       interface and not dependent on a particular
>       in-memory representation; or

That is, are written using the Incremental API.

>    2) are written in languages which use a
>       YamlNode instead of native data structures; or

Just another form of the Incremental API (I should hope). I claim that YamlNode should be the basis for the push API (the visitor visits such nodes) and hence the pull API (keep a stack of YamlNode, for example).

>    3) are written to use the (w/v) methods as
>       described.

That is, using the add-on from the Native API. Yes.

> d) We also talk about the "standard" colors, such
>    as class, comments, classes. And we discuss the
>    base filters, in particular, how the standard
>    colors can be added (with appropriate promotions)
>    and used with unexpecting older code if the
>    appropriate stripper is used...

Definitely YAML-COLOR.
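To make the YamlNode remark above concrete, here is a minimal sketch of how one node type could serve both a push (visitor) and a pull (iterator over an explicit stack) interface. Everything in it is an assumption made for illustration: the `YamlNode` fields and the names `push_visit` and `pull_iter` are hypothetical and are not taken from any existing spec or implementation.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class YamlNode:
    """Hypothetical node: a scalar value, or (key, child) pairs for a map or list."""
    value: Optional[str] = None
    children: List[Tuple[str, "YamlNode"]] = field(default_factory=list)


def push_visit(node, visitor, path=()):
    """Push style: the library walks the tree and calls the visitor on every node."""
    visitor(path, node)
    for key, child in node.children:
        push_visit(child, visitor, path + (key,))


def pull_iter(root) -> Iterator[Tuple[tuple, YamlNode]]:
    """Pull style: the caller asks for the next node; a stack of nodes replaces recursion."""
    stack = [((), root)]
    while stack:
        path, node = stack.pop()
        yield path, node
        # Push children in reverse so they come back out in document order.
        for key, child in reversed(node.children):
            stack.append((path + (key,), child))
```

Both functions produce the same (path, node) sequence in document order; the only difference is who drives the traversal.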
> Overall, I think it is important to underscore that the
> w/v mechanism is optional. Most users of YAML will
> use them if they are "libraries" and don't have control
> of the YAML. However, one-offs probably won't. Proper
> filters, of course, would have to be written robustly.

I'd want v/w to be a "must" part of the Native API, in the sense that "you aren't YAML if you don't provide them" (they are trivial enough to implement). But they are clearly optional in the sense that you don't have to use them.

> On the subject of encoding. I'm starting to like the
> idea of using a ^ indicator for this purpose.

OK, wait a sec. You've written:

> I was taking a walk with my g.f. and discussing this item,
> it seems that there is a tripartite situation:
>
>           | Java / C#  | Python  | Perl                       |
> ----------+------------+---------+----------------------------+
>  ASCII    | String     | String  | Scalar (UTF8 or Otherwise) |
>  UNICODE  | String     | Unicode | UTF8 Marked Scalar         |
>  BINARY   | Byte []    | String  | Not UTF8 Marked Scalar     |

So, if I'm reading this right, you are suggesting we distinguish between "ASCII text" and "Unicode text". In the information model. Hmmm. Let's see.

In Java/C#, being modern languages, there's no problem using just two types. That's obviously the "right thing to do". Doing three types would complicate things for them - how would you round trip an ASCII string via Java? Forcing such strings to become maps is the wrong answer :-)

In Perl/Python, being older languages, there's an issue using just two types. This ties in with their problems supporting Unicode in the first place.

Now, if a three-way split let us avoid heuristics altogether - the system would always do the right thing on both read and write - I'd consider it. But the table above shows that, at a minimum, on output you must use heuristics to distinguish between binary and ASCII (in Python both are a plain String). So, we must rely on heuristics to "do the right thing" in most cases.
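A minimal sketch of such an output-side heuristic, written in present-day Python for brevity. The function name `guess_kind` and the particular rule (7-bit bytes with only tab, LF and CR as control characters count as text) are assumptions for illustration, not something the spec defines.

```python
# Control characters that commonly appear in plain ASCII text.
ALLOWED_CONTROLS = {0x09, 0x0A, 0x0D}  # tab, line feed, carriage return


def guess_kind(data: bytes) -> str:
    """Guess whether an 8-bit string should be printed as "text" or "binary"."""
    for byte in data:
        if byte >= 0x7F:
            # DEL or any byte with the high bit set: not plain ASCII text.
            return "binary"
        if byte < 0x20 and byte not in ALLOWED_CONTROLS:
            # NUL and other control characters are unlikely in text.
            return "binary"
    return "text"
```

Note that a short all-ASCII binary value such as `b"GIF89a"` is classified as text, which is exactly the kind of wrong guess a heuristic cannot rule out.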
We also must provide some mechanism to override the heuristic if it guesses wrong. E.g., the printer will get an optional list of path expressions which must be printed as binary or ASCII.

In Python (and only in Python) there's also the problem of reading. A reasonable heuristic would be to de-serialize any purely-ASCII text as an ASCII string ("use Unicode only when you must"). Again, the API must provide a way to change this heuristic (to "Always Unicode for text"), and possibly a way to override this on a more fine-grained level (a list of path expressions would make sense).

I predict that with time the problem will slowly fade away. Perl and Python will come to terms with Unicode. So, with time the accuracy of our heuristic (16-bit == Text, 8-bit == binary) would rise. The "override" capabilities would be used less and less, and eventually we'll put the problem behind us.

If, on the other hand, we enshrine the concept of "ASCII string" in the YAML information model, we'll be stuck with it. For example, whatever tricks we do in the Java/C# API to support it will haunt us forever.

In short, the issue is ugly. I'd rather concentrate the ugliness in one single, well-defined point (using heuristics in the parser/printer of Python/Perl) so that it doesn't harm anything else (our information model, the APIs for Java, C# and future languages, etc.).

If we agree on the above, there are only two (easy) issues to settle to get the spec to a "pre-release" state (pre-release for me means we concentrate on polishing issues - language, examples, formatting, etc.). The two issues are whether we define a canonical form or not, and the details of the 32-bit characters.
I've no problem with defining a canonical form. I think this covers it:

  - 4 space indentation;
  - multi-line values start on a line of their own;
  - use the simplest possible text style;
  - put a single space between the key and the following ':', and one space between the ':' and the value;
  - use "most folded" white space for text;
  - break lines after 76 characters but include at least 32 text characters;
  - use prefix ':' in lists only for multi-line simple scalars;
  - align multi-line quoted string text by one space as in our examples.

The problem is that at least some of these decisions are a matter of taste. It would be different if we defined the canonical form as "the form with the minimal number of characters" - but then it may not end up being very readable. And of course we may simply not define one at all. Thoughts?

As for 32-bit characters, I haven't done my homework on that. What does this mean, exactly? Is it OK to say that YAML uses 16-bit characters and that the 32-bit characters are always represented as a "surrogate pair"? So far I avoided dealing with these monstrosities - 16-bit Unicode seemed quite enough, even when dealing with Japanese clients. Clark, can you handle this issue?

Have fun,

Oren Ben-Kiki
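On the surrogate-pair question above: a minimal sketch of the standard UTF-16 arithmetic for characters beyond U+FFFF. The function names are hypothetical; the constants are those of UTF-16 itself. Whether YAML should adopt this representation is still the open question, and the code only shows what "represented as a surrogate pair" would mean.

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a code point above U+FFFF into a (high, low) pair of 16-bit units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

For example, U+1D11E (musical G clef) becomes the pair (0xD834, 0xDD1E), and `from_surrogate_pair` maps that pair back to 0x1D11E.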