On Tue, Jun 19, 2001 at 10:11:10AM +0200, Oren Ben-Kiki wrote:
| Let's finalize the YAML 1.0 spec purely as a file format (plus requirements
| from the data model, of course). Drop all the API stuff, the discussions of
| color, etc. Give up on the !class shorthand, but we'll reserve all
| single-indicator-character keys for "special semantics" to be defined in
| separate API specs. Call this spec YAML-CORE. I could probably finalize it
| in a weekend (given we settle the binary/Unicode issue).
I'd rather have this all be one-spec to give the sense
of unity. That being said, there is nothing preventing
this spec from having multiple sections, and adopting
each section independently.
So... let's finalise on the binary/unicode issue.
There does not seem to be a good solution at the syntax
level, so let's punt the issue for now. Perhaps the API
may have a binary/unicode distinction, but let us leave
the YAML 1.0 file format unicode only.
As for dropping the class notation, I spoke with
Brian on the phone yesterday and he seemed resistant
to this idea. So, a bit more thought may have to
go into this.
| There are going to be two types of APIs. Let's call them "Native" and
| "Incremental" APIs. I suggest we spec them in a separate document from the
| core YAML spec. First, we'll close specs faster. Second, including just one
| API in the core spec would give it more importance than the other, which
| would be wrong. Each of them has its place.
| The "Native" API is a simple load/save API using the language native data
| structures. We'll spec this quickly (once per language?), and probably
| implement it as quickly. We'll end up with a working, very useful, YAML
| C++ (based on stl; a bit of a challenge, unless we use RTTI). I suggest that
| to make quick progress, we just include a simple definition of v/w in such
| specs (as a library function), but not waste too much time hammering out the
| details of color behavior. Call this the YAML-NATIVE-API spec.
| We then move our focus to defining the "Incremental" API, which is a
| (mostly) language-independent push/pull API spec which fully supports color
| etc. This would be the YAML-INCREMENTAL-API spec. Much fun here... (I admit
| to not catching up on the work you've done there yet).
Ok. Your feedback on the yaml.h file would be good...
| In a fourth spec we'll go through the color idiom, what the standard colors
| are, what standard filters an implementation should/could provide, etc. Call
| this YAML-COLOR.
| Quite a job, right? But it drives home the point we simply can't try to do
| it all in one spec. We'll never finalize it; we'll be debating a
| YAML-INCREMENTAL-API issue, say, and in the meanwhile we are not releasing a
| YAML-CORE spec - because both are included in the same document.
Well... I think I'd like to have them in separate "sections"
of the same specification. We can finish the serialization
format and label this section FINAL, while leaving the other
sections open.
Also, this would allow for another section (very small)
called "information model".
| I'd want v/w to be a "must" part of the Native API, in the sense
| that "you aren't YAML if you don't provide them" (they are trivial
| enough to implement). But they are clearly optional in the sense
| that you don't have to use them.
| > On the subject of encoding. I'm starting to like the
| > idea of using a ^ indicator for this purpose.
| OK, wait a sec. You've written:
| > I was taking a walk with my g.f. and discussing this item,
| > it seems that there is a tripartite situation:
| >           | Java / C#  | Python  | Perl                       |
| > ----------+------------+---------+----------------------------+
| > ASCII     | String     | String  | Scalar (UTF8 or Otherwise) |
| > UNICODE   | String     | Unicode | UTF8 Marked Scalar         |
| > BINARY    | Byte       | String  | Not UTF8 Marked Scalar     |
| So, if I'm reading this right, you are suggesting we distinguish between
| "ASCII text" and "Unicode text". In the information model. Hmmm.
I respectfully withdraw the ASCII suggestion.
| I predict that with time the problem will slowly fade away.
| Perl and Python will come to terms with Unicode. So, with
| time the accuracy of our heuristic (16-bit == Text,
| 8-bit == binary) would rise. The "override" capabilities
| would be used less and less, and eventually we'll put the
| problem behind us.
| I've no problem with defining a canonical form (4 space indentation,
| multi-line values start on a line of their own, use the simplest possible
| text style, put a single space between the key and the following ':', one
| space between the ':' and the value, use "most folded" white space for text,
| break lines after 76 characters but include at least 32 text characters, use
| prefix ':' in lists only for multi-line simple scalars, align multi-line
| quoted string text by one space as in our examples - I think that covers
| it). The problem is that at least some of these decisions are a matter of
| taste. It would be different if we defined the canonical form as "the form
| with the minimal number of characters" - but then it may not end up being
| very readable. And of course we may simply not define one at
| all. Thoughts?
I think we can discuss the canonical form when looking
at the writer code in the impl I'm (slowly) working on.
We can codify this... and then debate if the right
balance was chosen. As for the prefix ":", this will
definitely not be in the canonical form, as Brian has
objected to it. For multi-line scalars in a list
either the block or quoted form will be used ...
| As for 32-bit characters, I haven't done my homework on that.
| What does this mean, exactly? Is it OK to say that YAML uses
| 16-bit characters and that the 32-bit characters are always
| represented as a "surrogate pair"?
Yes. We are using Unicode, so one may use UTF8, UTF16, or UTF32
as an internal character model; they are all encodings of the
same character set, UCS4. Recently UCS4 has been restricted to
those code values expressible by Unicode: 0 through 0x10FFFF, about
1.1 million code points (21 bits).
Thus the ISO universal character set is now the same character
set used by Unicode. So much of this "confusion" is now gone.
UTF8, UTF16, and UTF32 all encode exactly the Char production
in the YAML spec. Nothing more, nothing less.
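
To make the point about encodings concrete, here is a minimal Python
sketch (not part of any YAML implementation) that takes one code point
above 0xFFFF and encodes it three ways. The code point U+20000 is an
arbitrary choice for illustration only, and the variable names are
made up for this example. It shows that UTF16 carries the character as
a surrogate pair, and that all three encodings decode back to the same
character.

```python
# One code point above 0xFFFF, chosen arbitrarily for illustration.
ch = "\U00020000"

# Encode the same character in the three Unicode encodings
# (big-endian variants, so no byte order mark is emitted).
utf8 = ch.encode("utf-8")
utf16 = ch.encode("utf-16-be")
utf32 = ch.encode("utf-32-be")

# In UTF16 the character occupies two 16-bit units: a surrogate pair.
high = int.from_bytes(utf16[:2], "big")
low = int.from_bytes(utf16[2:], "big")

# Recombine the surrogate pair into the original code point.
code_point = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

print(utf8.hex(), utf16.hex(), utf32.hex())
print(hex(high), hex(low), hex(code_point))

# All three byte sequences decode to the same single character.
assert utf8.decode("utf-8") == ch
assert utf16.decode("utf-16-be") == ch
assert utf32.decode("utf-32-be") == ch
```

So a binding that stores 16-bit units internally sees two units for
this character, while a UTF32 binding sees one; both describe the same
Char.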
Of course we cannot dictate what Unicode encoding a language
binding uses, it could be UTF8 (Perl). In C, wchar_t is often
either UTF16 or UTF32, depending upon the unix platform.
I understand that the unicode/ucs values over 0xFFFF have not
been assigned, although I have heard that many Chinese characters
which are not yet represented by the universal character set
will be put into this region.