On Fri, Aug 10, 2001 at 12:29:13PM +0200, Oren Ben-Kiki wrote:
| Yes. Basically, it can use whatever representation it
| wants; the only requirement is to warn if data is being
| lost. Which, it turns out, is impossible to do in the
| current spec... Due to one specific use case.
| Here it is. Suppose my system read a YAML file. It
| recognized all classes. It used a set of implicit types
| (null, reference, int and float). It didn't know about,
| say, tuples. So it read the following two values:
| value1: (1,2)
| value2: "(1,2)"
Since value1 doesn't start with an alphabetic character,
it is implicitly typed. Thus, if an application doesn't
support a regex which matches (1,2) then it must warn
about the conversion of value1 into a string.
Also, even "built-in" types such as integer aren't
safe with a parser written in REXX or some other
language that doesn't support integers...
| This way the system will at least complain if it loses the
| type info. It follows that the only implicit types that may
| be safely used are "required" types - say int and float.
| User-defined types should always use a marker.
Nope. See above.
| - Do something creative: Use a single '!' as an
| indicator for "implicit type used":
| value1: !(1,2)
| value2: (1,2)
Yes, but there's a much nicer way (due to Brian's brain) -- simple
scalars start with an alphabetic character, implicit scalars
can start with stuff like ( [ $ and numbers. This way,
if a language/parser can't support the implicit type,
then it must give a warning upon conversion of the implicit
scalar's value to a string.
This also means that we must have a global registry
for implicit regex types. Support isn't mandatory, but
supporting a different type through the same regex is
against the spirit of the specification.
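
A minimal sketch of that rule in Python. The registry name
(IMPLICIT_TYPES), the function name and both patterns are
assumptions made up for illustration; only int and float are
registered, so anything else that is implicitly typed falls
through to the warning.

```python
import re
import warnings

# Hypothetical registry of (type name, pattern, converter).
# The patterns are illustrative assumptions, not agreed spec text.
IMPLICIT_TYPES = [
    ("int", re.compile(r"[-+]?[0-9]+"), int),
    ("float", re.compile(r"[-+]?[0-9]+\.[0-9]+"), float),
]

def load_simple_scalar(text):
    """Resolve an unquoted scalar using the alphabetic-first-character rule."""
    if text[:1].isalpha():
        return text  # plain string, nothing is lost
    for _name, pattern, convert in IMPLICIT_TYPES:
        if pattern.fullmatch(text):
            return convert(text)
    # Implicitly typed, but no registered pattern matched: keep the text
    # and warn, because the type information is being discarded.
    warnings.warn(f"unrecognized implicit scalar {text!r} kept as a string")
    return text
```

With this sketch, load_simple_scalar("(1,2)") hands back the
string but raises a warning, while a scalar starting with a
letter passes through silently.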
| null: `~
| reference: `*12
| int: `12
| float: `12.5
| object: `tuple (1,2)
| text: "`12.5"
| Much better. This is suspiciously similar to a proposal
| made by Clark, which I didn't appreciate the significance
| of at the time. It also fits neatly with his restriction
| that implicit types won't start with an alphabetic character
| and that class types would be made out of (mostly) such
| characters. I guess I should have paid more attention...
Yes, Brian balked at this, so the ` is virtual... if the
simple scalar begins with an alphabetic character, it is
a string; otherwise it is implicitly typed.
currency: $34.34 USD
Grok? The first one starts with an alphabetic character, thus it
is a string, but the second starts with a special
character, so it is implicitly typed. We can then
register a global regex which matches the above
as a currency designation. If the parser/language
supports currency, then it can convert the above
into the appropriate type; otherwise the application
must be warned about the conversion to a string.
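
For illustration, a currency regex along these lines might
look like the sketch below. The exact pattern (two optional
decimal places, an optional three-letter code) and the name
parse_currency are assumptions; nothing like it has been
registered.

```python
import re
from decimal import Decimal

# Assumed pattern for scalars such as "$34.34 USD"; the currency code
# is optional. This is a guess at a registry entry, not an agreed one.
CURRENCY = re.compile(
    r"\$(?P<amount>[0-9]+(?:\.[0-9]{2})?)(?: (?P<code>[A-Z]{3}))?"
)

def parse_currency(text):
    """Return (amount, code) for a currency scalar, or None if no match."""
    match = CURRENCY.fullmatch(text)
    if match is None:
        return None
    return Decimal(match.group("amount")), match.group("code")
```

A loader without a currency type would skip this step and
warn that the scalar is being kept as a string.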
| Like I said, I don't want to stir things up again. I could
| live with either of these choices. But I *really* like Clark's
| ` notation better. If that's what he meant in the first place,
| that is, and if he still stands behind it, and if Brian finds
| it acceptable...
Brian doesn't like the tick... hence our alphabetic vs
| > | So, the proposed compromise:
| > | ...
| > | Agreed?
| > YES. This is a very good compromise.
| OK. Wearing my editor's "nit-picking" hat (it is a dirty job
| but someone has got to do it:-)...
| The required changes to the current spec would be:
| - Add the '!class' syntax back. Or perhaps `class if
| we want to solve the implicit type problem that way.
| Finalize restrictions on class names (must start with
| an alphabetic character, must not contain a space,
| anything else?).
!class is back
| - Change the information model accordingly.
| - Have a few paragraphs about loading to native data
| structures (the choice between discarding data,
| flattening it, or using alternative representation
| model). The guiding principle: use anything that
| works for you but warn if you are discarding data.
| Clarify how this relates to implicit as well as
| explicit types.
| - Define the "common" detected types - null and
| reference for certain, I think we should add
| int and float as well. And text is of course
| the default. Also mention that implicit types
| must not start with an alphabetic character,
| and are otherwise unrestricted.
I think that the types will be defined in an appendix,
and the initial appendix should include:
- Null ~
- Int 33849
- Float 343.34[e+-234.34]
- Date 2001-03-23 and 03/23/2001 and 23 MAR 2001
- Time 11:00 AM/PM or 23:00
- Date/Time per ISO 8601
- Currency $34.00 [USD]
- Tuple (3,) or (3,4) or ('one',) with mandatory comma
We can add other ones like duration, timespan, complex
numbers, and other items later on.
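
To show why the comma is mandatory for tuples, here is a rough
Python sketch. It borrows Python's own literal syntax through
ast.literal_eval as a stand-in for a tuple grammar, which is an
assumption: the element syntax has not been defined here. The
name parse_tuple is likewise made up.

```python
import ast

def parse_tuple(text):
    """Parse "(3,)", "(3,4)" or "('one',)"; return None if not a tuple."""
    if not (text.startswith("(") and text.endswith(")")):
        return None
    try:
        value = ast.literal_eval(text)
    except (ValueError, SyntaxError, TypeError):
        return None
    # "(3)" evaluates to the int 3 and "()" has no comma at all, so
    # neither counts as a tuple under the mandatory-comma rule.
    if isinstance(value, tuple) and len(value) > 0:
        return value
    return None
```

Without the comma, "(3)" would be indistinguishable from a
parenthesized integer, which is what the rule avoids.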
| And here are the remaining open issues:
| - The separator lines. What's the verdict on that one?
| Alternatives are: (1) none - use a top-level list
| production; (2) '----'; (3) empty lines.
| - The top level production. Alternatives are (1) map -
| requires separator lines; (2) map or list; (3) node.
| My favorite: (1) no separators and (3) node as the top-level
| production, but I don't feel strongly about it. You and Brian
| agree on something and I'll write it in.
Let's put node as the top-level production. Brian objects
strongly to (2), and I like your use of new lines to
introduce carriage returns within a scalar.
| - Keys. I assume the model keeps keys as strings, that is
| that map == function string -> value, right? We don't
| want to allow (I'm using the ` notation here):
| `tuple (1,2) : &12 most
| `*12 : general
| `12 : implementation
| `~ : of
| "12" : maps
This needs some serious re-work. We need implicit and
explicit scalar values here, and given implicit scalars,
probably some way for a multi-line implicit scalar.
(230,"this is a
string", 34) : "value"
Not that I'm happy with the above, but to make the
serialization of tuples readable, you need something
like the above.
| - Character set. Someone with access to the physical Unicode
| spec should review the Character/Whitespace/Printable/Alpha
| productions. I suspect that we should simply refer to the
| Unicode "character properties" instead of explicitly listing
| the characters. Unicode keeps getting updated.
We need two character productions, one for the information
model, which includes 0x0-0x1F and one for the serialization
format which excludes 0x0-0x1F. Other than that, there is
no harm in excluding the surrogate pair range and specifying
the maximal range of Unicode -- that isn't going to change.
We can refer to the Unicode specification for other more
"specific" restrictions, but I think the general production
(qualified with a restriction reference) should be adequate.
 infchar ::= [#x0-#xD7FF]
           | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
 seqchar ::= #x9 | #xA | #xD | [#x20-#xD7FF]
           | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
Where the spec mentions that additional characters may
be excluded per the Unicode specification. Characters
in infchar but not in seqchar must be escaped.
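
The two productions translate directly into range checks. The
Python sketch below does that and escapes whatever is in
infchar but not in seqchar. The backslash-x escape form and the
function names are assumptions, since no escape syntax has been
settled here; a real escaper would also have to escape the
backslash itself.

```python
def in_infchar(cp):
    """True if the code point is allowed in the information model."""
    return (0x0 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)

def in_seqchar(cp):
    """True if the code point may appear unescaped in the serialization."""
    return (cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)

def escape_scalar(text):
    """Escape characters that are in infchar but not in seqchar."""
    out = []
    for ch in text:
        cp = ord(ch)
        if not in_infchar(cp):
            raise ValueError(f"U+{cp:04X} is outside infchar")
        # Assumed escape form: backslash, 'x', two hex digits.
        out.append(ch if in_seqchar(cp) else f"\\x{cp:02X}")
    return "".join(out)
```

Tab, line feed and carriage return pass through untouched; the
other control characters below 0x20 are the only ones escaped.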
| - Anchor/type to a reference/null: I assume these don't make
| sense. However if we treat references and nulls as "detected
| types" (now that we have this concept) I suggest we leave this
| restriction out of the production and note it when we describe
| these types.
An anchor of a null does make sense, and then a reference
to that anchor does make sense. An anchor of a reference
does not make sense, and should be restricted.
| That seems to be it. The weekend looms (actually started here).
| I could re-draft the spec this Saturday if you guys gave me the
| go-ahead and a definite word on these remaining "minor" issues.
| Should I mark it "release candidate"?
| (OK, nit-picking hat off).