From: Brian I. <in...@tt...> - 2004-09-07 02:09:08
|
OK. On to the next topic.

Oren asked Clark and me to come to some resolution on whether or not YAML directives were allowed after the first document in a stream. Clark and I had a long talk on the phone and went both ways on the issue. We finally decided to propose the following:

- Every document in a stream can be preceded by what is called a "prologue".
- The prologue consists of an optional BOM, followed by a series of zero or more directives.
- Comment lines can be interspersed throughout the prologue.
- A prologue with no BOM means UTF-8 is used for encoding.
- A prologue terminates the previous document.
- All documents that have a prologue must start with the '---' header token.
- All documents that have no prologue use the prologue of the document before them.
- The default prologue is no BOM and no directives.
- A new prologue completely eliminates the previous prologue. There are no additive properties.
- A %YAML version directive is required in a prologue that contains any other directive (like %TAG).
- Directives always begin with '%' in column zero, which is otherwise forbidden.
- One directive per line.
- We get rid of all directives in the *document* header. Directives only happen at the stream level now. The old style is a parse error under 1.1.

The basic use case is that most handcrafted YAML streams will have one prologue at the top. This proposal lets us concatenate any of these and still get what we want.

Here's a specific. Suppose it becomes popular to make streams containing a schema doc followed by instance docs. If you were handcrafting the stream you could put all your TAG directives for both schema and instances at the top (as long as you use different prefixes). But let's say that you keep the schema in one file, and collections of instances in other files. Now you can concatenate the schema file onto the top of the instances right before you ship it off, and everything will work.

Cheers, Brian

PS One wart I can see is that you can't have a headless "config file" using UTF-16 (i.e. a BOM). Not a big deal. There may be other warts.
|
From: David H. <dav...@bl...> - 2004-09-07 03:52:20
|
Brian Ingerson wrote:
> - Every document in a stream can be preceded by what is called a
>   "prologue".
> - The prologue consists of an optional BOM, followed by a series of
>   zero or more directives.
> - Comment lines can be interspersed throughout the prologue.
> - A prologue with no BOM means UTF-8 is used for encoding.

I'm extremely dubious about allowing different character encodings within a stream. Most programming languages' stream classes are not designed to handle this; at least not without heroic efforts.

I realise that the motivation for this is to be able to concatenate arbitrary documents as sequences of bytes. However, it is almost always the case in practice that there is a "preferred encoding" -- typically either UTF-8 or native-order UTF-16 -- for any given environment (language or YAML API), and that documents in the "wrong" encoding or byte order are converted to the preferred one. In that case there is no problem with concatenating documents as sequences of characters. Allowing mixed encodings will IMHO complicate more cases than it simplifies.

--
David Hopwood <dav...@bl...>
|
From: Brian I. <in...@tt...> - 2004-09-07 06:03:45
|
On 07/09/04 04:52 +0100, David Hopwood wrote:
> Brian Ingerson wrote:
> >- Every document in a stream can be preceded by what is called a
> >  "prologue".
> >- The prologue consists of an optional BOM, followed by a series of
> >  zero or more directives.
> >- Comment lines can be interspersed throughout the prologue.
> >- A prologue with no BOM means UTF-8 is used for encoding.
>
> I'm extremely dubious about allowing different character encodings
> within a stream. Most programming languages' stream classes are not
> designed to handle this; at least not without heroic efforts.
>
> I realise that the motivation for this is to be able to concatenate
> arbitrary documents as sequences of bytes. However, it is almost always
> the case in practice that there is a "preferred encoding" -- typically
> either UTF-8 or native-order UTF-16 -- for any given environment
> (language or YAML API), and that documents in the "wrong" encoding or
> byte order are converted to the preferred one. In that case there is no
> problem with concatenating documents as sequences of characters. Allowing
> mixed encodings will IMHO complicate more cases than it simplifies.

Good point. I just chatted with Oren. We both support the prologue concept I outlined previously, with the addition that it is an error to change character encodings. In other words, you can only concatenate streams of the same encoding.

Cheers, Brian
|
From: Clark C. E. <cc...@cl...> - 2004-09-11 21:36:44
|
summary:

  This is a summary of Brian's 'prologue' proposal, taking into account feedback and going through it with a fine-tooth comb.

  With the introduction of YAML 1.1 we allow a set of instructions to the YAML Processor, called directives, to apply to more than one document. These directives are clumped together into what is called a 'prologue'.

detail:

  - A YAML 'prologue' is defined as an optional byte order mark (BOM), followed by zero or more directives, followed by an optional document start token ('---'). Comments can be interspersed throughout a prologue.

  - Documents within a YAML stream always begin with a prologue. If there is more than one document in a YAML stream, the document start token is mandatory for all prologues.

  - Directives always begin with '%' in column zero and span exactly one line. Old-style directives in the document header are a parse error. A comment may appear on the same line as a directive, indicated by '# '.

  - The YAML stream has a single character encoding determined by an optional BOM character in the first prologue of the stream. If this BOM character is missing, the UTF-8 encoding is implied. Further, the BOM of subsequent prologues must match the BOM of the first prologue (if any).

  - Within a prologue, if any directive is given, the document start token ('---') and the %YAML version directive are required.

  - The 'active' set of directives for a given document comes from the most recent prologue that has directives. This is not additive: if a prologue has a directive, then subsequent documents use only the directives from that prologue, and not directives from previous prologues.
|
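The 'active directives' rule in the summary above can be sketched in Python. This is only an illustration of the proposal as worded, not a YAML parser: the function name `active_directives` and the tag URI in the usage note are hypothetical, the input is assumed to be already-decoded text, and the sketch does not enforce the rule that '---' is mandatory once a stream holds more than one document.

```python
def is_doc_start(line):
    """True for a bare '---' line or '---' followed by a space."""
    return line == "---" or line.startswith("--- ")


def active_directives(stream_text):
    """Return one (directives, body_lines) pair per document.

    Assumption: a document inherits the directives of the most recent
    prologue that has any, and a new prologue replaces them wholesale.
    """
    documents = []
    active = []     # directives currently in force
    pending = []    # directives of the prologue being read
    body = None     # lines of the current document; None while in a prologue
    for line in stream_text.splitlines():
        if line.startswith("%"):
            # A directive in column zero ends the previous document.
            if body is not None:
                documents.append((active, body))
                body = None
            # Drop a trailing '# ' comment on the directive line.
            pending.append(line.split(" #")[0].rstrip())
        elif is_doc_start(line):
            if body is not None:
                documents.append((active, body))
            if pending:
                if not any(d.startswith("%YAML") for d in pending):
                    raise ValueError("prologue with directives needs %YAML")
                active, pending = pending, []   # not additive: replace
            body = []
            rest = line[3:].strip()
            if rest:
                body.append(rest)
        elif body is None and (not line.strip() or line.startswith("#")):
            continue    # blank or comment line inside a prologue
        else:
            if pending:
                raise ValueError("directives must be followed by '---'")
            if body is None:
                body = []   # headless first document
            body.append(line)
    if pending:
        raise ValueError("directives must be followed by '---'")
    if body is not None:
        documents.append((active, body))
    return documents
```

For a stream whose first prologue carries `%YAML 1.1` and a `%TAG` line, followed by two documents and then a second prologue holding only `%YAML 1.1`, the first two documents share the first directive set and the third sees only `%YAML 1.1`.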
From: David H. <dav...@bl...> - 2004-09-12 00:05:24
|
Clark C. Evans wrote:
> summary:
>
>   This is a summary of Brian's 'prologue' proposal, taking into
>   account feedback and going through with a fine comb.
>
>   With the introduction of YAML 1.1 we allow a set of instructions
>   to the YAML Processor, called directives, to apply to more than
>   one document. These directives are clumped together into what
>   is called a 'prologue'.
>
> detail:
>
>   - A YAML 'prologue' is defined as an optional byte order mark
>     (BOM), followed by zero or more directives, followed by an
>     optional document start token ('---'). Comments can be
>     interspersed throught a prologue.
>
>   - Documents within a YAML stream always begin with a prologue.
>     If there are more than one document in a YAML stream, the
>     document start token is mandatory for all prologues.
>
>   - Directives always begin with '%' in column zero and span exactly
>     one line. Old-style directives in the document header are a
>     parse error. A comment may appear on the same line as a directive,
>     indicated by '# '.
>
>   - The YAML stream has a single character encoding determined by
>     an optional BOM character in the first prologue of the stream.
>     If this BOM character is missing, the UTF-8 encoding is
>     implied.

Always UTF-8, even though UTF-16 can be auto-detected despite the missing BOM?

  First two bytes   Encoding
  ---------------------------
  FE FF             UTF-16BE
  00 xx             UTF-16BE (xx != 00)
  FF FE             UTF-16LE
  xx 00             UTF-16LE (xx != 00)
  otherwise         UTF-8

(Strictly speaking, this doesn't work for a document without any directives or "---" token, where the first character is non-ISO-Latin-1 and the encoding is UTF-16LE or UTF-16BE. If the "---" were required then it would always work.)

>     Further, the BOM of subsequent prologues must match
>     the BOM of the first prologue (if any).

I assume that subsequent prologues can include or not include a BOM independently of whether the first prologue did, as long as the entire stream is in the same encoding?

>   - Within a prologue, if any directive is given, the document
>     start token ('---') and the %YAML version directive are
>     required.
>
>   - The 'active' set of directives for a given document comes from
>     the most recent prologue that has directives. This is not
>     additive, if a prologue has a directive, then subsequent
>     documents use only the directives from that prologue, and
>     not directives from previous prologues.

Sounds good to me.

--
David Hopwood <dav...@bl...>
|
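The two-byte table in the message above translates directly into code. The following is a minimal sketch of that table only; the name `sniff_encoding` is hypothetical, and the return values are Python codec names chosen for convenience rather than anything the proposal specifies.

```python
def sniff_encoding(data: bytes) -> str:
    """Guess the stream encoding from its first two bytes, per the table."""
    first = data[:2]
    if len(first) < 2:
        return "utf-8"          # too short to tell; fall back to the default
    a, b = first
    if first == b"\xfe\xff":
        return "utf-16-be"      # big-endian BOM
    if first == b"\xff\xfe":
        return "utf-16-le"      # little-endian BOM
    if a == 0 and b != 0:
        return "utf-16-be"      # 00 xx: Latin-1 range character, big-endian
    if b == 0 and a != 0:
        return "utf-16-le"      # xx 00: Latin-1 range character, little-endian
    return "utf-8"


# A BOM-less stream that starts with '---' is recognised in either byte order.
print(sniff_encoding("---\n".encode("utf-16-be")))
print(sniff_encoding("---\n".encode("utf-16-le")))

# The caveat from the message: a BOM-less UTF-16 stream whose first character
# lies outside ISO Latin-1 (here U+65E5) falls through to the UTF-8 default.
print(sniff_encoding("\u65e5: 1\n".encode("utf-16-le")))
```

The last call shows why the table is only reliable if the first character is constrained, for instance by making '---' mandatory.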
From: Oren Ben-K. <or...@be...> - 2004-09-12 04:03:30
|
On Sunday 12 September 2004 03:05, David Hopwood wrote:
> > - The YAML stream has a single character encoding determined by
> >   an optional BOM character in the first prologue of the stream.
> >   If this BOM character is missing, the UTF-8 encoding is
> >   implied.
>
> Always UTF-8,

Yes.

> even though UTF-16 can be auto-detected despite the
> missing BOM?

It can't, because the first character of a YAML stream may be any (printable) Unicode character. It need not be an ASCII character.

Have fun,

Oren Ben-Kiki
|
From: David H. <dav...@bl...> - 2004-09-12 16:00:49
|
Oren Ben-Kiki wrote:
> On Sunday 12 September 2004 03:05, David Hopwood wrote:
>
>>>- The YAML stream has a single character encoding determined by
>>>  an optional BOM character in the first prologue of the stream.
>>>  If this BOM character is missing, the UTF-8 encoding is
>>>  implied.
>>
>>Always UTF-8,
>
> Yes.
>
>>even though UTF-16 can be auto-detected despite the
>>missing BOM?
>
> It can't, because the first character of a YAML stream may be any
> (printable) Unicode character. It need not be an ASCII character.

True, but that's because the '---' is optional for the first document of the stream. Why is it optional?

--
David Hopwood <dav...@bl...>
|
From: Oren Ben-K. <or...@be...> - 2004-09-12 16:20:27
|
On Sunday 12 September 2004 19:00, David Hopwood wrote:
> >>even though UTF-16 can be auto-detected despite the
> >>missing BOM?
> >
> > It can't, because the first character of a YAML stream may be any
> > (printable) Unicode character. It need not be an ASCII character.
>
> True, but that's because the '---' is optional for the first document
> of the stream. Why is it optional?

Well, '---' is nice in the context of a stream, when you have multiple documents per file. In such a use case, "good style" would probably be to have '---' at the start of the first document too. But there are also the people who write, say, an Apache config file, and they couldn't care less about multiple documents per stream and would resent having to have a '---' at the start of their config file just because it is "good style" in log files to have it.

Making the '---' optional is the simplest way to allow each group to have its way. YAML is full of this sort of "win-win" mechanism... the %TAG and implicit typing debate is just another example in a long, long series of this sort of debate...

Have fun,

Oren Ben-Kiki
|
From: Clark C. E. <cc...@cl...> - 2004-09-12 16:22:30
|
On Sun, Sep 12, 2004 at 05:00:44PM +0100, David Hopwood wrote:
| True, but that's because the '---' is optional for the first document
| of the stream. Why is it optional?

Ingy says so. ;) However, with the current spec, it can also start with '#' (besides the plain scalar), and with this proposal for 1.1, it could start with '%' for directives.

So, the question is, if limited to these three start characters

  % # -

would it be possible to auto-detect UTF16? Equivalently, if you take the BE and LE byte layouts for the above characters, are they legal UTF8 for the plain scalar production? If not, then I suppose something like this could be possible.

Clark
|
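The question in the message above can be checked mechanically. The sketch below, with the hypothetical name `utf16_start_patterns`, encodes each of the three start characters in both UTF-16 byte orders and tests whether the resulting bytes also decode as UTF-8. It makes no claim about how a YAML processor should behave; the one YAML-specific point, stated here as an assumption, is that U+0000 is not an allowed printable character.

```python
START_CHARS = "%#-"


def utf16_start_patterns():
    """For each start character and byte order, report its UTF-16 bytes,
    whether they are well-formed UTF-8, and whether they contain NUL."""
    rows = []
    for ch in START_CHARS:
        for codec in ("utf-16-be", "utf-16-le"):
            raw = ch.encode(codec)          # two bytes, one of them 0x00
            try:
                as_utf8 = raw.decode("utf-8")
                valid_utf8 = True
            except UnicodeDecodeError:
                as_utf8 = ""
                valid_utf8 = False
            rows.append((ch, codec, raw, valid_utf8, "\x00" in as_utf8))
    return rows


for ch, codec, raw, valid_utf8, has_nul in utf16_start_patterns():
    print(ch, codec, raw.hex(" "), valid_utf8, has_nul)
```

Every byte pair is well-formed UTF-8 at the byte level, but only because it decodes to a NUL next to the ASCII character. Under the assumption above, such a sequence could not begin a valid UTF-8 YAML stream, so the two cases would remain distinguishable if the first character were limited to these three.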
From: Damian C. <dam...@gm...> - 2004-09-13 13:13:31
|
On Sat, 11 Sep 2004 17:36:43 -0400, Clark C. Evans <cc...@cl...> wrote:
>
>   - A YAML 'prologue' is defined as an optional byte order mark
>     (BOM), followed by zero or more directives, followed by an
>     optional document start token ('---'). [...]

I was going to say I don't think the BOM should be considered part of the prologue, because it is a detail of the encoding (with UTF-16, technically the BOM at the start of the byte sequence is not part of the character data). Most of the YAML syntax definition is in terms of characters, not bytes; the encoding of the character data as a byte sequence (using UTF-16 or UTF-8) is a separate matter.

Then I remembered that you want to be able to concatenate separate files together without needing to strip off the BOM, which is why it is convenient to allow a spurious character U+FEFF to be skipped at the start of each document in the stream. Which is fair enough.

Technically speaking, UTF-16 data is permitted to have no BOM (it is assumed to be in big-endian order). We probably want to forbid that, in case there is some contrived UTF-16BE sequence that collides with UTF-8...

The spec should also allow for APIs to be fed a character string (sequence of Unicode characters) as opposed to a byte string, and to not insist on spurious BOMs being added in this case.

>     Further, the BOM of subsequent prologues must match
>     the BOM of the first prologue (if any).

You should probably insert '(if any)' there so that subsequent documents are not *required* to have a BOM just because the stream as a whole does.

>   - The 'active' set of directives for a given document comes from
>     the most recent prologue that has directives. This is not
>     additive, if a prologue has a directive, then subsequent
>     documents use only the directives from that prologue, and
>     not directives from previous prologues.

If you are concatenating streams, you will need to take note of streams lacking directives and insert directives that cancel out any directives in force from the preceding stream. This will not be too hard, but will be a lot more bother than chopping off the spurious BOM would have been...! :-)

-- Damian

--
Damian Cugley, Alleged Literature http://www.alleged.org.uk/pdc/
|
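The bookkeeping described in the last paragraph above might look like the following sketch. It rests on one reading of the summary that the thread does not confirm: that a prologue containing only `%YAML 1.1` counts as "a prologue that has directives" and therefore cancels any earlier `%TAG` lines. The names `concat_streams` and `RESET`, and the tag URI in the test data, are hypothetical, and the inputs are assumed to be already-decoded character strings.

```python
RESET = "%YAML 1.1\n"   # assumed to act as a directive-cancelling prologue


def first_significant_line(text):
    """First line that is neither blank nor a comment, or '' if none."""
    for line in text.splitlines():
        if line.strip() and not line.startswith("#"):
            return line
    return ""


def concat_streams(streams):
    """Join decoded YAML streams so that no stream inherits directives
    from the one before it."""
    out = []
    directives_in_force = False
    for text in streams:
        text = text.lstrip("\ufeff")        # chop off a spurious BOM
        if not text.strip():
            continue                        # nothing to add
        if not text.endswith("\n"):
            text += "\n"
        # '%' in column zero is reserved for directives under the proposal.
        own = any(line.startswith("%") for line in text.splitlines())
        head = first_significant_line(text)
        if not head.startswith("%"):
            if not (head == "---" or head.startswith("--- ")):
                text = "---\n" + text       # headless: add the start token
            if directives_in_force:
                text = RESET + text         # cancel the previous directives
        directives_in_force = own
        out.append(text)
    return "".join(out)
```

Concatenating a schema stream that declares a `%TAG` with a headless instance stream then yields the schema stream unchanged, followed by `%YAML 1.1`, `---`, and the instance document.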