Neil Watkiss [mailto:neilw@...] wrote:
> > > 1. The spec does not require a BOM for UTF-8 and it seems
> > > that this is industry practice
> > Can you provide pointers for that?
> These were the pages that started it all :)
Hmmm. None of these talk about making the BOM mandatory in UTF-8, except for
Paul Prescod's. That's not exactly "standard industry practice"...
> Now for the rest of your email. Here's how libyaml works right now:
> 1. If there's no BOM and the user has told libyaml to autodetect the
> encoding, it picks the DEFAULT_ENCODING. That has changed twice in as
> many days: it started as UTF-8, migrated to ASCII, and is currently
> ISO-8859-1.
> I have no problem changing it back :)
So far, so good.
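For concreteness, here is a rough sketch of the autodetect step in point 1,
in Python rather than libyaml's C. The names detect_encoding, decode_stream
and the DEFAULT_ENCODING value are my own placeholders, not libyaml's API;
only the BOM byte sequences are standard.

```python
import codecs

# Hypothetical default; the thread is still debating UTF-8 vs. ISO-8859-1.
DEFAULT_ENCODING = "utf-8"

# The 4-byte UTF-32 marks are checked before the 2-byte UTF-16 marks,
# because the UTF-32 LE BOM (FF FE 00 00) begins with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_encoding(data, default=DEFAULT_ENCODING):
    """Return (encoding, bom_length); fall back to `default` without a BOM."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return default, 0


def decode_stream(data, default=DEFAULT_ENCODING):
    """Decode bytes to text, dropping a leading BOM if one is present."""
    encoding, skip = detect_encoding(data, default)
    return data[skip:].decode(encoding)
```

Note that a UTF-8 BOM is accepted here but never required: a stream without
any BOM simply gets the default.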
> 2. The user may specify any of these encodings:
> YAML_ENCODING_DETECT /* detect based on BOM; falls back to DEFAULT */
> YAML_ENCODING_UTF16 /* BOM required */
> YAML_ENCODING_UTF32 /* BOM required */
> YAML_ENCODING_ASCII /* not really needed anymore */
As you point out, ASCII isn't necessary. And the spec currently doesn't
cover LATIN1 at all. I don't see why we would want to support LATIN1 - every
LATIN1 file is a file which I can't view in my Hebrew-enabled applications
(or Greek, or Russian, or any other legacy 8-bit code-page based application
- *even if it supports UTF-8 in addition*). If every YAML file were UTF-*, no
exceptions, then there would be no issue at all viewing it in any
Unicode-enabled application in the world.
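To make the code-page problem concrete, here is a small Python illustration
(the helper names views and is_valid_utf8 are just mine). The same single
byte that a Latin-1 author typed as an accented "e" shows up as a different
letter under each legacy 8-bit code page, and it is not valid UTF-8 at all:

```python
# One byte, 0xE9, as a Latin-1 author would write the last letter of "cafe"
# with an acute accent.
raw = b"caf\xe9"


def views(data):
    """Decode the same bytes under several legacy 8-bit code pages."""
    code_pages = ("iso-8859-1", "iso-8859-8", "iso-8859-7", "koi8-r")
    return {name: data.decode(name) for name in code_pages}


def is_valid_utf8(data):
    """Return True if the bytes are well-formed UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

views(raw) gives four different strings (Latin, Hebrew, Greek and Cyrillic
last letters), while is_valid_utf8(raw) is False. The UTF-8 encoding of the
same word decodes to the same text everywhere.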
> If we change the default back to UTF-8, then the only way libyaml will
> accept Latin-1 input is for the user to specify it as such. This is a
> significant "barrier to entry", if you will.
This is the point I don't get. How does insisting on UTF-8 create a barrier to
entry? It isn't as if you have LATIN1 YAML files you have to deal with.
> 3. As a special encouragement to use UTF encodings, I'll sprinkle 'sleep'
> calls randomly throughout the Latin-1 and ASCII transformations. :)
Yeah, right :-)
> > 4. The fact is that in western Europe and the USA it is easier to work
> > with iso-latin files than with UTF-8 files. However, I expect that in
> > time, as operating systems are migrating to Unicode, this will cease to
> > be the case and UTF-8 would become the default encoding of text files.
> > I don't want YAML to carry baggage from "the bad old days" when this
> > happens.
> Hogwash! So far, every Loader being implemented (Perl, Python, Java)
> is sitting in the middle of a rich Unicode library. Perl's native
> string format is UTF-8; Python's and Java's are UTF-16. There is no
> problem with UTF.
Exactly my point; it doesn't hurt you much (or at all) to just use UTF-8, so
there's no need to support LATIN1 (or Hebrew, or Greek, or Russian, or ...).
> > So, I'd rather just keep things as they are. If you want to change the
> > way the spec handles encoding, a better focus would be "how to
> > concatenate UTF-16 streams".
> Are you referring to accidentally introducing BOMs in the middle of
> the content?
Yes. Now that the issue has been raised, I'm not certain what a YAML parser
should do when seeing a BOM in the stream. The current spec implies that it
would be treated in the same way as any other printable Unicode character
(that is, in the same way as the character 'z'). Which means that it can't
appear just before a document separator. However, if one concatenates YAML
streams which start with a BOM, this will happen.
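A quick way to see this happen, sketched in Python (concat_utf16 is just an
illustrative helper, standing in for something like "cat a.yaml b.yaml"):

```python
def concat_utf16(*docs):
    """Encode each document separately as UTF-16 with a BOM, then join
    the raw bytes the way a byte-level file concatenation would."""
    return b"".join(d.encode("utf-16") for d in docs)


joined = concat_utf16("--- first\n", "--- second\n")

# The decoder consumes only the leading BOM; the second one survives as
# the character U+FEFF sitting right before the second '---'.
text = joined.decode("utf-16")
```

Here text comes out as "--- first\n\ufeff--- second\n", so the second
separator is preceded by a stray U+FEFF.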
So, we can:
1. Make it illegal (today's spec);
2. Make "BOM '-' '-' '-'" a legal separator, as long as the encoding isn't
   changed;
3. Allow this BOM to specify an in-stream change of encoding.
I think option (2) makes the most sense. Thoughts?
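Roughly what I have in mind for option (2), as a toy Python sketch. This is
not a YAML parser and not spec text; split_documents and its regular
expression are my own simplification, and it works on already-decoded text
so the encoding cannot change mid-stream:

```python
import re

# Option (2): an optional BOM (U+FEFF) may come directly before '---' at
# the start of a line. A BOM anywhere else is not treated specially.
_SEPARATOR = re.compile(r"^\ufeff?---(?:[ \t]+|$)", re.MULTILINE)


def split_documents(text):
    """Split decoded text into document bodies at BOM-tolerant '---' lines."""
    parts = _SEPARATOR.split(text)
    # Drop the empty piece before the first separator, if any.
    return [p for p in parts if p.strip()]
```

With this, "--- first\n\ufeff--- second\n" splits into two documents, while
a BOM in the middle of a line is left in the content untouched.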
This still leaves open the question of whether a BOM is *required* for UTF-8
files. I still see no good reason to require one.