RE: [Yaml-core] open issues: 8-bit BOM and lookahead


Clark C. Evans [14/04/02 21:34 -0400]:
> Ok.  In the last week we seem to have two open issues:
> 
> 1. The spec does not require a BOM for UTF-8 and it seems
>    that this is industry practice

Can you provide pointers for that?

>    so that legacy encodings
>    can be handled.  Also, without requiring UTF-8, it is 
>    a bit harder to do a #ENCODING:ISO8859-1 at a later date,
>    for example.  

Kittens! :-)

This is pretty horrible...

1. It encourages people to continue using non-UTF encodings. As someone in
Israel, who has to handle Hebrew files, work with western *and eastern*
European people, and people in the far east, the lure of "just working with
UTF" is something I greatly appreciate. I do *not* want to go the XML way
where one has to go hunt for the Greek (ISO) encoding or whatever.

2. *There is no legacy YAML data*. I realize that a lot of people have
legacy non-YAML data. But all YAML files will be created either by
hand-editing or using emitters, and I don't see the problem in requiring all
such new files to be in UTF-8.

3. The Unicode FAQ has the following entry:

"It has also been suggested to use the UTF-8 encoded BOM (0xEF 0xBB 0xBF) as
a signature to mark the beginning of a UTF-8 file. This practice should
definitely not be used on POSIX systems for several reasons: 

On POSIX systems, the locale and not magic file type codes define the
encoding of plain text files. Mixing the two concepts would add a lot of
complexity and break existing functionality. 

Adding a UTF-8 signature at the start of a file would interfere with many
established conventions such as the kernel looking for "#!" at the beginning
of a plaintext executable to locate the appropriate interpreter.

Handling BOMs properly would add undesirable complexity even to simple
programs like cat or grep that mix contents of several files into one."

I happen to like POSIX :-) All the above reasons apply to YAML files (yes,
even the "#!" one).
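To make the "#!" and cat points concrete, here is a minimal Python sketch. It
is my own illustration, not something from the FAQ: the interpreter name
"yaml-tool" and the helper names are made up, and the "#!" check only
approximates what a kernel does.

```python
import codecs

BOM = codecs.BOM_UTF8  # the three bytes 0xEF 0xBB 0xBF


def starts_with_shebang(data: bytes) -> bool:
    # Approximates the kernel check: the very first two bytes must be "#!".
    return data[:2] == b"#!"


def naive_concat(*chunks: bytes) -> bytes:
    # What cat does: join the bytes, knowing nothing about encodings.
    return b"".join(chunks)


# "yaml-tool" is a hypothetical interpreter name.
plain = b"#!/usr/bin/env yaml-tool\nkey: value\n"
signed = BOM + plain

print(starts_with_shebang(plain))   # no signature: "#!" is found
print(starts_with_shebang(signed))  # signature hides the "#!"

# Concatenating two signed files leaves a signature in the middle.
joined = naive_concat(signed, signed)
text = joined.decode("utf-8-sig")   # strips only the leading signature
print(text.count("\ufeff"))         # stray U+FEFF characters left in the text
```

Even a decoder that knows about the signature only strips it at the start, so
every tool that merges files would have to learn about it too.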

4. The fact is that in western Europe and the USA it is currently easier to
work with ISO Latin files than with UTF-8 files. However, I expect that in
time, as operating systems migrate to Unicode, this will cease to be the case
and UTF-8 will become the default encoding of text files. I don't want YAML
to carry baggage from "the bad old days" when this happens.

5. As for the #ENCODING tag, it raises really nasty issues, as Neil has
pointed out. We'll be regretting this directive for years.

So, I'd rather just keep things as they are. If you want to change the way
the spec handles encoding, a better focus would be "how to concatenate
UTF-16 streams".
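As a rough illustration of why UTF-16 concatenation deserves attention, here
is a small Python sketch. It is only my own example under simple assumptions
(two tiny documents, each written with its own BOM, joined byte-for-byte); it
shows the problem and does not propose a fix.

```python
import codecs

doc = "a: 1\n"

# The same document written as two UTF-16 streams, each with its own BOM.
le = codecs.BOM_UTF16_LE + doc.encode("utf-16-le")
be = codecs.BOM_UTF16_BE + doc.encode("utf-16-be")

# Same byte order: the second BOM survives as a U+FEFF character mid-text.
same = (le + le).decode("utf-16")
print(repr(same))

# Mixed byte order: the decoder keeps the first stream's byte order, so the
# second stream is read with its bytes swapped and comes out as garbage.
mixed = (le + be).decode("utf-16")
print(mixed == doc + doc)
```

A reader that wants to get this right has to notice each embedded BOM and
switch byte order at that point, which is exactly the kind of rule the spec
would need to spell out.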

Have fun,

	Oren Ben-Kiki