From: Oren Ben-K. <or...@ri...> - 2002-04-16 08:41:00
Clark C . Evans [14/04/02 21:34 -0400]:
> Ok. In the last week we seem to have two open issues:
>
> 1. The spec does not require a BOM for UTF-8 and it seems
> that this is industry practice

Can you provide pointers for that?

> so that legacy encodings
> can be handled. Also, without requiring UTF-8, it is
> a bit harder to do a #ENCODING:ISO8859-1 at a later date,
> for example.

Kittens! :-) This is pretty horrible...

1. It encourages people to continue using non-UTF encodings. As someone
   in Israel, who has to handle Hebrew files, work with western *and
   eastern* European people, and people in the far east, the lure of
   "just working with UTF" is something I greatly appreciate. I do *not*
   want to go the XML way where one has to go hunt for the Greek (ISO)
   encoding or whatever.

2. *There is no legacy YAML data*. I realize that a lot of people have
   legacy non-YAML data. But all YAML files will be created either by
   hand-editing or using emitters, and I don't see the problem in
   requiring all such new files to be in UTF-8.

3. The Unicode FAQ has the following entry:

   "It has also been suggested to use the UTF-8 encoded BOM
   (0xEF 0xBB 0xBF) as a signature to mark the beginning of a UTF-8
   file. This practice should definitely not be used on POSIX systems
   for several reasons:

   - On POSIX systems, the locale and not magic file type codes define
     the encoding of plain text files. Mixing the two concepts would
     add a lot of complexity and break existing functionality.

   - Adding a UTF-8 signature at the start of a file would interfere
     with many established conventions such as the kernel looking for
     "#!" at the beginning of a plaintext executable to locate the
     appropriate interpreter.

   - Handling BOMs properly would add undesirable complexity even to
     simple programs like cat or grep that mix contents of several
     files into one."

   I happen to like POSIX :-) All the above reasons apply to YAML files
   (yes, even the "#!" one).

4. The fact is that in western Europe and the USA it is easier to work
   with iso-latin files than with UTF-8 files. However, I expect that
   in time, as operating systems migrate to Unicode, this will cease to
   be the case and UTF-8 will become the default encoding of text
   files. I don't want YAML to carry baggage from "the bad old days"
   when this happens.

5. As for the #ENCODING tag, it raises really nasty issues, as Neil has
   pointed out. We'll be regretting this directive for years.

So, I'd rather just keep things as they are. If you want to change the
way the spec handles encoding, a better focus would be "how to
concatenate UTF-16 streams".

Have fun,

    Oren Ben-Kiki
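
The two POSIX problems quoted in point 3 can be reproduced with a handful of bytes. What follows is only a minimal sketch, assuming Python 3; the helper names `has_shebang` and `concatenate` and the interpreter path `yaml-tool` are hypothetical, made up for illustration, and the check only approximates what a kernel does when it looks for "#!".

```python
import codecs

# The UTF-8 encoded BOM: the three bytes 0xEF 0xBB 0xBF.
BOM = codecs.BOM_UTF8


def has_shebang(data: bytes) -> bool:
    """Approximate the kernel's check: the first two bytes must be '#!'."""
    return data[:2] == b"#!"


def concatenate(*chunks: bytes) -> bytes:
    """Naive byte-level concatenation, in the spirit of `cat a b`."""
    return b"".join(chunks)


# A script whose interpreter line names a hypothetical tool.
script = b"#!/usr/bin/env yaml-tool\nkey: value\n"

plain_ok = has_shebang(script)       # the file starts with '#!'
bom_ok = has_shebang(BOM + script)   # the BOM pushes '#!' to offset 3

# Two BOM-prefixed files joined byte for byte.
joined = concatenate(BOM + b"a: 1\n", BOM + b"b: 2\n")
stray_boms = joined.count(BOM) - 1   # BOMs beyond the leading one
decoded = joined.decode("utf-8")     # plain 'utf-8' keeps BOMs as U+FEFF
bom_in_middle = "\ufeff" in decoded[1:]

print(plain_ok, bom_ok, stray_boms, bom_in_middle)
```

The second file's signature survives concatenation as a U+FEFF character in the middle of the stream, so a reader would have to recognize and skip it there as well as at the start.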