From: Oren Ben-K. <or...@ri...> - 2002-04-16 08:41:00
Clark C . Evans [14/04/02 21:34 -0400]:
> Ok. In the last week we seem to have two open issues:
>
> 1. The spec does not require a BOM for UTF-8 and it seems
>    that this is industry practice

Can you provide pointers for that?

> so that legacy encodings
> can be handled. Also, without requiring UTF-8, it is
> a bit harder to do a #ENCODING:ISO8859-1 at a later date,
> for example.

Kittens! :-) This is pretty horrible...

1. It encourages people to continue using non-UTF encodings. As someone in
Israel, who has to handle Hebrew files, work with western *and eastern*
European people, and people in the far east, the lure of "just working with
UTF" is something I greatly appreciate. I do *not* want to go the XML way
where one has to go hunt for the Greek (ISO) encoding or whatever.

2. *There is no legacy YAML data*. I realize that a lot of people have
legacy non-YAML data. But all YAML files will be created either by
hand-editing or using emitters, and I don't see the problem in requiring all
such new files to be in UTF-8.

3. The Unicode FAQ has this following entry:

"It has also been suggested to use the UTF-8 encoded BOM (0xEF 0xBB 0xBF) as
a signature to mark the beginning of a UTF-8 file. This practice should
definitely not be used on POSIX systems for several reasons:

On POSIX systems, the locale and not magic file type codes define the
encoding of plain text files. Mixing the two concepts would add a lot of
complexity and break existing functionality.

Adding a UTF-8 signature at the start of a file would interfere with many
established conventions such as the kernel looking for "#!" at the beginning
of a plaintext executable to locate the appropriate interpreter.

Handling BOMs properly would add undesirable complexity even to simple
programs like cat or grep that mix contents of several files into one."

I happen to like POSIX :-) All the above reasons apply to YAML files (yes,
even the "#!" one).

4. The fact is that in western Europe and the USA it is easier to work with
iso-latin files than with UTF-8 files. However, I expect that in time, as
operating systems are migrating to Unicode, this will cease to be the case
and UTF-8 would become the default encoding of text files. I don't want YAML
to carry baggage from "the bad old days" when this happens.

5. As for the #ENCODING tag, it raises really nasty issues, as Neil has
pointed out. We'll be regretting this directive for years.

So, I'd rather just keep things as they are. If you want to change the way
the spec handles encoding, a better focus would be "how to concatenate
UTF-16 streams".

Have fun,

Oren Ben-Kiki
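The "#!" argument in point 3 can be illustrated with a minimal Python sketch. It is not from the thread: the helper `has_shebang` and the interpreter name `yaml-tool` are made up for illustration, and the check only mimics what a kernel does when it inspects the first two bytes of an executable.

```python
import codecs


def has_shebang(data: bytes) -> bool:
    """Return True if the raw bytes begin with the two-byte "#!" marker."""
    return data[:2] == b"#!"


# Hypothetical YAML "script"; the interpreter name is an assumption.
script = b"#!/usr/bin/env yaml-tool\n--- hello\n"

# The same file with a UTF-8 signature (EF BB BF) prepended.
with_bom = codecs.BOM_UTF8 + script

print(has_shebang(script))    # True
print(has_shebang(with_bom))  # False
```

With the signature in front, the first two bytes are 0xEF 0xBB rather than "#!", so the file would no longer be recognized as an interpreter script.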
From: Oren Ben-K. <or...@ri...> - 2002-04-16 10:17:34
Neil Watkiss [mailto:neilw@ActiveState.com] wrote:

> > > 1. The spec does not require a BOM for UTF-8 and it seems
> > >    that this is industry practice
> >
> > Can you provide pointers for that?
>
> These were the pages that started it all :)
>
> http://msdn.microsoft.com/library/en-us/intl/unicode_42jv.asp?frame=true
> http://mail.python.org/pipermail/i18n-sig/2001-May/000883.html
> http://www.unicode.org/unicode/faq/utf_bom.html

Hmmm. None of these talk about making BOM mandatory in UTF-8, except for
Paul Prescod. Not exactly "standard industry practice"...

> Now for the rest of your email. Here's how libyaml works right now:
>
> 1. If there's no BOM and the user has told libyaml to autodetect the
>    encoding, it picks the DEFAULT_ENCODING. That has changed twice in as
>    many days: it started as UTF-8, migrated to ASCII, and is currently
>    ISO-8859-1. I have no problem changing it back :)

So far, so good.

> 2. The user may specify any of these encodings:
>
>    YAML_ENCODING_DETECT   /* detect based on BOM; falls back to DEFAULT */
>    YAML_ENCODING_UTF8
>    YAML_ENCODING_UTF16    /* BOM required */
>    YAML_ENCODING_UTF32    /* BOM required */
>    YAML_ENCODING_UTF16BE
>    YAML_ENCODING_UTF16LE
>    YAML_ENCODING_UTF32BE
>    YAML_ENCODING_UTF32LE
>    YAML_ENCODING_ASCII    /* not really needed anymore */
>    YAML_ENCODING_LATIN1

As you point out, ASCII isn't necessary. And the spec currently doesn't
cover LATIN1 at all. I don't see why we would want to support LATIN1 - every
LATIN1 file is a file which I can't view in my Hebrew-enabled applications
(or Greek, or Russian, or any other legacy 8-bit code-page based
application - *even if it supports UTF-8 in addition*). If every YAML file
was UTF-*, no exceptions, then there would be no issue at all viewing it on
any Unicode-enabled application in the world.

>    If we change the default back to UTF-8, then the only way libyaml will
>    accept Latin-1 input is for the user to specify it as such. This is a
>    significant "barrier to entry", if you will.

This is the point I don't get. How does insisting on UTF-8 make a barrier to
entry? It isn't as if you have LATIN1 YAML files you have to deal with.

> 3. As a special encouragement to use UTF encodings, I'll sprinkle 'sleep'
>    calls randomly throughout the Latin-1 and ASCII transformations. :)

Yeah, right :-)

> > 4. The fact is that in western Europe and the USA it is easier to work
> > with iso-latin files than with UTF-8 files. However, I expect that in
> > time, as operating systems are migrating to Unicode, this will cease to
> > be the case and UTF-8 would become the default encoding of text files.
> > I don't want YAML to carry baggage from "the bad old days" when this
> > happens.
>
> Hogwash! So far, every Loader being implemented (Perl, Python, Java)
> is sitting in the middle of a rich Unicode library. Perl's native
> string format is UTF-8; Python's and Java's are UTF-16. There is no
> problem with UTF.

Exactly my point; it doesn't hurt you much (or any) to just use UTF-8, so
there's no need to support LATIN1 (or Hebrew, or Greek, or Russian, or ...).

> > So, I'd rather just keep things as they are. If you want to change the
> > way the spec handles encoding, a better focus would be "how to
> > concatenate UTF-16 streams".
>
> Are you referring to accidentally introducing BOMs in the middle of
> the content?

Yes. Now that the issue has been raised, I'm not certain what a YAML parser
should do when seeing a BOM in the stream. The current spec implies that it
would be treated in the same way as any other printable Unicode character
(that is, in the same way as the character 'z'). Which means that it can't
appear just before a document separator. However, if one concatenates YAML
streams which start with a BOM, this will happen.

So, we can:
1. Make it illegal (today's spec);
2. Make "BOM '-' '-' '-'" a legal separator, as long as the encoding isn't
   changed;
3. Allow this BOM to specify an in-stream change of encoding.

I think option (2) makes the most sense. Thoughts?

This still leaves open the question of whether a BOM is *required* for UTF-8
files. I still see no good reason it should be (required).

Have fun,

Oren Ben-Kiki
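Option (2) above might look roughly like the following Python sketch. It is only an illustration under simplifying assumptions, not spec text or any parser's real behavior: it works on already-decoded text, it assumes the "---" separator stands alone on its line, and the name `split_documents` is invented here. A byte-swapped BOM would decode to U+FFFE rather than U+FEFF, so it would not match, which loosely corresponds to "as long as the encoding isn't changed".

```python
import re
from typing import List

# A separator is "---" at the start of a line, optionally preceded by a
# BOM (U+FEFF) left over from concatenating streams that each began with one.
SEPARATOR = re.compile(r"^\ufeff?---[ \t]*$\n?", re.MULTILINE)


def split_documents(text: str) -> List[str]:
    """Split decoded stream text into documents, tolerating "BOM ---"."""
    parts = SEPARATOR.split(text)
    # Drop the empty chunk that precedes the first separator.
    return [part for part in parts if part.strip()]


# Two streams, each starting with a BOM, concatenated naively.
first = "\ufeff---\na: 1\n"
second = "\ufeff---\nb: 2\n"
print(split_documents(first + second))  # ['a: 1\n', 'b: 2\n']
```

A BOM that appears anywhere other than directly before a separator is left in the document text, where a stricter parser could still reject it.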
From: Clark C . E. <cc...@cl...> - 2002-04-16 15:33:52
On Tue, Apr 16, 2002 at 06:19:28AM -0400, Oren Ben-Kiki wrote:
| > YAML_ENCODING_LATIN1
|
| As you point out, ASCII isn't necessary. And the spec currently doesn't
| cover LATIN1 at all. I don't see why we would want to support LATIN1

I was asking for it initially since I have a lot of data (in a database)
encoded as LATIN1. However, I've found that I can convert this to UTF-8
within Python fairly easily, thus migrating to UTF-8 shouldn't be hard.

| This is the point I don't get. How does insisting on UTF-8 make a barrier to
| entry? It isn't as if you have LATIN1 YAML files you have to deal with.

The only good argument that I've heard for LATIN1 is that some text
editors don't support editing UTF-8. I've never used anything but
ASCII... so I wouldn't know! (ignorant American)

| Yes. Now that the issue has been raised, I'm not certain what a YAML parser
| should do when seeing a BOM in the stream. The current spec implies that it
| would be treated in the same way as any other printable Unicode character
| (that is, in the same way as the character 'z'). Which means that it can't
| appear just before a document separator. However, if one concatenates YAML
| streams which start with a BOM, this will happen.
|
| So, we can:
| 1. Make it illegal (today's spec);
| 2. Make "BOM '-' '-' '-'" a legal separator, as long as the encoding isn't
|    changed;
| 3. Allow this BOM to specify an in-stream change of encoding.
|
| I think option (2) makes the most sense. Thoughts?

I also like #2; #3 is probably not going to work for most text editors
that support UTF.

| This still leaves open the question of whether a BOM is *required* for UTF-8
| files. I still see no good reason it should be (required).

You convinced me with the "#!" argument. I think this is one big
advantage of YAML over XML... you can specify the target program
for the YAML data. Let's not lose this wonderful feature... ;)

Clark
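The Latin-1 to UTF-8 conversion mentioned above is short in Python. The sketch below is an assumption about what such a migration could look like, not the code actually used; `latin1_to_utf8` is a made-up helper name, and it assumes the input bytes really are ISO-8859-1.

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode ISO-8859-1 bytes as UTF-8.

    Every byte value is a valid Latin-1 character, so the decode step
    never fails; it also cannot tell you if the input was mislabeled.
    """
    return data.decode("iso-8859-1").encode("utf-8")


legacy = b"caf\xe9"            # "café" in Latin-1: one byte for the e-acute
print(latin1_to_utf8(legacy))  # b'caf\xc3\xa9'
```

Pure ASCII input passes through unchanged, since ASCII is a subset of both encodings.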
From: Neil W. <neilw@ActiveState.com> - 2002-04-16 09:44:44
Oren Ben-Kiki [16/04/02 04:42 -0400]:
> Clark C . Evans [14/04/02 21:34 -0400]:
> > 1. The spec does not require a BOM for UTF-8 and it seems
> >    that this is industry practice
>
> Can you provide pointers for that?

These were the pages that started it all :)

http://msdn.microsoft.com/library/en-us/intl/unicode_42jv.asp?frame=true
http://mail.python.org/pipermail/i18n-sig/2001-May/000883.html
http://www.unicode.org/unicode/faq/utf_bom.html

Now for the rest of your email. Here's how libyaml works right now:

1. If there's no BOM and the user has told libyaml to autodetect the
   encoding, it picks the DEFAULT_ENCODING. That has changed twice in as many
   days: it started as UTF-8, migrated to ASCII, and is currently ISO-8859-1.
   I have no problem changing it back :)

2. The user may specify any of these encodings:

   YAML_ENCODING_DETECT   /* detect based on BOM; falls back to DEFAULT */
   YAML_ENCODING_UTF8
   YAML_ENCODING_UTF16    /* BOM required */
   YAML_ENCODING_UTF32    /* BOM required */
   YAML_ENCODING_UTF16BE
   YAML_ENCODING_UTF16LE
   YAML_ENCODING_UTF32BE
   YAML_ENCODING_UTF32LE
   YAML_ENCODING_ASCII    /* not really needed anymore */
   YAML_ENCODING_LATIN1

   If we change the default back to UTF-8, then the only way libyaml will
   accept Latin-1 input is for the user to specify it as such. This is a
   significant "barrier to entry", if you will.

3. As a special encouragement to use UTF encodings, I'll sprinkle 'sleep'
   calls randomly throughout the Latin-1 and ASCII transformations. :)

> 4. The fact is that in western Europe and the USA it is easier to work with
> iso-latin files than with UTF-8 files. However, I expect that in time, as
> operating systems are migrating to Unicode, this will cease to be the case
> and UTF-8 would become the default encoding of text files. I don't want YAML
> to carry baggage from "the bad old days" when this happens.

Hogwash! So far, every Loader being implemented (Perl, Python, Java)
is sitting in the middle of a rich Unicode library. Perl's native
string format is UTF-8; Python's and Java's are UTF-16. There is no
problem with UTF.

> So, I'd rather just keep things as they are. If you want to change the way
> the spec handles encoding, a better focus would be "how to concatenate
> UTF-16 streams".

Are you referring to accidentally introducing BOMs in the middle of
the content?

Later,
Neil
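The "detect based on BOM; falls back to DEFAULT" behavior described in points 1 and 2 can be sketched in Python as follows. This is not libyaml's code: `detect_encoding`, the table layout, and the returned codec names are illustrative assumptions, and only the BOM constants come from Python's standard `codecs` module.

```python
import codecs

DEFAULT_ENCODING = "utf-8"  # the default under discussion in this thread

# UTF-32 is checked before UTF-16 because the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_encoding(data: bytes, default: str = DEFAULT_ENCODING):
    """Return (encoding name, BOM length in bytes) for the raw input."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return default, 0  # no BOM: fall back to the default


raw = codecs.BOM_UTF16_LE + "a: 1\n".encode("utf-16-le")
encoding, skip = detect_encoding(raw)
print(encoding, raw[skip:].decode(encoding))
```

One known ambiguity remains in any scheme like this: UTF-16 LE text whose first character after the BOM is U+0000 looks identical to a UTF-32 LE BOM.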
From: Clark C . E. <cc...@cl...> - 2002-04-16 15:27:55
On Tue, Apr 16, 2002 at 04:42:52AM -0400, Oren Ben-Kiki wrote:
| Adding a UTF-8 signature at the start of a file would interfere with many
| established conventions such as the kernel looking for "#!" at the beginning
| of a plaintext executable to locate the appropriate interpreter.

This one is good enough for me; defaulting to UTF-8 is ok just for
this reason.

| 4. The fact is that in western Europe and the USA it is easier to work with
| iso-latin files than with UTF-8 files. However, I expect that in time, as
| operating systems are migrating to Unicode, this will cease to be the case
| and UTF-8 would become the default encoding of text files. I don't want YAML
| to carry baggage from "the bad old days" when this happens.

Ok. The argument that I've heard for ISOLATIN1 comes from a few
Germans that I know. The editors that they use are not Unicode aware.
I have a question... how do I use "vim" to edit UTF-8?

| 5. As for the #ENCODING tag, it raises really nasty issues, as Neil has
| pointed out. We'll be regretting this directive for years.

Ok. We can leave #ENCODING out. It seems in XML land that the
most common way to detect the encoding is with a MIME wrapper of
the XML file. I'd expect the same would be true with YAML.

Best,

Clark

-- 
Clark C. Evans                   Axista, Inc.
http://www.axista.com            800.926.5525
XCOLLA Collaborative Project Management Software
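Reading the encoding from a MIME wrapper, as suggested above, could be sketched in Python with the standard `email` package. The function name `charset_from_content_type` and the media type `text/x-yaml` are assumptions made for this example; the thread does not name a media type for YAML.

```python
from email.message import Message


def charset_from_content_type(value: str, default: str = "utf-8") -> str:
    """Extract the charset parameter from a Content-Type header value.

    Falls back to `default` when the header carries no charset parameter.
    """
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset(default)


# "text/x-yaml" is a hypothetical media type used only for illustration.
print(charset_from_content_type("text/x-yaml; charset=ISO-8859-1"))
print(charset_from_content_type("text/x-yaml"))
```

The charset comes back lowercased, so callers can compare it against codec names directly. Note that this only works while the wrapper travels with the file; once the file is saved on its own, the information is gone.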
From: Brian Q. <br...@sw...> - 2002-04-16 16:53:25
> Ok. We can leave #ENCODING out. It seems in XML land that the
> most common way to detect the encoding is with a MIME wrapper of
> the XML file. I'd expect the same would be true with YAML.

If the MIME header is the only way of detecting the encoding of the XML
file, then it is not really an XML file :-)

XML requires an explicit encoding declaration for documents encoded in
anything but UTF-8 and UTF-16. And UTF-16 documents must begin with a BOM.

Cheers,
Brian