From: Clark C . E. <cc...@cl...> - 2002-04-13 02:05:17
On Fri, Apr 12, 2002 at 05:15:51PM -0700, Neil Watkiss wrote:
| I just discovered that Microsoft Notepad saves UTF-8 data with a UTF-8 BOM of
| \xef\xbb\xbf. Then I found a whole thread about Python's UTF-8 codec -- it
| seems it strips *every* BOM character from the entire stream, not just the
| actual BOM itself. [1]

This violates #22 of the Unicode BOM FAQ [3]. It's not good, since
the BOM is an actual Unicode character, ZERO WIDTH NO-BREAK SPACE.

| As another comment, I didn't even know UTF-8 had a BOM

Yep. See #29 of the same FAQ [4].

| According to Microsoft, the UTF-8 BOM serves to differentiate UTF-8
| data from non-Unicode data, like Latin1. The YAML spec specifically
| says every YAML stream is Unicode, so I don't have to accept Latin1

My guess is that we would misinterpret a Latin1 file as UTF-8, and
if we were lucky it would make an illegal character combination...

Perhaps this is a bug in the spec? Perhaps we need to only allow
ASCII, and if any 8-bit characters are found without a UTF-8 BOM
we raise an error. This would probably be the most "forward-thinking",
as we may decide to allow a "--- #ENCODING:8859-1" at a later date...

Since your parser would be the first to implement Unicode, I doubt
we have any implementations out there that would prevent us from
fixing this... thoughts?

| I should still recognize and strip it, right?

Yep!

Best,

Clark

References:
  3. http://www.unicode.org/unicode/faq/utf_bom.html#22
  4. http://www.unicode.org/unicode/faq/utf_bom.html#29

-- 
Clark C. Evans                   Axista, Inc.
http://www.axista.com            800.926.5525
XCOLLA Collaborative Project Management Software
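
For illustration only, here is a minimal Python sketch of the policy floated in the message above: strip exactly one leading UTF-8 BOM, leave any later U+FEFF characters alone, and raise an error when 8-bit data arrives without a BOM. The function name `decode_stream` and the choice of `ValueError` are assumptions made for this sketch. This is the proposal under discussion, not behaviour required by the YAML spec.

```python
import codecs

# The three-byte UTF-8 BOM that Notepad writes: b'\xef\xbb\xbf'
UTF8_BOM = codecs.BOM_UTF8


def decode_stream(data: bytes) -> str:
    """Hypothetical helper sketching the proposed BOM policy."""
    if data.startswith(UTF8_BOM):
        # Strip only the one leading BOM. Any later U+FEFF is an
        # ordinary character (ZERO WIDTH NO-BREAK SPACE) and is kept.
        return data[len(UTF8_BOM):].decode("utf-8")
    try:
        # No BOM: accept 7-bit ASCII only.
        return data.decode("ascii")
    except UnicodeDecodeError as exc:
        # 8-bit bytes without a BOM could be Latin1 or UTF-8; refuse to guess.
        raise ValueError("8-bit data without a UTF-8 BOM") from exc
```

The strict no-BOM branch matters because not every Latin1 file produces an illegal UTF-8 sequence: the Latin1 bytes `\xc3\xa9` ("Ã©") are also valid UTF-8 for "é", so a decoder that merely tries UTF-8 would accept them silently.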