Re: [Yaml-core] Byte Order Markers

On Fri, Apr 12, 2002 at 05:15:51PM -0700, Neil Watkiss wrote:
| I just discovered that Microsoft Notepad saves UTF-8 data with a UTF-8 BOM of
| \xef\xbb\xbf. Then I found a whole thread about Python's UTF-8 codec -- it
| seems it strips *every* BOM character from the entire stream, not just the
| actual BOM itself. [1]

This violates #22 of the Unicode BOM FAQ [3].  That is not good, since
U+FEFF is also an actual Unicode character, ZERO WIDTH NO-BREAK SPACE,
and only acts as a BOM at the very start of the stream.
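
To make the distinction concrete, here is a minimal Python sketch of
stripping only the leading BOM.  The function name strip_leading_bom
is made up for illustration and is not part of any YAML parser:

```python
import codecs

def strip_leading_bom(data: bytes) -> bytes:
    """Drop one UTF-8 BOM at the very start of the stream, if present.

    Hypothetical helper: later occurrences of EF BB BF are left alone,
    because there they encode an ordinary U+FEFF character.
    """
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

# A stream with a real BOM up front and a second U+FEFF in the middle.
raw = codecs.BOM_UTF8 + b"a: " + codecs.BOM_UTF8 + b"b\n"
text = strip_leading_bom(raw).decode("utf-8")

# Only the leading BOM is gone; the interior character survives.
assert text == "a: \ufeffb\n"
```

Python's "utf-8-sig" codec should give the same result on decode, so
the helper is mainly useful when you want to keep working with bytes.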

| As another comment, I didn't even know UTF-8 had a BOM

Yep.  See #29 of the same FAQ [4]

| According to Microsoft, the UTF-8 BOM serves to differentiate UTF-8
| data from non-Unicode data, like Latin1. The YAML spec specifically 
| says every YAML stream is Unicode, so I don't have to accept Latin1

My guess is that we would misinterpret a Latin1 file as UTF-8, and
if we were lucky it would contain an illegal byte sequence...
Perhaps this is a bug in the spec?  Perhaps we need to allow only
ASCII, and raise an error if any 8-bit characters are found without
a UTF-8 BOM.  This would probably be the most "forward-thinking"
option, as we may decide to allow a "--- #ENCODING:8859-1" at a later date...
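
As a rough Python sketch of that proposal (not anything the spec says
today; decode_stream is a hypothetical name), the rule would be: with
a UTF-8 BOM, decode as UTF-8; without one, accept 7-bit ASCII only.
The first lines also show why guessing is risky: some Latin1 input
fails to decode as UTF-8, but some decodes silently to the wrong text.

```python
import codecs

# "Lucky" case: Latin1 bytes for "café" are not valid UTF-8.
try:
    "caf\u00e9".encode("latin-1").decode("utf-8")
    lucky = False
except UnicodeDecodeError:
    lucky = True

# "Unlucky" case: Latin1 bytes C3 A9 are valid UTF-8 for one character.
unlucky = "\u00c3\u00a9".encode("latin-1").decode("utf-8")

def decode_stream(data: bytes) -> str:
    """Hypothetical reader for the rule proposed above."""
    if data.startswith(codecs.BOM_UTF8):
        # BOM present: the rest of the stream is declared UTF-8.
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    # No BOM: any byte with the high bit set is an error.
    for offset, byte in enumerate(data):
        if byte >= 0x80:
            raise ValueError(
                f"8-bit byte 0x{byte:02x} at offset {offset} without a UTF-8 BOM"
            )
    return data.decode("ascii")
```

With this rule, plain ASCII streams keep working unchanged, and a
Latin1 file is rejected up front instead of being misread.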

Since your parser would be the first to implement Unicode, I doubt
we have any implementations out there that would prevent us from
fixing this... thoughts?

| I should still recognize and strip it, right?

Yep!

Best,

Clark

References:
  3. http://www.unicode.org/unicode/faq/utf_bom.html#22
  4. http://www.unicode.org/unicode/faq/utf_bom.html#29

-- 
Clark C. Evans                   Axista, Inc.
http://www.axista.com            800.926.5525
XCOLLA Collaborative Project Management Software