
#573 Byte-order mark symbol is decoded as \uFEFF

Status: open
Priority: 5
Updated: 2009-12-08
Created: 2009-12-07
Private: No

Unicode encodings have a byte-order mark (BOM) mechanism, used to identify the byte order of UCS-2, UCS-4, etc., and to identify that a stream is UTF-8.
When a string is converted from utf-8 (or from "unicode", which is UCS-2 in native byte order), the byte-order mark becomes \uFEFF and appears in the output.
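
A minimal sketch of the behavior described above, assuming a Tcl build where the utf-8 encoding passes the BOM through unchanged, as this report states. The variable names are illustrative only.

```tcl
# Build the raw bytes EF BB BF (the UTF-8 encoded BOM) followed by "abc".
set bytes [binary format H* efbbbf616263]

# Decode them the same way a utf-8 channel or conversion would.
set text [encoding convertfrom utf-8 $bytes]

# Per this report, the BOM survives as the first character of the result,
# so the string is one character longer than the visible text.
puts "length: [string length $text]"
puts "starts with BOM: [expr {[string index $text 0] eq "\uFEFF"}]"
```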

This is theoretically useful for applications that have to distinguish between marked and unmarked files or strings, but it creates obscure difficulties for all other applications. I don't know what should be done (ignoring the BOM when converting from utf-8 may not be a good option), but we need at least _some_ way to convert data from unicode/utf-8 that doesn't leave the byte-order mark in the internal string representation.

Maybe some new encoding(s) should be introduced, like unicode-bom (autodetecting between utf-8, UCS-2 LE/BE, etc. on input, stripping the BOM on input, emitting a BOM on output). I'm not sure what should be done (and I'm not even sure the problem should be solved at the Tcl level).
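
The autodetection idea can be sketched at script level as follows. This is only an illustration: `sniffBom` is a hypothetical helper, and the labels it returns are plain strings rather than names of encodings that Tcl is guaranteed to provide.

```tcl
# Hypothetical helper: inspect the first bytes of binary data and report
# which BOM, if any, is present. Returns a two-element list: a label for
# the detected encoding and the BOM length in bytes.
proc sniffBom {bytes} {
    # Hex dump of at most the first four bytes (lowercase digits).
    binary scan [string range $bytes 0 3] H* hex

    # The 4-byte marks are tested first because FF FE 00 00 also begins
    # with the 2-byte little-endian mark FF FE. That input is genuinely
    # ambiguous (it could be a 2-byte BOM followed by a NUL character);
    # this sketch simply prefers the longer match.
    if {[string match 0000feff* $hex]} { return {ucs-4be 4} }
    if {[string match fffe0000* $hex]} { return {ucs-4le 4} }
    if {[string match efbbbf* $hex]}   { return {utf-8 3} }
    if {[string match feff* $hex]}     { return {ucs-2be 2} }
    if {[string match fffe* $hex]}     { return {ucs-2le 2} }
    return {unknown 0}
}

# Usage: detect the mark, then drop that many bytes before decoding.
set raw [binary format H* efbbbf6869]
lassign [sniffBom $raw] label bomLength
set payload [string range $raw $bomLength end]
```

A caller would still have to map the label to whatever encoding name its Tcl version offers before decoding the payload.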

Discussion

  • Anton Kovalenko

    Anton Kovalenko - 2009-12-07

    On second thought, maybe it should be handled when reading the utf-8-encoded channel: something like fconfigure $fh -ignorebom true, or -bomchar \uFEFF, to skip the BOM when reading the file. Other conversions done at the channel level (CR/LF/EOF) are "of the same spirit", so to speak.
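
The -ignorebom and -bomchar options above are proposals, not existing fconfigure options. Until something like them exists, a script-level approximation might look like the sketch below; `openUtf8SkipBom` is a hypothetical name, and the sketch assumes a seekable channel such as a regular file.

```tcl
# Hypothetical helper (not part of Tcl): open a UTF-8 file for reading and
# consume a leading BOM, leaving the channel positioned at the real content.
proc openUtf8SkipBom {path} {
    set fh [open $path r]
    fconfigure $fh -encoding utf-8

    # Peek at the first character. If it is not a BOM, rewind so the
    # caller sees the file from the beginning. This requires a seekable
    # channel, so it would not work on pipes or sockets.
    if {[read $fh 1] ne "\uFEFF"} {
        seek $fh 0
    }
    return $fh
}
```

The caller then uses gets or read on the returned channel as usual and closes it when done.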

     
  • Donal K. Fellows

    • labels: 105681 --> 322381
    • milestone: 897103 -->
    • assigned_to: nijtmans --> dgp
     
  • Don Porter

    Don Porter - 2009-12-08

    I like the channel options approach better myself.

     
  • Don Porter

    Don Porter - 2009-12-08
    • labels: 322381 --> 25. Channel System
    • assigned_to: dgp --> andreas_kupries