The Tool Command Language implementation

#573 Byte-order mark symbol is decoded as \uFEFF

Status: open

Owner: Andreas Kupries

Labels: 25. Channel System (16)

Priority: 5

Updated: 2009-12-08

Created: 2009-12-07

Creator: Anton Kovalenko

Private: No

Unicode encodings have a byte-order mark (BOM) mechanism: it identifies the byte order of UCS-2, UCS-4, etc., and can also flag a stream as UTF-8.
When a string is converted from utf-8 (or from "unicode", which is UCS-2 in native byte order), the byte-order mark is decoded as \uFEFF and appears in the output.
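
A minimal way to see the behaviour described above, assuming a Tcl 8.5-era interpreter; the byte string and the variable names are made up for illustration:

```tcl
# EF BB BF is the UTF-8 encoding of U+FEFF (the BOM); 68 65 6c 6c 6f is "hello".
set bytes [binary format H* efbbbf68656c6c6f]
set text  [encoding convertfrom utf-8 $bytes]

# The decoded string has one character more than "hello" ...
puts [string length $text]
# ... because its first character is the decoded BOM.
puts [expr {[string index $text 0] eq "\uFEFF"}]
```

Any code that then compares $text against "hello", or uses it as a key, has to know about the extra leading character.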

It's theoretically useful for applications that have to distinguish between marked and unmarked files or strings, but it creates obscure difficulties for all other applications. I don't know what should be done (ignoring the BOM when converting from utf-8 may not be a good option), but we need at least _some_ way to convert data from unicode/utf-8 that doesn't leave the byte-order mark in the internal string representation.

Maybe some new encoding(s) should be introduced, like unicode-bom (autodetecting between utf-8, UCS-2 LE/BE, etc. on input, stripping the BOM on input, adding a BOM on output). I'm not sure what should be done (or even that the problem should be solved at the Tcl level).
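
Until something like that exists, the input half of the idea can be approximated at script level. The following is only a sketch: sniffBom and decodeUtf8NoBom are hypothetical helper names, not part of Tcl, and the labels returned for the UTF-16 marks are descriptive only (they may not be valid encoding names on every Tcl version). UTF-32 marks are not handled.

```tcl
# Hypothetical helper: look at the first bytes of $data and return a
# two-element list {label bomLength}; label is empty when no BOM is found.
proc sniffBom {data} {
    foreach {label bom} {
        utf-8    EFBBBF
        utf-16le FFFE
        utf-16be FEFF
    } {
        set mark [binary format H* $bom]
        set n    [string length $mark]
        if {[string range $data 0 [expr {$n - 1}]] eq $mark} {
            return [list $label $n]
        }
    }
    return [list {} 0]
}

# Hypothetical helper: decode UTF-8 bytes, dropping a leading BOM if present.
proc decodeUtf8NoBom {data} {
    lassign [sniffBom $data] label n
    if {$label eq "utf-8"} {
        set data [string range $data $n end]
    }
    return [encoding convertfrom utf-8 $data]
}
```

With these, decodeUtf8NoBom [binary format H* efbbbf6869] returns the two-character string "hi", and input without a BOM is decoded unchanged.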

Discussion

Anton Kovalenko - 2009-12-07

On second thought, maybe it should be handled when reading the utf-8-encoded channel: something like fconfigure $fh -ignorebom true, or -bomchar \uFEFF, to skip the BOM when reading the file. Other conversions done at the channel level (CR/LF/EOF) are "of the same spirit", so to speak.
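
Neither -ignorebom nor -bomchar exists today; they are only proposals. As a rough script-level stand-in, assuming a seekable file, something like the following could be used. The proc name openSkippingBom is made up for this sketch:

```tcl
# Hypothetical helper: open $path for reading as UTF-8 and position the
# channel just past a leading BOM, if there is one.
proc openSkippingBom {path} {
    set fh [open $path r]
    fconfigure $fh -encoding utf-8
    # Read one character; if it is not the BOM, rewind so nothing is lost.
    if {[read $fh 1] ne "\uFEFF"} {
        seek $fh 0 start
    }
    return $fh
}
```

This only works on channels that support seek (regular files, not pipes or sockets), which is one reason a real channel option would be preferable.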

Donal K. Fellows - 2009-12-08

labels: 105681 --> 322381

milestone: 897103 -->

assigned_to: nijtmans --> dgp

Don Porter - 2009-12-08

I like the channel options approach better myself.

Don Porter - 2009-12-08

labels: 322381 --> 25. Channel System

assigned_to: dgp --> andreas_kupries