The Tool Command Language implementation

#392 New encodings: utf-16, utf-16be, utf-16le

Status: open

Owner: Jeffrey Hobbs

Labels: 10. Objects (26)

Priority: 7

Updated: 2005-10-03

Created: 2005-03-18

Creator: Andy Goth

Private: No

The "unicode" encoding provides something like
utf-16ne, where ne is native endianness. This isn't
very useful, and it requires scripts to specially
process byte arrays (swap byte order, strip/add BOM)
passing into [encoding convertfrom] and out of
[encoding convertto].
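
For illustration, here is a minimal sketch of the script-level
processing this forces on the output side. It assumes a Tcl 8.x
interpreter where "unicode" is native-endian; the proc name
string_to_utf16 and its arguments are hypothetical, not part of Tcl.

```tcl
# Hypothetical helper, not a Tcl built-in: encode a string as UTF-16 in
# a chosen byte order, using only the native-endian "unicode" encoding.
# order is "bigEndian" or "littleEndian", the same values that
# tcl_platform(byteOrder) uses. bom is a boolean: prepend a BOM or not.
proc string_to_utf16 {str order {bom 0}} {
    # The script has to add the BOM itself, as an ordinary character.
    if {$bom} {
        set str "\uFEFF$str"
    }
    set bytes [encoding convertto unicode $str]
    # Swap the two bytes of every 16-bit unit when the requested order
    # differs from the host's: read little-endian, write big-endian.
    if {$order ne $::tcl_platform(byteOrder)} {
        binary scan $bytes s* units
        set bytes [binary format S* $units]
    }
    return $bytes
}
```

A call such as [string_to_utf16 $text bigEndian 1] stands in for what
a single [encoding convertto utf-16 $text] could do.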

What's needed are "utf-16", "utf-16be", and "utf-16le"
encodings. By all means, keep "unicode" as it is to
avoid breaking stuff. But these new encodings will
adhere more closely to the standards of the same names.

"utf-16" will correctly handle an initial BOM
(byte-order mark): If present, it is stripped but
affects the byte ordering of the rest of the string.
If absent, big-endian order is assumed. If found in
the middle of the string, it is taken as ZWNBSP
(zero-width no-break space). And when translating
to "utf-16", an initial BOM is prepended.

"utf-16le" and "utf-16be" will assume little-endian and
big-endian byte orderings, respectively, and will treat
BOMs as ZWNBSPs.

(Or perhaps I have misread the standard.)
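
The decoding rules proposed above can be sketched at script level as
follows. This is only an illustration under stated assumptions: a
Tcl 8.x interpreter where "unicode" is native-endian, and a
hypothetical proc name utf16_decode that is not part of Tcl. It models
the requested behaviour and is not an implementation of the encodings
themselves.

```tcl
# Hypothetical helper, not a Tcl built-in: decode UTF-16 bytes following
# the rules proposed in this request.
proc utf16_decode {bytes {enc utf-16}} {
    switch -- $enc {
        utf-16be { set order bigEndian }
        utf-16le { set order littleEndian }
        utf-16 {
            # Big-endian unless an initial BOM says otherwise.
            set order bigEndian
            set head ""
            binary scan $bytes H4 head
            if {$head eq "feff" || $head eq "fffe"} {
                if {$head eq "fffe"} {
                    set order littleEndian
                }
                # The initial BOM is stripped from the payload.
                set bytes [string range $bytes 2 end]
            }
        }
        default { error "unknown encoding \"$enc\"" }
    }
    # Swap the bytes of each 16-bit unit if the data's order differs
    # from the host's, then let the native "unicode" encoding decode it.
    if {$order ne $::tcl_platform(byteOrder)} {
        binary scan $bytes s* units
        set bytes [binary format S* $units]
    }
    return [encoding convertfrom unicode $bytes]
}
```

With utf-16be or utf-16le the leading two bytes are never inspected,
so a U+FEFF at the start comes through as an ordinary character.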

I have seen comments to the effect that the BOM should
never be stripped because it conveys information (the
byte ordering) that may be useful at the script level.
But why would the programmer want to know the byte
ordering of a utf-16 string? I can't think of a good
reason. All I can come up with is that knowing the byte
order allows the programmer to decode the utf-16 string
in multiple chunks, only the first of which contains the
BOM. But this is invalid because the programmer can't
legally seek to a random byte location and must
therefore decode everything as a single unit. So I
conclude that the BOM is part of the encoding and not
the payload.

Discussion

Donal K. Fellows - 2005-03-18

labels: 322376 --> 10. Objects

assigned_to: dkf --> msofer

miguel sofer - 2005-04-27

assigned_to: msofer --> dgp

miguel sofer - 2005-04-27

Logged In: YES
user_id=148712

Not my field of expertise ... passing the ball around.

Jeffrey Hobbs - 2005-05-31

assigned_to: dgp --> hobbs

Don Porter - 2005-10-03

Logged In: YES
user_id=80530

Want to add these to TIP 258?

Jeffrey Hobbs - 2005-10-03

priority: 5 --> 7

Jeffrey Hobbs - 2005-10-03

Logged In: YES
user_id=72656

You can, but I don't think that these require a TIP. I
wrote up something about it on tcl-core a while back, and
the general response was "just do it". I hit a problem with
a certain issue of byte swapping at one point, which is why
I didn't commit ... need to revisit it.

Don Porter - 2005-10-03

Logged In: YES
user_id=80530

Sounds fine to me; I'll leave them alone then.

afredd - 2006-12-07

Logged In: YES
user_id=1386588
Originator: NO

*bump*

Am curious - is this likely to make it for 8.5?
Would definitely be useful :^)

Additionally, support for utf32le & utf32be makes sense.
Also what would the behaviour be for reading in characters
outside Tcl's UCS-2 character set?

If there's a patch floating about, I'm willing to do some
additional testing if it's needed.

Cheers, afredd.

Andy Goth - 2008-12-18

Any news on this feature request? Jeff, did you ever find your old code?

Andy Goth - 2010-06-22

Re-bump. Any progress?
