[Tcl-bugs] [ tcl-Feature Requests-1165752 ] New encodings: utf-16, utf-16be, utf-16le

Feature Requests item #1165752, was opened at 2005-03-17 23:44
Message generated for change (Comment added) made by dgp
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=360894&aid=1165752&group_id=10894

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: 10. Objects
Group: None
Status: Open
Resolution: None
Priority: 7
Submitted By: Andy Goth (andygoth)
Assigned to: Jeffrey Hobbs (hobbs)
Summary: New encodings: utf-16, utf-16be, utf-16le

Initial Comment:
The "unicode" encoding provides something like
utf-16ne, where ne is native endianness.  This isn't
very useful, and it requires scripts to specially
process byte arrays (swap byte order, strip/add BOM)
passing into [encoding convertfrom] and out of
[encoding convertto].

What's needed are "utf-16", "utf-16be", and "utf-16le"
encodings.  By all means, keep "unicode" as it is to
avoid breaking stuff.  But these new encodings will
adhere more closely to the standards of the same name.

"utf-16" will correctly handle an initial BOM
(byte-order mark):  If present, it is stripped but
affects the byte ordering of the rest of the string. 
If absent, big-endian order is assumed.  If found in
the middle of the string, it is taken as ZWNBSP
(zero-width non-breaking space).  And when translating
to "utf-16", an initial BOM is prepended.

"utf-16le" and "utf-16be" will assume little-endian and
big-endian byte orderings, respectively, and will treat
BOMs as ZWNBSPs.

(Or perhaps I have misread the standard.)
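
The behaviour requested above can be sketched as follows. This is an illustration in Python, not Tcl's implementation; the function names are hypothetical. The request does not say which byte order "utf-16" should emit when encoding, so big-endian output is an assumption made here.

```python
# Illustrative sketch of the requested semantics (hypothetical helper names).
BOM_BE = b"\xfe\xff"  # U+FEFF serialized big-endian
BOM_LE = b"\xff\xfe"  # U+FEFF serialized little-endian


def decode_utf16(data: bytes) -> str:
    """Proposed "utf-16": an initial BOM is stripped and selects the byte order.

    Without a BOM, big-endian is assumed. A later U+FEFF is left in the
    result as an ordinary character (ZWNBSP).
    """
    if data[:2] == BOM_LE:
        return data[2:].decode("utf-16-le")
    if data[:2] == BOM_BE:
        return data[2:].decode("utf-16-be")
    return data.decode("utf-16-be")


def encode_utf16(text: str) -> bytes:
    """Proposed "utf-16" output: prepend a BOM (big-endian is an assumption)."""
    return BOM_BE + text.encode("utf-16-be")


def decode_utf16be(data: bytes) -> str:
    """Proposed "utf-16be": fixed byte order, U+FEFF kept as ZWNBSP."""
    return data.decode("utf-16-be")


def decode_utf16le(data: bytes) -> str:
    """Proposed "utf-16le": fixed byte order, U+FEFF kept as ZWNBSP."""
    return data.decode("utf-16-le")
```

With these helpers, decode_utf16(encode_utf16(s)) returns s unchanged, while decode_utf16be applied to the same bytes keeps the leading U+FEFF in the result.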

I have seen comments to the effect that the BOM should
never be stripped because it conveys information (the
byte ordering) that may be useful at the script level.
But why would the programmer want to know the byte
ordering of a utf-16 string?  I can't think of a good
reason.  All I can come up with is, knowing the byte
order allows the programmer to decode the utf-16 string
in multiple chunks, only the first of which contains
the BOM.  But this is invalid because the programmer can't
legally seek to a random byte location and must
therefore decode everything as a single unit.  So I
conclude that the BOM is part of the encoding and not
the payload.
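
One way to see the BOM as part of the encoding is a decoder that keeps the byte order as internal state, so input arriving in chunks never requires the script to look at the BOM. The sketch below uses Python's standard incremental decoder as an illustration only; it is not how Tcl works, and decode_in_chunks is a hypothetical name. Note that Python's "utf-16" codec falls back to native byte order when there is no BOM, which differs from the big-endian default proposed above.

```python
import codecs


def decode_in_chunks(chunks):
    """Decode UTF-16 data delivered as an ordered sequence of byte chunks."""
    # The decoder remembers the byte order from the BOM and buffers any
    # incomplete code unit until the next chunk arrives.
    decoder = codecs.getincrementaldecoder("utf-16")()
    parts = [decoder.decode(chunk) for chunk in chunks]
    # final=True makes the decoder raise if the input ended mid-character.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)
```

The chunks must still be fed in order from the start of the stream, which is consistent with the point that seeking to an arbitrary byte offset is not valid.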

----------------------------------------------------------------------

>Comment By: Don Porter (dgp)
Date: 2005-10-03 14:24

Message:
Logged In: YES 
user_id=80530

Sounds fine to me; I'll leave them alone then.

----------------------------------------------------------------------

Comment By: Jeffrey Hobbs (hobbs)
Date: 2005-10-03 13:03

Message:
Logged In: YES 
user_id=72656

You can, but I don't think that these require a TIP.  I
wrote up something about it on tcl-core a while back, and
the general response was "just do it".  I hit a problem with
a certain issue of byte swapping at one point, which is why
I didn't commit ... need to revisit it.

----------------------------------------------------------------------

Comment By: Don Porter (dgp)
Date: 2005-10-03 12:04

Message:
Logged In: YES 
user_id=80530

Want to add these to TIP 258?

----------------------------------------------------------------------

Comment By: miguel sofer (msofer)
Date: 2005-04-27 16:38

Message:
Logged In: YES 
user_id=148712

Not my field of expertise ... passing the ball around.

----------------------------------------------------------------------
