#392 New encodings: utf-16, utf-16be, utf-16le

Status: open
Priority: 7
Updated: 2005-10-03
Created: 2005-03-18
Creator: Andy Goth
Private: No

The "unicode" encoding provides something like
utf-16ne, where ne is native endianness. This isn't
very useful, and it requires scripts to specially
process byte arrays (swap byte order, strip/add BOM)
passing into [encoding convertfrom] and out of
[encoding convertto].
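
To make the burden concrete, here is a minimal sketch of the byte swapping a script has to do when the only codec available is native-endian. It is written in Python purely for illustration and is not Tcl code; `NATIVE`, `FOREIGN`, `swap_byte_pairs`, and `decode_foreign_order` are hypothetical names, and the native-order codec is only a stand-in for the "unicode" encoding.

```python
import sys

# Stand-ins for a native-endian-only codec such as the "unicode" encoding
# described above (an assumption for illustration, not Tcl's actual code).
NATIVE = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"
FOREIGN = "utf-16-be" if sys.byteorder == "little" else "utf-16-le"

def swap_byte_pairs(data: bytes) -> bytes:
    """Swap the two bytes of every 16-bit code unit."""
    if len(data) % 2:
        raise ValueError("UTF-16 data must have an even number of bytes")
    swapped = bytearray(len(data))
    swapped[0::2] = data[1::2]  # odd-indexed bytes move to even positions
    swapped[1::2] = data[0::2]  # even-indexed bytes move to odd positions
    return bytes(swapped)

def decode_foreign_order(data: bytes) -> str:
    """Decode opposite-endian data using only the native-order codec."""
    return swap_byte_pairs(data).decode(NATIVE)
```

Every script that reads opposite-endian data needs a helper of this kind, plus separate BOM handling, which is the work the requested encodings would absorb.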

What's needed are "utf-16", "utf-16be", and "utf-16le"
encodings. By all means, keep "unicode" as it is to
avoid breaking stuff. But these new encodings will
more closely adhere to the standards of the same name.

"utf-16" will correctly handle an initial BOM
(byte-order mark): If present, it is stripped but
affects the byte ordering of the rest of the string.
If absent, big-endian order is assumed. If found in
the middle of the string, it is taken as ZWNBSP
(zero-width non-breaking space). And when translating
to "utf-16", an initial BOM is prepended.

"utf-16le" and "utf-16be" will assume little-endian and
big-endian byte orderings, respectively, and will treat
BOMs as ZWNBSPs.
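
For comparison, Python's fixed-order codecs behave the way described here, which makes for a short illustration. This shows Python's behavior, not Tcl's, and the variable names are arbitrary.

```python
ZWNBSP = "\ufeff"  # U+FEFF, the same code point that serves as the BOM

# Fixed-order decoding: a leading FF FE is kept as a character, not stripped.
le_bytes = b"\xff\xfe" + "hi".encode("utf-16-le")
le_text = le_bytes.decode("utf-16-le")

# Fixed-order encoding: no BOM is added to the output.
be_bytes = "hi".encode("utf-16-be")
```

Here `le_text` starts with `ZWNBSP` followed by "hi", and `be_bytes` holds exactly four bytes with no mark in front.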

(Or perhaps I have misread the standard.)

I have seen comments to the effect that the BOM should
never be stripped because it conveys information (the
byte ordering) that may be useful at the script level.
But why would the programmer want to know the byte
ordering of a utf-16 string? I can't think of a good
reason. All I can come up with is, knowing the byte
order allows the programmer to decode the utf-16 string
in multiple chunks, only the first of which contains
BOM. But this is invalid because the programmer can't
legally seek to a random byte location and must
therefore decode everything as a single unit. So I
conclude that the BOM is part of the encoding and not
the payload.
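
One way to see that the byte order can stay inside the decoder: a stateful decoder reads the BOM from the first chunk and applies it to every later chunk, so the calling code never learns or needs the order. The sketch below uses Python's incremental "utf-16" decoder as an illustration only; `decode_chunks` is a hypothetical helper, and the input is assumed to start with a BOM.

```python
import codecs

def decode_chunks(chunks):
    """Decode an iterable of byte chunks as one UTF-16 stream."""
    decoder = codecs.getincrementaldecoder("utf-16")()
    # The decoder remembers the byte order found in the first chunk and
    # buffers any half code unit left over at a chunk boundary.
    parts = [decoder.decode(chunk) for chunk in chunks]
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)

# The BOM appears only in the first chunk, and the split falls mid-character.
text = decode_chunks([b"\xfe\xff\x00", b"h\x00i"])
```

The chunks are still consumed in order from the start, which is consistent with the point that the stream has to be decoded as a single unit.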

Discussion

  • Donal K. Fellows

    • labels: 322376 --> 10. Objects
    • assigned_to: dkf --> msofer
     
  • miguel sofer

    miguel sofer - 2005-04-27
    • assigned_to: msofer --> dgp
     
  • miguel sofer

    miguel sofer - 2005-04-27

    Not my field of expertise ... passing the ball around.

     
  • Jeffrey Hobbs

    Jeffrey Hobbs - 2005-05-31
    • assigned_to: dgp --> hobbs
     
  • Don Porter

    Don Porter - 2005-10-03

    Want to add these to TIP 258 ?

     
  • Jeffrey Hobbs

    Jeffrey Hobbs - 2005-10-03
    • priority: 5 --> 7
     
  • Jeffrey Hobbs

    Jeffrey Hobbs - 2005-10-03

    You can, but I don't think that these require a TIP. I
    wrote up something about it on tcl-core a while back, and
    the general response was "just do it". I hit a problem with
    a certain issue of byte swapping at one point, which is why
    I didn't commit ... need to revisit it.

     
  • Don Porter

    Don Porter - 2005-10-03

    Sounds fine to me; I'll leave them alone then.

     
  • afredd

    afredd - 2006-12-07

    *bump*

    Am curious - is this likely to make it for 8.5?
    Would definitely be useful :^)

    Additionally, support for utf32le & utf32be makes sense.
    Also what would the behaviour be for reading in characters
    outside Tcl's UCS-2 character set?

    If there's a patch floating about, I'm willing to do some
    additional testing if it's needed.

    Cheers, afredd.

     
  • Andy Goth

    Andy Goth - 2008-12-18

    Any news on this feature request? Jeff, did you ever find your old code?

     
  • Andy Goth

    Andy Goth - 2010-06-22

    Re-bump. Any progress?

     