From: SourceForge.net <no...@so...> - 2005-10-03 18:24:31
Feature Requests item #1165752, was opened at 2005-03-17 23:44
Message generated for change (Comment added) made by dgp
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=360894&aid=1165752&group_id=10894

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.

Category: 10. Objects
Group: None
Status: Open
Resolution: None
Priority: 7
Submitted By: Andy Goth (andygoth)
Assigned to: Jeffrey Hobbs (hobbs)
Summary: New encodings: utf-16, utf-16be, utf-16le

Initial Comment:
The "unicode" encoding provides something like utf-16ne, where ne is native endianness. This isn't very useful, and it requires scripts to specially process byte arrays (swap byte order, strip/add BOM) passing into [encoding convertfrom] and out of [encoding convertto].

What's needed are "utf-16", "utf-16be", and "utf-16le" encodings. By all means, keep "unicode" as it is to avoid breaking stuff. But these new encodings will more closely adhere to the standards of the same name.

"utf-16" will correctly handle an initial BOM (byte-order mark): If present, it is stripped but affects the byte ordering of the rest of the string. If absent, big-endian order is assumed. If found in the middle of the string, it is taken as ZWNBSP (zero-width non-breaking space). And when translating to "utf-16", an initial BOM is prepended.

"utf-16le" and "utf-16be" will assume little-endian and big-endian byte orderings, respectively, and will treat BOMs as ZWNBSPs. (Or perhaps I have misread the standard.)

I have seen comments to the effect that the BOM should never be stripped because it conveys information (the byte ordering) that may be useful at the script level. But why would the programmer want to know the byte ordering of a utf-16 string? I can't think of a good reason. All I can come up with is that knowing the byte order allows the programmer to decode the utf-16 string in multiple chunks, only the first of which contains a BOM. But this is invalid because the programmer can't legally seek to a random byte location and must therefore decode everything as a single unit. So I conclude that the BOM is part of the encoding and not the payload.

----------------------------------------------------------------------

>Comment By: Don Porter (dgp)
Date: 2005-10-03 14:24

Message:
Logged In: YES
user_id=80530

Sounds fine to me; I'll leave them alone then.

----------------------------------------------------------------------

Comment By: Jeffrey Hobbs (hobbs)
Date: 2005-10-03 13:03

Message:
Logged In: YES
user_id=72656

You can, but I don't think that these require a TIP. I wrote up something about it on tcl-core a while back, and the general response was "just do it". I hit a problem with a certain issue of byte swapping at one point, which is why I didn't commit ... need to revisit it.

----------------------------------------------------------------------

Comment By: Don Porter (dgp)
Date: 2005-10-03 12:04

Message:
Logged In: YES
user_id=80530

Want to add these to TIP 258?

----------------------------------------------------------------------

Comment By: miguel sofer (msofer)
Date: 2005-04-27 16:38

Message:
Logged In: YES
user_id=148712

Not my field of expertise ... passing the ball around.

----------------------------------------------------------------------
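For illustration only, the BOM rules described in the initial comment can be sketched as follows. This is a rough Python sketch, not Tcl and not the implementation discussed in this thread. The function names are hypothetical, and writing a big-endian BOM on output is an assumption, since the request does not say which byte order "utf-16" should emit.

```python
import codecs


def decode_utf16(data: bytes) -> str:
    """Proposed "utf-16" decoding: an initial BOM picks the byte order and
    is stripped; with no BOM, big-endian is assumed."""
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    return data.decode("utf-16-be")


def encode_utf16(text: str) -> bytes:
    """Proposed "utf-16" encoding: prepend a BOM.
    Big-endian output is an assumption of this sketch."""
    return codecs.BOM_UTF16_BE + text.encode("utf-16-be")


def decode_utf16be(data: bytes) -> str:
    """Proposed "utf-16be" decoding: fixed byte order, so a BOM is not
    stripped and decodes as U+FEFF (ZWNBSP)."""
    return data.decode("utf-16-be")


def decode_utf16le(data: bytes) -> str:
    """Proposed "utf-16le" decoding: same idea, little-endian."""
    return data.decode("utf-16-le")
```

Under these rules only a BOM at offset 0 of a "utf-16" byte string is consumed. A U+FEFF that appears later, or anywhere in "utf-16be" or "utf-16le" input, stays in the decoded text as an ordinary character.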