From: SourceForge.net <no...@so...> - 2005-03-18 04:44:07
Feature Requests item #1165752, was opened at 2005-03-18 04:44
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=360894&aid=1165752&group_id=10894

Category: 16. Commands A-H
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Andy Goth (andygoth)
Assigned to: Donal K. Fellows (dkf)
Summary: New encodings: utf-16, utf-16be, utf-16le

Initial Comment:
The "unicode" encoding provides something like utf-16ne, where ne is native endianness. This isn't very useful, and it requires scripts to specially process byte arrays (swap byte order, strip/add BOM) passing into [encoding convertfrom] and out of [encoding convertto].

What's needed are "utf-16", "utf-16be", and "utf-16le" encodings. By all means, keep "unicode" as it is to avoid breaking stuff. But these new encodings will more closely adhere to the standards of the same name.

"utf-16" will correctly handle an initial BOM (byte-order mark): If present, it is stripped but affects the byte ordering of the rest of the string. If absent, big-endian order is assumed. If found in the middle of the string, it is taken as ZWNBSP (zero-width non-breaking space). And when translating to "utf-16", an initial BOM is prepended.

"utf-16le" and "utf-16be" will assume little-endian and big-endian byte orderings, respectively, and will treat BOMs as ZWNBSPs. (Or perhaps I have misread the standard.)

I have seen comments to the effect that the BOM should never be stripped because it conveys information (the byte ordering) that may be useful at the script level. But why would the programmer want to know the byte ordering of a utf-16 string? I can't think of a good reason. All I can come up with is, knowing the byte order allows the programmer to decode the utf-16 string in multiple chunks, only the first of which contains the BOM. But this is invalid because the programmer can't legally seek to a random byte location and must therefore decode everything as a single unit. So I conclude that the BOM is part of the encoding and not the payload.
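
The BOM rules proposed in the initial comment can be sketched as follows. This is only an illustration in Python, not Tcl and not an actual implementation of the requested encodings. The function names decode_utf16 and encode_utf16 are hypothetical, and emitting big-endian output when encoding is an assumption, since the request does not say which byte order "utf-16" should write.

```python
# Illustrative sketch of the proposed "utf-16" BOM handling.
# decode_utf16 and encode_utf16 are hypothetical names, not Tcl APIs.

BOM_BE = b"\xfe\xff"  # U+FEFF serialized big-endian
BOM_LE = b"\xff\xfe"  # U+FEFF serialized little-endian


def decode_utf16(data: bytes) -> str:
    """Decode bytes using the proposed "utf-16" rules."""
    # An initial BOM is stripped and selects the byte order for the rest.
    if data[:2] == BOM_BE:
        return data[2:].decode("utf-16-be")
    if data[:2] == BOM_LE:
        return data[2:].decode("utf-16-le")
    # No initial BOM: big-endian order is assumed.
    return data.decode("utf-16-be")


def encode_utf16(text: str) -> bytes:
    """Encode text as "utf-16" with an initial BOM prepended."""
    # Assumption: big-endian output; the request leaves this unspecified.
    return BOM_BE + text.encode("utf-16-be")


if __name__ == "__main__":
    # A BOM after the first position is kept as U+FEFF (ZWNBSP).
    print(repr(decode_utf16(b"\xfe\xff\x00A\xfe\xff\x00B")))
    # The fixed-endianness codec keeps even an initial BOM as U+FEFF,
    # matching the proposed "utf-16be" behavior.
    print(repr(b"\xfe\xff\x00A".decode("utf-16-be")))
```

Python's own "utf-16-be" and "utf-16-le" codecs are used here only as fixed-byte-order decoders that never strip U+FEFF, which corresponds to the behavior requested for "utf-16be" and "utf-16le".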