From: Markus Scherer <markus.icu@gm...> - 2017-08-10 17:49:43
Dear ICU team & users, I would like to propose the following API for: ICU 60 Please provide feedback by: next Tuesday, 2017-08-15 Ticket: http://bugs.icu-project.org/trac/ticket/13311 *Proposal* Change the ICU behavior for illegal UTF-8 subsequences, adopting the behavior described in Unicode 6+ and in the W3C Encoding Standard. This will not change processing of valid UTF-8, nor change what strings are valid vs. invalid. This will not visibly change string transformations (e.g., normalization or case mapping) which treat illegal subsequences as "inert" and copy them as is to the output. This will change output from APIs where illegal sequences are counted, replaced with U+FFFD or equivalent, are sent to error handlers, or are visible as units of code point iteration. I can think of the following: - Conversion from UTF-8 to UTF-16 will tend to call error callbacks with shorter illegal sequences, and write more replacement characters. (C & Java character converters, and lower-level C string conversion functions like u_strFromUTF8().) - UTF-8 macros like U8_NEXT() and U8_BACK_1() will tend to skip over shorter illegal sequences per invocation. - UText iteration over illegal UTF-8. - Regular expressions matching a certain number of U+FFFD or ".". - Character (grapheme cluster) BreakIterator. - In collation of UTF-8 strings, more collation elements (equivalent to U+FFFD) will tend to be generated for illegal sequences. I suspect that most users will not notice the difference. The biggest impact might be that software that interacts with web standards can stop working around ICU behavior of illegal UTF-8. *Explanation* ICU decodes UTF-8 according to the original spec, as described for example in https://tools.ietf.org/html/rfc2279#section-2 (which has been obsoleted in 2003 by https://tools.ietf.org/html/rfc3629) - Bytes C0..FD are lead bytes and indicate 1..5 trail bytes. - The lead byte alone is sufficient for how many trail bytes to read. - Non-shortest forms, surrogates, and code points >10FFFF are forbidden and yield one error per sequence. - Truncated sequences that start as in the original spec but do not have enough trail bytes also yield one error for each truncated sequence. Since Unicode 6 <http://www.unicode.org/versions/Unicode6.0.0/> (October 2010), the standard (chapter 3) has “recommended” “best practices for using U+FFFD” where only truncated valid sequences yield a single error; otherwise each byte up to (and excluding) the next single or lead byte yields one error. (This is how non-UTF-8 MBCS converters work in ICU. The most natural implementation is via a state table.) For example, each of the following is a single error in ICU but yields one error per byte in the Unicode description: - C0 AF - E0 80 80 - ED A0 80 - F0 80 80 80 - F4 90 80 80 - FD 80 80 80 80 80 Truncations of valid sequences yield one single error either way: - E1 80 - F1 80 80 - etc. The Unicode 6+ “best practice” has become enshrined as a requirement in the W3C Encoding Standard <https://www.w3.org/TR/encoding/#utf-8-decoder>. As a result, software that interacts with web standards needs to work around ICU behavior of illegal UTF-8. This usually means carrying their own implementations of UTF-8 converters and such. Otherwise, implementations differ widely. The Unicode Standard does say that almost any behavior is ok, within very wide bounds (emit at least one FFFD for an illegal subsequence of one or more bytes). 
UTC meeting #151 (in May) decided to change the “best practice” to what ICU does, but UTC meeting #152 (last week) retracted that and instead decided to keep describing the behavior outlined above. There has been discussion about whether to keep calling it a “best practice”, and that discussion will continue, but this is the one behavior described in the standard in detail. (The standard also mentions emitting one U+FFFD per illegal-subsequence byte, and emitting one U+FFFD per whole subsequence.)

Sincerely,
markus