From: Markus S. <mar...@gm...> - 2010-04-15 19:10:37
Dear ICU team and users,

I propose adding a new class and associated constants for UTS #46 <http://www.unicode.org/reports/tr46/> / IDNA2008 processing.

We have existing IDNA2003 C <http://icu-project.org/apiref/icu4c/uidna_8h.html#_details> and Java <http://icu-project.org/apiref/icu4j/com/ibm/icu/text/IDNA.html> API, but I decided to define a whole new class for the new version, for several reasons:

- The existing API uses static functions, which requires synchronization at each function entry; we have been trying to get away from that.
- Static functions also mean that we would have to distinguish between the old and new versions via option bits, which is bad for code structure and dependencies. The new version requires totally different code and data.
- The existing API (at least in C, and we have a bug for Java) does not follow the IDNA spec's rule that the toUnicode operation should never fail. It does fail with a failure UErrorCode (or Java exception), and in that case does not return the processed string.
- We only have C API, not C++. For C++ users, that's inconvenient: It's much easier to get a UnicodeString than to deal with buffer overflows.
- The old API always performs BiDi checks for IDNA2003, driven by the NamePrep data file. The new API makes it optional.

The proposed API has an abstract interface with a factory method. Worker functions take an IDNAErrors class object parameter which is currently just a container of error bits but could be extended later if we wanted to provide more error details. I defined an error bit for each type of error that is easily identifiable in the code.

In the proposed API, IDNA processing errors (disallowed character, failed BiDi check, etc.) do not result in a failure UErrorCode and would not result in a Java exception.

I have no plans to implement UTS #46 in the old API, nor IDNA2003 in the new API.

The new code replaces disallowed characters with U+FFFD, as recommended by UTS #46. This is not optional, because it's baked into the data file. The code also adds U+FFFD in a couple of other error cases to give a visual indication that something went wrong, so that in an error case the processed string is usable for safe display to the user.

ToUnicode never fails (but can report IDNA errors) and always returns a processed string, but when toASCII finds IDNA errors, it returns a bogus (and empty) string. The exceptions to "never fails" are non-IDNA error conditions like out-of-memory and failure to load the data file.

Questions about the API:

1. Do we have the right options?
2. Do we have a reasonable set of error bits? It would be hard to make more distinctions, due to the use of a Normalizer2 data file for much of the processing; we could collapse some error bits into more generic errors if there are "too many". (At the extreme, we could collapse them all to a single error bit, but there is already IDNAErrors::hasErrors() for convenience.)
   1. Note: I am not completely sure what a user will do with the errors other than a boolean indication that something is wrong.
3. Please see the TODOs in the header file.

Also, it is not entirely obvious when and where U+FFFD should be written. For that, please see some of the TODOs in the implementation code, and generally how I handle errors.

I have not worked on C and Java APIs yet. They are intended to be very parallel to the C++ version. I will send details once I have them.

Header: http://bugs.icu-project.org/trac/browser/icu/branches/markus/uts46/source/common/unicode/idna.h
Implementation: http://bugs.icu-project.org/trac/browser/icu/branches/markus/uts46/source/common/uts46.cpp

I am also copying most of the header file below, for a searchable record of the API proposal.

Ticket: http://bugs.icu-project.org/trac/ticket/7144
Designated API reviewer: Mark

Please provide feedback by next Thursday, April 22.

Sincerely,
markus

--- most of the idna.h header ---

/*
 * IDNA option bit set values.
 */
enum {
    // TODO: Options from old API are mostly usable with the new API as well.
    // All options should be moved to a C header.
    // They are actually still defined in uidna.h right now and thus commented out here.
    // TODO: It should be safe to replace the old #defines with enum constants, right?
    /**
     * Default options value: None of the other options are set.
     * @stable ICU 2.6
     */
    // UIDNA_DEFAULT=0,
    /**
     * Option to allow unassigned code points in domain names and labels.
     * This option is ignored by the UTS46 implementation.
     * @stable ICU 2.6
     */
    // UIDNA_ALLOW_UNASSIGNED=1,
    /**
     * Option to check whether the input conforms to the STD3 ASCII rules,
     * for example the restriction of labels to LDH characters
     * (ASCII Letters, Digits and Hyphen-Minus).
     * @stable ICU 2.6
     */
    // UIDNA_USE_STD3_RULES=2,
    /**
     * IDNA option to check for whether the input conforms to the BiDi rules.
     * @draft ICU 4.6
     */
    UIDNA_CHECK_BIDI=4,
    /**
     * IDNA option to check for whether the input conforms to the CONTEXTJ rules.
     * @draft ICU 4.6
     */
    UIDNA_CHECK_CONTEXTJ=8,
    /**
     * IDNA option for nontransitional processing in ToASCII().
     * By default, ToASCII() uses transitional processing.
     * @draft ICU 4.6
     */
    UIDNA_NONTRANSITIONAL_TO_ASCII=0x10
};

/*
 * IDNA error bit set values.
 * When a domain name or label fails a processing step or does not meet the
 * validity criteria, then one or more of these error bits are set.
 */
enum {
    // TODO: Should we combine the length errors into one single UIDNA_ERROR_LABEL_OR_NAME_LENGTH?
    /**
     * A non-final domain name label (or the whole domain name) is empty.
     * @draft ICU 4.6
     */
    UIDNA_ERROR_EMPTY_LABEL=1,
    /**
     * A domain name label is longer than 63 bytes.
     * (See STD13/RFC1034 3.1. Name space specifications and terminology.)
     * This is only checked in ToASCII operations, and only if the UIDNA_USE_STD3_RULES is set.
     * @draft ICU 4.6
     */
    UIDNA_ERROR_LABEL_TOO_LONG=2,
    /**
     * A domain name is longer than 255 bytes in its storage form.
     * (See STD13/RFC1034 3.1. Name space specifications and terminology.)
     * This is only checked in ToASCII operations, and only if the UIDNA_USE_STD3_RULES is set.
     * @draft ICU 4.6
     */
    UIDNA_ERROR_DOMAIN_NAME_TOO_LONG=4,
    // TODO: Should we combine the hyphen errors into one single UIDNA_ERROR_BAD_HYPHEN?
    /**
     * A label starts with a hyphen-minus ('-').
     * @draft ICU 4.6
     */
    UIDNA_ERROR_LEADING_HYPHEN=8,
    /**
     * A label ends with a hyphen-minus ('-').
     * @draft ICU 4.6
     */
    UIDNA_ERROR_TRAILING_HYPHEN=0x10,
    /**
     * A label contains hyphen-minus ('-') in the third and fourth positions.
     * @draft ICU 4.6
     */
    UIDNA_ERROR_HYPHEN_3_4=0x20,
    // TODO: Is this a reasonable name?
    /**
     * A label starts with a combining mark.
     * @draft ICU 4.6
     */
    UIDNA_ERROR_LEADING_COMBINING_MARK=0x40,
    /**
     * A label or domain name contains disallowed characters.
     * @draft ICU 4.6
     */
    UIDNA_ERROR_DISALLOWED=0x80,
    /**
     * A label starts with "xn--" but does not contain valid Punycode.
     * @draft ICU 4.6
     */
    UIDNA_ERROR_PUNYCODE=0x100,
    /**
     * A label contains a dot=full stop.
     * This can occur in an ACE label, and in an input string for a single-label function.
     * @draft ICU 4.6
     */
    UIDNA_ERROR_LABEL_HAS_DOT=0x200,
    /**
     * An ACE label is not valid.
     * It might contain characters that are not allowed in ACE labels,
     * or it might not be normalized, or both.
     * @draft ICU 4.6
     */
    UIDNA_ERROR_INVALID_ACE_LABEL=0x400,
    /**
     * A label does not meet the IDNA BiDi requirements (for right-to-left characters).
     * @draft ICU 4.6
     */
    UIDNA_ERROR_BIDI=0x800,
    /**
     * A label does not meet the IDNA CONTEXTJ requirements.
     * @draft ICU 4.6
     */
    UIDNA_ERROR_CONTEXTJ=0x1000
};

/**
 * Abstract base class for IDNA processing.
 * See http://www.unicode.org/reports/tr46/
 * and http://www.ietf.org/rfc/rfc3490.txt
 *
 * This newer API currently only implements UTS #46.
 * The older uidna.h C API only implements IDNA2003.
 * @draft ICU 4.6
 */
class U_COMMON_API IDNA : public UObject {
public:
    /**
     * Returns an IDNA instance which implements UTS #46.
     * Returns an unmodifiable instance, owned by the caller.
     * Cache it for multiple operations, and delete it when done.
     *
     * UTS #46 defines Unicode IDNA Compatibility Processing,
     * updated to the latest version of Unicode and compatible with both
     * IDNA2003 and IDNA2008.
     *
     * ToASCII operations use transitional processing, including deviation mappings,
     * unless the UIDNA_NONTRANSITIONAL_TO_ASCII is used.
     * ToUnicode operations always use nontransitional processing,
     * passing deviation characters through without change.
     *
     * Disallowed characters are mapped to U+FFFD.
     *
     * For available options see the uidna.h header as well as this header.
     * Operations with the UTS #46 instance do not support the
     * UIDNA_ALLOW_UNASSIGNED option.
     *
     * By default, UTS #46 disallows all ASCII characters other than
     * letters, digits, hyphen (LDH) and dot/full stop.
     * When the UIDNA_USE_STD3_RULES option is used, all ASCII characters are treated as
     * valid or mapped.
     *
     * TODO: Do we need separate toASCIIOptions and toUnicodeOptions?
     *       That is, would users commonly want different options for the
     *       toASCII and toUnicode operations?
     *
     * @param options Bit set to modify the processing and error checking.
     * @param errorCode Standard ICU error code. Its input value must
     *                  pass the U_SUCCESS() test, or else the function returns
     *                  immediately. Check for U_FAILURE() on output or use with
     *                  function chaining. (See User Guide for details.)
     * @return the UTS #46 IDNA instance, if successful
     * @draft ICU 4.6
     */
    static IDNA *
    createUTS46Instance(uint32_t options, UErrorCode &errorCode);

    /**
     * Converts a single domain name label into its ASCII form for DNS lookup.
     * ToASCII can fail if the input label cannot be converted into an ASCII form.
     * In this case, the destination string will be bogus and errors.hasErrors() will be TRUE.
     *
     * The UErrorCode indicates an error only in exceptional cases,
     * such as a U_MEMORY_ALLOCATION_ERROR.
     *
     * @param label Input domain name label
     * @param dest Destination string object
     * @param errors Output container of IDNA processing errors.
     * @param errorCode Standard ICU error code. Its input value must
     *                  pass the U_SUCCESS() test, or else the function returns
     *                  immediately. Check for U_FAILURE() on output or use with
     *                  function chaining. (See User Guide for details.)
     * @return dest
     * @draft ICU 4.6
     */
    virtual UnicodeString &
    labelToASCII(const UnicodeString &label, UnicodeString &dest,
                 IDNAErrors &errors, UErrorCode &errorCode) const = 0;

    /**
     * Converts a single domain name label into its Unicode form for human-readable display.
     * ToUnicode never fails. If any processing step fails, then the input label
     * is returned, possibly with modifications according to the types of errors,
     * and errors.hasErrors() will be TRUE.
     *
     * For available options see the uidna.h header.
     *
     * @param label Input domain name label
     * @param dest Destination string object
     * @param errors Output container of IDNA processing errors.
     * @param errorCode Standard ICU error code. Its input value must
     *                  pass the U_SUCCESS() test, or else the function returns
     *                  immediately. Check for U_FAILURE() on output or use with
     *                  function chaining. (See User Guide for details.)
     * @return dest
     * @draft ICU 4.6
     */
    virtual UnicodeString &
    labelToUnicode(const UnicodeString &label, UnicodeString &dest,
                   IDNAErrors &errors, UErrorCode &errorCode) const = 0;

    /**
     * Converts a whole domain name into its ASCII form for DNS lookup.
     * ToASCII can fail if the input name cannot be converted into an ASCII form.
     * In this case, the destination string will be bogus and errors.hasErrors() will be TRUE.
     *
     * The UErrorCode indicates an error only in exceptional cases,
     * such as a U_MEMORY_ALLOCATION_ERROR.
     *
     * @param name Input domain name
     * @param dest Destination string object
     * @param errors Output container of IDNA processing errors.
     * @param errorCode Standard ICU error code. Its input value must
     *                  pass the U_SUCCESS() test, or else the function returns
     *                  immediately. Check for U_FAILURE() on output or use with
     *                  function chaining. (See User Guide for details.)
     * @return dest
     * @draft ICU 4.6
     */
    virtual UnicodeString &
    nameToASCII(const UnicodeString &name, UnicodeString &dest,
                IDNAErrors &errors, UErrorCode &errorCode) const = 0;

    /**
     * Converts a whole domain name into its Unicode form for human-readable display.
     * ToUnicode never fails. If any processing step fails, then the input domain name
     * is returned, possibly with modifications according to the types of errors,
     * and errors.hasErrors() will be TRUE.
     *
     * @param name Input domain name
     * @param dest Destination string object
     * @param errors Output container of IDNA processing errors.
     * @param errorCode Standard ICU error code. Its input value must
     *                  pass the U_SUCCESS() test, or else the function returns
     *                  immediately. Check for U_FAILURE() on output or use with
     *                  function chaining. (See User Guide for details.)
     * @return dest
     * @draft ICU 4.6
     */
    virtual UnicodeString &
    nameToUnicode(const UnicodeString &name, UnicodeString &dest,
                  IDNAErrors &errors, UErrorCode &errorCode) const = 0;

    /**
     * ICU "poor man's RTTI", returns a UClassID for this class.
     * @return a UClassID for this class.
     * @draft ICU 4.6
     */
    static UClassID U_EXPORT2 getStaticClassID();

    /**
     * ICU "poor man's RTTI", returns a UClassID for the actual class.
     * @return a UClassID for the actual class.
     * @draft ICU 4.6
     */
    virtual UClassID getDynamicClassID() const = 0;
};

/**
 * Output container for IDNA processing errors.
 * @draft ICU 4.6
 */
class U_COMMON_API IDNAErrors : public UObject {
public:
    /**
     * Constructor for stack allocation.
     * @draft ICU 4.6
     */
    IDNAErrors() : errors(0) {}

    /**
     * Were there IDNA processing errors?
     * @return TRUE if there were processing errors
     * @draft ICU 4.6
     */
    UBool hasErrors() const { return errors!=0; }

    /**
     * Returns a bit set indicating IDNA processing errors.
     * See UIDNA_ERROR_... constants.
     * @return bit set of processing errors
     * @draft ICU 4.6
     */
    uint32_t getErrors() const { return errors; }

    /**
     * ICU "poor man's RTTI", returns a UClassID for this class.
     * @return a UClassID for this class.
     * @draft ICU 4.6
     */
    static UClassID U_EXPORT2 getStaticClassID();

    /**
     * ICU "poor man's RTTI", returns a UClassID for the actual class.
     * @return a UClassID for the actual class.
     * @draft ICU 4.6
     */
    virtual UClassID getDynamicClassID() const;
};
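
--- illustration: decoding the error bit set ---

Question 2 above asks what a caller would do with the individual error bits beyond a boolean check. As one possible answer, here is a minimal sketch that turns the value of IDNAErrors::getErrors() into a readable list for logging or diagnostics. It does not depend on ICU: the UIDNA_ERROR_... values are copied from the proposed header above, while describeIdnaErrors() and its table are hypothetical names invented for this illustration and are not part of the proposal.

```cpp
#include <cstdint>
#include <string>

// Error bit values copied from the proposed idna.h (draft ICU 4.6).
enum {
    UIDNA_ERROR_EMPTY_LABEL=1,
    UIDNA_ERROR_LABEL_TOO_LONG=2,
    UIDNA_ERROR_DOMAIN_NAME_TOO_LONG=4,
    UIDNA_ERROR_LEADING_HYPHEN=8,
    UIDNA_ERROR_TRAILING_HYPHEN=0x10,
    UIDNA_ERROR_HYPHEN_3_4=0x20,
    UIDNA_ERROR_LEADING_COMBINING_MARK=0x40,
    UIDNA_ERROR_DISALLOWED=0x80,
    UIDNA_ERROR_PUNYCODE=0x100,
    UIDNA_ERROR_LABEL_HAS_DOT=0x200,
    UIDNA_ERROR_INVALID_ACE_LABEL=0x400,
    UIDNA_ERROR_BIDI=0x800,
    UIDNA_ERROR_CONTEXTJ=0x1000
};

// Hypothetical table entry: one error bit and a short display name.
struct IdnaErrorName {
    uint32_t bit;
    const char *name;
};

// Hypothetical helper (not part of the proposal): returns a comma-separated
// list of the error names whose bits are set, or an empty string if none are.
std::string describeIdnaErrors(uint32_t errors) {
    static const IdnaErrorName kNames[] = {
        { UIDNA_ERROR_EMPTY_LABEL, "EMPTY_LABEL" },
        { UIDNA_ERROR_LABEL_TOO_LONG, "LABEL_TOO_LONG" },
        { UIDNA_ERROR_DOMAIN_NAME_TOO_LONG, "DOMAIN_NAME_TOO_LONG" },
        { UIDNA_ERROR_LEADING_HYPHEN, "LEADING_HYPHEN" },
        { UIDNA_ERROR_TRAILING_HYPHEN, "TRAILING_HYPHEN" },
        { UIDNA_ERROR_HYPHEN_3_4, "HYPHEN_3_4" },
        { UIDNA_ERROR_LEADING_COMBINING_MARK, "LEADING_COMBINING_MARK" },
        { UIDNA_ERROR_DISALLOWED, "DISALLOWED" },
        { UIDNA_ERROR_PUNYCODE, "PUNYCODE" },
        { UIDNA_ERROR_LABEL_HAS_DOT, "LABEL_HAS_DOT" },
        { UIDNA_ERROR_INVALID_ACE_LABEL, "INVALID_ACE_LABEL" },
        { UIDNA_ERROR_BIDI, "BIDI" },
        { UIDNA_ERROR_CONTEXTJ, "CONTEXTJ" }
    };
    std::string result;
    for (const IdnaErrorName &entry : kNames) {
        if ((errors & entry.bit) != 0) {
            if (!result.empty()) {
                result += ',';
            }
            result += entry.name;
        }
    }
    return result;
}
```

With the proposed classes, a caller would presumably check errors.hasErrors() after nameToUnicode() or nameToASCII() and, if it is TRUE, pass errors.getErrors() to a helper like this one. Collapsing error bits, as discussed in question 2, would only shrink the table.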