[icu-design] ICU API proposal: new class for UTS #46 (IDNA2008)

Dear ICU team and users,

I propose adding a new class and associated constants for UTS #46
<http://www.unicode.org/reports/tr46/> / IDNA2008 processing.

We have existing IDNA2003 C
<http://icu-project.org/apiref/icu4c/uidna_8h.html#_details> and Java
<http://icu-project.org/apiref/icu4j/com/ibm/icu/text/IDNA.html> APIs,
but I decided to define a whole new class for the new version, for several
reasons:

   - The existing API uses static functions, which requires synchronization
   at each function entry; we have been trying to get away from that.
   - Static functions also mean that we would have to distinguish between
   the old and new versions via option bits, which is bad for code structure
   and dependencies. The new version requires totally different code and data.
   - The existing API (at least in C, and we have a bug for Java) does not
   follow the IDNA spec's rule that the toUnicode operation should never fail.
   It does fail with a failure UErrorCode (or Java exception), and in that case
   does not return the processed string.
   - We only have a C API, not C++. For C++ users, that's inconvenient: It's
   much easier to get a UnicodeString than to deal with buffer overflows.
   - The old API always performs BiDi checks for IDNA2003, driven by the
   NamePrep data file. The new API makes it optional.

The proposed API has an abstract interface with a factory method. Worker
functions take an IDNAErrors class object parameter which is currently just
a container of error bits but could be extended later if we wanted to
provide more error details. I defined an error bit for each type of error
that is easily identifiable in the code. In the proposed API, IDNA
processing errors (disallowed character, failed BiDi check, etc.) do not
result in a failure UErrorCode and would not result in a Java exception.

I have no plans to implement UTS #46 in the old API, nor IDNA2003 in the new
API.

The new code replaces disallowed characters with U+FFFD, as recommended by
UTS #46. This is not optional, because it's baked into the data file. The
code also adds U+FFFD in a couple of other error cases to give a visual
indication that something went wrong, so that in an error case the processed
string is usable for safe display to the user.
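
To illustrate the U+FFFD replacement described above, here is a minimal sketch. It is not the uts46.cpp code: the real classification is baked into the data file, whereas isDisallowedForDemo() below is a hypothetical predicate invented for this example, and std::u16string stands in for UnicodeString. Only the UIDNA_ERROR_DISALLOWED value is taken from the header.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Error bit value copied from the proposed header (local copy for this sketch).
constexpr uint32_t UIDNA_ERROR_DISALLOWED = 0x80;

constexpr char16_t kReplacementChar = 0xFFFD;

// Hypothetical stand-in for the data-driven lookup: this demo treats
// ASCII space, C0 controls and DEL as disallowed. The real set of
// disallowed characters is defined by the UTS #46 data, not by this function.
inline bool isDisallowedForDemo(char16_t c) {
    return c <= 0x20 || c == 0x7F;
}

// Copies the label, replacing each disallowed code unit with U+FFFD and
// recording the error bit. The returned string always has the same length
// as the input, so it stays usable for display even when errors are set.
inline std::u16string markDisallowed(const std::u16string &label,
                                     uint32_t &errors) {
    std::u16string dest;
    dest.reserve(label.size());
    for (char16_t c : label) {
        if (isDisallowedForDemo(c)) {
            dest.push_back(kReplacementChar);
            errors |= UIDNA_ERROR_DISALLOWED;
        } else {
            dest.push_back(c);
        }
    }
    return dest;
}
```

With this sketch, the input "a b" comes back as 'a', U+FFFD, 'b' with the UIDNA_ERROR_DISALLOWED bit set, while "abc" comes back unchanged with no error bits.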

ToUnicode never fails (but can report IDNA errors) and always returns a
processed string, but when toASCII finds IDNA errors, it returns a bogus
(and empty) string. The exceptions to "never fails" are non-IDNA error
conditions like out-of-memory and failure to load the data file.
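
The difference between the two operations can be sketched with stand-in types. Everything named Demo... below is hypothetical and exists only for this example: DemoStatus plays the role of UErrorCode, DemoErrors the role of IDNAErrors, and DemoString the role of a UnicodeString with its bogus state. The only "processing" shown is an empty-label check; mapping, normalization and Punycode are left out.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical stand-ins for this sketch only; none of these are ICU types.
enum DemoStatus { DEMO_OK, DEMO_OUT_OF_MEMORY };

// Same value as UIDNA_ERROR_EMPTY_LABEL in the proposed header.
constexpr uint32_t DEMO_ERROR_EMPTY_LABEL = 1;

struct DemoErrors {
    uint32_t bits = 0;
    bool hasErrors() const { return bits != 0; }
};

struct DemoString {
    std::u16string text;
    bool bogus = false;
    void setToBogus() { text.clear(); bogus = true; }
};

// ToUnicode style: an IDNA error sets an error bit, but dest still
// receives a string and the status code is left untouched.
inline DemoString &demoLabelToUnicode(const std::u16string &label,
                                      DemoString &dest, DemoErrors &errors,
                                      DemoStatus &status) {
    if (status != DEMO_OK) { return dest; }  // chaining: skip after a hard failure
    dest.text = label;
    dest.bogus = false;
    if (label.empty()) { errors.bits |= DEMO_ERROR_EMPTY_LABEL; }
    return dest;
}

// ToASCII style: the same checks, but any IDNA error makes dest bogus
// (and empty). The status code still reports success.
inline DemoString &demoLabelToASCII(const std::u16string &label,
                                    DemoString &dest, DemoErrors &errors,
                                    DemoStatus &status) {
    if (status != DEMO_OK) { return dest; }
    demoLabelToUnicode(label, dest, errors, status);
    if (errors.hasErrors()) { dest.setToBogus(); }
    return dest;
}
```

A caller would check the status code once for hard failures, and separately ask the errors object whether the result should be trusted.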

Questions about the API:

   1. Do we have the right options?
   2. Do we have a reasonable set of error bits? It would be hard to make
   more distinctions, due to the use of a Normalizer2 data file for much of the
   processing; we could collapse some error bits into more generic errors if
   there are "too many". (At the extreme, we could collapse them all to a
   single error bit, but there is already IDNAErrors::hasErrors() for
   convenience.)
      1. Note: I am not completely sure what a user will do with the errors
      other than a boolean indication that something is wrong.
   3. Please see the TODOs in the header file.
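
One thing a user might do with the individual bits beyond a boolean check is log them. The following sketch turns a bit set into a list of constant names. The values are copied from the header excerpt below; the ErrorName table and describeErrors() are hypothetical helpers, not part of the proposal.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Pairs each proposed error bit with its constant name (values from idna.h).
struct ErrorName {
    uint32_t bit;
    const char *name;
};

static const ErrorName kErrorNames[] = {
    {0x1,    "UIDNA_ERROR_EMPTY_LABEL"},
    {0x2,    "UIDNA_ERROR_LABEL_TOO_LONG"},
    {0x4,    "UIDNA_ERROR_DOMAIN_NAME_TOO_LONG"},
    {0x8,    "UIDNA_ERROR_LEADING_HYPHEN"},
    {0x10,   "UIDNA_ERROR_TRAILING_HYPHEN"},
    {0x20,   "UIDNA_ERROR_HYPHEN_3_4"},
    {0x40,   "UIDNA_ERROR_LEADING_COMBINING_MARK"},
    {0x80,   "UIDNA_ERROR_DISALLOWED"},
    {0x100,  "UIDNA_ERROR_PUNYCODE"},
    {0x200,  "UIDNA_ERROR_LABEL_HAS_DOT"},
    {0x400,  "UIDNA_ERROR_INVALID_ACE_LABEL"},
    {0x800,  "UIDNA_ERROR_BIDI"},
    {0x1000, "UIDNA_ERROR_CONTEXTJ"},
};

// Returns the names of all error bits set in `errors`, in ascending bit order.
// An empty result corresponds to IDNAErrors::hasErrors() returning FALSE.
inline std::vector<std::string> describeErrors(uint32_t errors) {
    std::vector<std::string> names;
    for (const ErrorName &e : kErrorNames) {
        if ((errors & e.bit) != 0) {
            names.push_back(e.name);
        }
    }
    return names;
}
```

A caller holding an IDNAErrors object would pass errors.getErrors() to such a helper. If the error bits were collapsed into fewer, more generic ones, only the table would shrink.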

Also, it is not entirely obvious when and where U+FFFD should be written.
For that, please see some of the TODOs in the implementation code, and
generally how I handle errors.

I have not worked on the C and Java APIs yet. They are intended to closely
parallel the C++ version. I will send details once I have them.

Header:
http://bugs.icu-project.org/trac/browser/icu/branches/markus/uts46/source/common/unicode/idna.h
Implementation:
http://bugs.icu-project.org/trac/browser/icu/branches/markus/uts46/source/common/uts46.cpp

I am also copying most of the header file below, for a searchable record of
the API proposal.

Ticket: http://bugs.icu-project.org/trac/ticket/7144
Designated API reviewer: Mark

Please provide feedback by next Thursday, April 22.

Sincerely,
markus

--- most of the idna.h header ---

/*
 * IDNA option bit set values.
 */
enum {
    // TODO: Options from old API are mostly usable with the new API as well.
    // All options should be moved to a C header.
    // They are actually still defined in uidna.h right now and thus
    // commented out here.
    // TODO: It should be safe to replace the old #defines with enum
    // constants, right?
    /**
     * Default options value: None of the other options are set.
     * @stable ICU 2.6
     */
    // UIDNA_DEFAULT=0,
    /**
     * Option to allow unassigned code points in domain names and labels.
     * This option is ignored by the UTS46 implementation.
     * @stable ICU 2.6
     */
    // UIDNA_ALLOW_UNASSIGNED=1,
    /**
     * Option to check whether the input conforms to the STD3 ASCII rules,
     * for example the restriction of labels to LDH characters
     * (ASCII Letters, Digits and Hyphen-Minus).
     * @stable ICU 2.6
     */
    // UIDNA_USE_STD3_RULES=2,
    /**
     * IDNA option to check whether the input conforms to the BiDi rules.
     * @draft ICU 4.6
     */
    UIDNA_CHECK_BIDI=4,
    /**
     * IDNA option to check whether the input conforms to the CONTEXTJ rules.
     * @draft ICU 4.6
     */
    UIDNA_CHECK_CONTEXTJ=8,
    /**
     * IDNA option for nontransitional processing in ToASCII().
     * By default, ToASCII() uses transitional processing.
     * @draft ICU 4.6
     */
    UIDNA_NONTRANSITIONAL_TO_ASCII=0x10
};

/*
 * IDNA error bit set values.
 * When a domain name or label fails a processing step or does not meet the
 * validity criteria, then one or more of these error bits are set.
 */
enum {
    // TODO: Should we combine the length errors into one single
    // UIDNA_ERROR_LABEL_OR_NAME_LENGTH?
    /**
     * A non-final domain name label (or the whole domain name) is empty.
     * @draft ICU 4.6
     */
    UIDNA_ERROR_EMPTY_LABEL=1,
    /**
     * A domain name label is longer than 63 bytes.
     * (See STD13/RFC1034 3.1. Name space specifications and terminology.)
     * This is only checked in ToASCII operations, and only if the
     * UIDNA_USE_STD3_RULES option is set.
     * @draft ICU 4.6
     */
    UIDNA_ERROR_LABEL_TOO_LONG=2,
    /**
     * A domain name is longer than 255 bytes in its storage form.
     * (See STD13/RFC1034 3.1. Name space specifications and terminology.)
     * This is only checked in ToASCII operations, and only if the
     * UIDNA_USE_STD3_RULES option is set.
     * @draft ICU 4.6
     */
    UIDNA_ERROR_DOMAIN_NAME_TOO_LONG=4,
    // TODO: Should we combine the hyphen errors into one single
    // UIDNA_ERROR_BAD_HYPHEN?
    /**
     * A label starts with a hyphen-minus ('-').
     * @draft ICU 4.6
     */
    UIDNA_ERROR_LEADING_HYPHEN=8,
    /**
     * A label ends with a hyphen-minus ('-').
     * @draft ICU 4.6
     */
    UIDNA_ERROR_TRAILING_HYPHEN=0x10,
    /**
     * A label contains hyphen-minus ('-') in the third and fourth positions.
     * @draft ICU 4.6
     */
    UIDNA_ERROR_HYPHEN_3_4=0x20,  // TODO: Is this a reasonable name?
    /**
     * A label starts with a combining mark.
     * @draft ICU 4.6
     */
    UIDNA_ERROR_LEADING_COMBINING_MARK=0x40,
    /**
     * A label or domain name contains disallowed characters.
     * @draft ICU 4.6
     */
    UIDNA_ERROR_DISALLOWED=0x80,
    /**
     * A label starts with "xn--" but does not contain valid Punycode.
     * @draft ICU 4.6
     */
    UIDNA_ERROR_PUNYCODE=0x100,
    /**
     * A label contains a dot=full stop.
     * This can occur in an ACE label, and in an input string for a
     * single-label function.
     * @draft ICU 4.6
     */
    UIDNA_ERROR_LABEL_HAS_DOT=0x200,
    /**
     * An ACE label is not valid.
     * It might contain characters that are not allowed in ACE labels,
     * or it might not be normalized, or both.
     * @draft ICU 4.6
     */
    UIDNA_ERROR_INVALID_ACE_LABEL=0x400,
    /**
     * A label does not meet the IDNA BiDi requirements (for right-to-left
     * characters).
     * @draft ICU 4.6
     */
    UIDNA_ERROR_BIDI=0x800,
    /**
     * A label does not meet the IDNA CONTEXTJ requirements.
     * @draft ICU 4.6
     */
    UIDNA_ERROR_CONTEXTJ=0x1000
};

/**
 * Abstract base class for IDNA processing.
 * See http://www.unicode.org/reports/tr46/
 * and http://www.ietf.org/rfc/rfc3490.txt
 *
 * This newer API currently only implements UTS #46.
 * The older uidna.h C API only implements IDNA2003.
 * @draft ICU 4.6
 */
class U_COMMON_API IDNA : public UObject {
public:
    /**
     * Returns an IDNA instance which implements UTS #46.
     * Returns an unmodifiable instance, owned by the caller.
     * Cache it for multiple operations, and delete it when done.
     *
     * UTS #46 defines Unicode IDNA Compatibility Processing,
     * updated to the latest version of Unicode and compatible with both
     * IDNA2003 and IDNA2008.
     *
     * ToASCII operations use transitional processing, including deviation
     * mappings, unless the UIDNA_NONTRANSITIONAL_TO_ASCII option is used.
     * ToUnicode operations always use nontransitional processing,
     * passing deviation characters through without change.
     *
     * Disallowed characters are mapped to U+FFFD.
     *
     * For available options see the uidna.h header as well as this header.
     * Operations with the UTS #46 instance do not support the
     * UIDNA_ALLOW_UNASSIGNED option.
     *
     * By default, the UTS #46 implementation treats all ASCII characters as
     * valid or mapped.
     * When the UIDNA_USE_STD3_RULES option is used, ASCII characters other
     * than letters, digits, hyphen (LDH) and dot/full stop are disallowed.
     *
     * TODO: Do we need separate toASCIIOptions and toUnicodeOptions?
     *       That is, would users commonly want different options for the
     *       toASCII and toUnicode operations?
     *
     * @param options Bit set to modify the processing and error checking.
     * @param errorCode Standard ICU error code. Its input value must
     *                  pass the U_SUCCESS() test, or else the function
     *                  returns immediately. Check for U_FAILURE() on output
     *                  or use with function chaining. (See User Guide for
     *                  details.)
     * @return the UTS #46 IDNA instance, if successful
     * @draft ICU 4.6
     */
    static IDNA *
    createUTS46Instance(uint32_t options, UErrorCode &errorCode);

    /**
     * Converts a single domain name label into its ASCII form for DNS lookup.
     * ToASCII can fail if the input label cannot be converted into an ASCII
     * form. In this case, the destination string will be bogus and
     * errors.hasErrors() will be TRUE.
     *
     * The UErrorCode indicates an error only in exceptional cases,
     * such as a U_MEMORY_ALLOCATION_ERROR.
     *
     * @param label Input domain name label
     * @param dest Destination string object
     * @param errors Output container of IDNA processing errors.
     * @param errorCode Standard ICU error code. Its input value must
     *                  pass the U_SUCCESS() test, or else the function
     *                  returns immediately. Check for U_FAILURE() on output
     *                  or use with function chaining. (See User Guide for
     *                  details.)
     * @return dest
     * @draft ICU 4.6
     */
    virtual UnicodeString &
    labelToASCII(const UnicodeString &label, UnicodeString &dest,
                 IDNAErrors &errors, UErrorCode &errorCode) const = 0;

    /**
     * Converts a single domain name label into its Unicode form for
     * human-readable display.
     * ToUnicode never fails. If any processing step fails, then the input
     * label is returned, possibly with modifications according to the types
     * of errors, and errors.hasErrors() will be TRUE.
     *
     * For available options see the uidna.h header.
     *
     * @param label Input domain name label
     * @param dest Destination string object
     * @param errors Output container of IDNA processing errors.
     * @param errorCode Standard ICU error code. Its input value must
     *                  pass the U_SUCCESS() test, or else the function
     *                  returns immediately. Check for U_FAILURE() on output
     *                  or use with function chaining. (See User Guide for
     *                  details.)
     * @return dest
     * @draft ICU 4.6
     */
    virtual UnicodeString &
    labelToUnicode(const UnicodeString &label, UnicodeString &dest,
                   IDNAErrors &errors, UErrorCode &errorCode) const = 0;

    /**
     * Converts a whole domain name into its ASCII form for DNS lookup.
     * ToASCII can fail if the input domain name cannot be converted into an
     * ASCII form. In this case, the destination string will be bogus and
     * errors.hasErrors() will be TRUE.
     *
     * The UErrorCode indicates an error only in exceptional cases,
     * such as a U_MEMORY_ALLOCATION_ERROR.
     *
     * @param name Input domain name
     * @param dest Destination string object
     * @param errors Output container of IDNA processing errors.
     * @param errorCode Standard ICU error code. Its input value must
     *                  pass the U_SUCCESS() test, or else the function
     *                  returns immediately. Check for U_FAILURE() on output
     *                  or use with function chaining. (See User Guide for
     *                  details.)
     * @return dest
     * @draft ICU 4.6
     */
    virtual UnicodeString &
    nameToASCII(const UnicodeString &name, UnicodeString &dest,
                IDNAErrors &errors, UErrorCode &errorCode) const = 0;

    /**
     * Converts a whole domain name into its Unicode form for human-readable
     * display.
     * ToUnicode never fails. If any processing step fails, then the input
     * domain name is returned, possibly with modifications according to the
     * types of errors, and errors.hasErrors() will be TRUE.
     *
     * @param name Input domain name
     * @param dest Destination string object
     * @param errors Output container of IDNA processing errors.
     * @param errorCode Standard ICU error code. Its input value must
     *                  pass the U_SUCCESS() test, or else the function
     *                  returns immediately. Check for U_FAILURE() on output
     *                  or use with function chaining. (See User Guide for
     *                  details.)
     * @return dest
     * @draft ICU 4.6
     */
    virtual UnicodeString &
    nameToUnicode(const UnicodeString &name, UnicodeString &dest,
                  IDNAErrors &errors, UErrorCode &errorCode) const = 0;

    /**
     * ICU "poor man's RTTI", returns a UClassID for this class.
     * @returns a UClassID for this class.
     * @draft ICU 4.6
     */
    static UClassID U_EXPORT2 getStaticClassID();

    /**
     * ICU "poor man's RTTI", returns a UClassID for the actual class.
     * @return a UClassID for the actual class.
     * @draft ICU 4.6
     */
    virtual UClassID getDynamicClassID() const = 0;
};

/**
 * Output container for IDNA processing errors.
 * @draft ICU 4.6
 */
class U_COMMON_API IDNAErrors : public UObject {
public:
    /**
     * Constructor for stack allocation.
     * @draft ICU 4.6
     */
    IDNAErrors() : errors(0) {}
    /**
     * Were there IDNA processing errors?
     * @return TRUE if there were processing errors
     * @draft ICU 4.6
     */
    UBool hasErrors() const { return errors!=0; }
    /**
     * Returns a bit set indicating IDNA processing errors.
     * See UIDNA_ERROR_... constants.
     * @return bit set of processing errors
     * @draft ICU 4.6
     */
    uint32_t getErrors() const { return errors; }

    /**
     * ICU "poor man's RTTI", returns a UClassID for this class.
     * @returns a UClassID for this class.
     * @draft ICU 4.6
     */
    static UClassID U_EXPORT2 getStaticClassID();

    /**
     * ICU "poor man's RTTI", returns a UClassID for the actual class.
     * @return a UClassID for the actual class.
     * @draft ICU 4.6
     */
    virtual UClassID getDynamicClassID() const;
};
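
As a rough illustration of how the hyphen and label-length error bits above could be computed, here is a self-contained sketch. It is an assumption-laden simplification, not the uts46.cpp logic: checkLabelShape() is a hypothetical helper, it works on a plain ASCII std::string, it applies the length check unconditionally (the header ties it to ToASCII and an option), it reports every empty label (the header only flags non-final ones), and it assumes any "xn--" prefix has already been handled elsewhere. The constant values are local copies from the enum.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Local copies of the proposed error bit values.
constexpr uint32_t UIDNA_ERROR_EMPTY_LABEL     = 1;
constexpr uint32_t UIDNA_ERROR_LABEL_TOO_LONG  = 2;
constexpr uint32_t UIDNA_ERROR_LEADING_HYPHEN  = 8;
constexpr uint32_t UIDNA_ERROR_TRAILING_HYPHEN = 0x10;
constexpr uint32_t UIDNA_ERROR_HYPHEN_3_4      = 0x20;

// Returns the shape-related error bits for one ASCII label.
// Several bits can be set at once, for example for "-ab-".
inline uint32_t checkLabelShape(const std::string &label) {
    if (label.empty()) {
        return UIDNA_ERROR_EMPTY_LABEL;
    }
    uint32_t errors = 0;
    if (label.size() > 63) {
        errors |= UIDNA_ERROR_LABEL_TOO_LONG;
    }
    if (label.front() == '-') {
        errors |= UIDNA_ERROR_LEADING_HYPHEN;
    }
    if (label.back() == '-') {
        errors |= UIDNA_ERROR_TRAILING_HYPHEN;
    }
    // Third and fourth positions are indexes 2 and 3.
    if (label.size() >= 4 && label[2] == '-' && label[3] == '-') {
        errors |= UIDNA_ERROR_HYPHEN_3_4;
    }
    return errors;
}
```

Because the result is a bit set, a combined UIDNA_ERROR_BAD_HYPHEN (as asked in the TODO) would simply replace the three hyphen constants with one value in a helper like this.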
