[icu-design] Fwd: C API for LocaleBuilder

SourceForge Headquarters 1320 Columbia Street Suite 310 San Diego, CA 92101 +1 (858) 422-6466

---------- Forwarded message ---------
From: Frank Tang (譚永鋒) <ft...@go...>
Date: Fri, 30 Jun 2023 at 15:53
Subject: Re: [icu-design] C API for LocaleBuilder
To: Rich Gillam <ric...@ap...>

On Fri, 30 Jun 2023 at 11:00, Rich Gillam <ric...@ap...> wrote:

> Frank--
>
> That would be a different approach to construct a locale. However, I would
> like to point out several things here.
> 1. That approach won't be a "builder"
>
>
> But it would solve the problem we’re being asked to solve.
>

no, it does not. What the bug asked for is a C builder API and whatever the
API does not follow  a "builder" design pattern does not solve the problem
it asked for. A builder needs to be step by step so the decision of each
field could be spread in different times and different places in the code.
A single function with a struct does not allow that, and therefore not
fulfilling the request and requirement.

I am open to a concrete counter proposal. It is simple to think the issue
could be solved by a simple set() w/ a constant if you only consider the 1)
language, 2) script, 3) region, 4) variant, 5) languageTag, and 6) localeId
as the only set of input. But once you start to consider 7) attributes, 8)
unicode locale keyword/value, and 9) other extensions than "-u-" (for
example "-t-" or anything in other_extension) things become complicated.
The validation and override/replacement logic for each of them will be
different

>
> but what you describe would require the caller to resolve that before
> putting it into the struct.
>
>
> No, it wouldn’t.  For building locales, there’s nothing special going on
> while you’re setting up the builder— it’s just a bag of fields.
>

no, that is not true, it is NOT a bag of fields, it is an order list of
fields, not a bag => the order matter.
in bag sematic , the order does not matter

Consider the following

languageTag "en-Latn-GB-u-ca-roc"
language "ja"
script "Thai"
region "CN"
variant "hepburn-heploc"
unicode extension "ca"= "japanese"
language "sl"
variant "bohoric"
unicode extension "ca"= "coptic"

which will produce sl-Thai-CN-bohoric-u-ca-coptic"

but if you treat them as a bug and build that the order in
language "ja"
script "Thai"
region "CN"
variant "hepburn-heploc"
unicode extension "ca"= "japanese"
language "sl"
languageTag "en-Latn-GB-u-ca-roc"
variant "bohoric"
unicode extension "ca"= "coptic"

you will get
which will produce en-Latn-GB-bohoric-u-ca-coptic"

which result in different value

>  All the magic happens when you run the build() method.
>

That should be implementation depend and should not be mandated by the API,
there are nothing wrong to progressively build when each setting method got
called.

>  I’m just saying that maybe instead of adding a new opaque class with a
> bunch of individual getters and setters for all those fields, we just use a
> simple struct (or a hash table, or something).
>

As I mentioned, what you said only make sense when the underline problem is *a
bag of fields*, but the problem is this problem is not just a* bag of
fields.* The information pass into the builder have overlapping with each
other, in particular the information inside localeID and languageTag is not
multurelly exclusive from language/script/region/variant/attribute/unicode
locale keyword+value and other extension. They over lap and part of them
could be override by another. If all the input are muturally exclusive from
each other, then the problem is much simpler and possible could be solved
by "just use a simple struct" . But that is NOT the case.(as the example I
show above) now.

> The approach you mentioned seems to require the caller to do a lot of work
> to fill that struct. and the caller also needs to learn how to program
> UHashTable while using that API, is that better?
>
>
> Good point.  UHashTable is a pain in the butt.  Maybe that’s enough right
> there to swing me back around to what you’ve already proposed…
>
> That said, I do kind of like Steven’s suggestion to just have one set()
> method and just have constants for the individual fields...
>

If we look at the pre-existing locale C API in uloc.h we have different
function instead of a get w/ a constant
uloc_getLanguage(....
uloc_getScript(....
uloc_getCountry(....
uloc_getVariant(....
uloc_getName(....
uloc_getBaseName(...

notice there is no API to get attributes.

we could also use a uloc_get() with a constant, right? but we didn't years
ago. so... should we follow the same design pattern or to a set w/ constant

We have the following different kind of operation to the builder

1. language (w/o script and region), region, script, variant, language tag
(w/ script or region) => set
2. attribute => add and remove (there could be multiple attributes in a
locale
3. unicode extension set on a key ("ca" , "japanese")
4. other extension - set on a extension ('a', "abc-efg")
5. clear extension (say you start from "en-u-ca-japanese-nu-Thai" and you
want to clear the calendar and change nu to "Arab"   to "en-u-nu-Arab")

a simple set w/ constant method could simplify #1 above, but not powerful
enough to deal with #2 #3 #4 and #5
Also, I personally dislike the set + enum => internally swtich(enum)
approach because almost all the time the caller know what operation need to
perform in the compile time, but this approach mandate the function
internally switch based on a value which is fixed in the program but cannot
be easily optimized by the compiler.

Consider the following case

setA(context, value) {
   context.a = value;
}

setB(context, value) {
   context.b = value;
}
and the caller

setA(context, 3);
setB(context, 5);

vs
set(context, field, value) {
  switch(field) {
     case A: context.a = value; break;
     case B: context.b = value; break;
  }
}
and the caller
set(context, A, 3);
set(context, B, 5);

The swtich / case inside the set() simply cannot be optimized by the
compiler

 surely the first approach expose more functions but such exposure allow
the compiler to avoid unnecessary run time hit by the switch/case which
could be resovled in the compile time, not during the run time.

> —Rich
>
> On Jun 30, 2023, at 12:44 AM, Frank Tang (譚永鋒) <ft...@go...> wrote:
>
> Dear Richard
>
> On Thu, 29 Jun 2023 at 18:31, Richard Gillam <ric...@ap...>
> wrote:
>
>> Frank—
>>
>> This looks fine, but also kind of heavy-weight.  I wonder if it’d make
>> sense to just have a struct with all the fields and utility methods to
>> convert between a locale ID string and that struct?
>>
>
> That would be a different approach to construct a locale. However, I would
> like to point out several things here.
> 1. That approach won't be a "builder"
> The definition of a Builder design pattern is
> "Builder is a creational design pattern that lets you construct complex
> objects step by step. "
> "Problem
> Imagine a complex object that requires laborious, step-by-step
> initialization of many fields and nested objects. Such initialization code
> is usually buried inside a monstrous constructor with lots of parameters.
> Or even worse: scattered all over the client code."
>
> In other words, the key concept of "builder" itself is "step by step" and
> that allows the builder to get passing to different parts of the system to
> allow different part of the software to add information into the builder,
> without a centralized place to gather all the information together.
> see https://refactoring.guru/design-patterns/builder
>
> 2.  The step by step processing also means the timing of information
> putting in matter and the later information overrides the previous
> information.
> for example, I could have a
> ULocaleBuilder builder = ulb_open();
> ulb_setLanguageTag(builder, "en-Latn-GB-u-ca-islamic");
> ulb_setRegion(builder, "FR");
> ulb_setUnicodeExtension(builder, 'x', "abcd");
> .....
> ulb_setUnicodeLocaleKeyword(builder, "ca", "japanese");
> ulb_setRegion(builder, "DE");
> ulb_setUnicodeLocaleKeyword(builder, "ca", "roc");
> ulb_build(builder...)
> ulb_close(builder)
>
> to build a locale "en-Latn-DE-u-ca-roc-x-abcd"
>
> but what you describe would require the caller to resolve that before
> putting it into the struct.
>
>  I guess the weak point in that approach would be the key-value pairs at
>> the end of the locale ID— an alternative approach might be to use a
>> UHashTable (or whatever it's called) as the bearer of fields (or to provide
>> both the struct and the hash table, depending on whether the caller cares
>> about the key-value fields).  The beauty of an approach like this is that
>> you save a few heap allocations and don’t have to add as many new API
>> functions.
>>
>
> The approach you mentioned seems to require the caller to do a lot of work
> to fill that struct. and the caller also needs to learn how to program
> UHashTable while using that API, is that better?
>
>
>> On the other hand, what you have is a thin wrapper around the existing
>> C++ LocaleBuilder class, so you’re not having to implement the
>> locale-mangling code all over again.
>>
>> I don’t know— what does everybody else think?
>>
>> —Rich
>>
>> On Jun 29, 2023, at 5:09 PM, Frank Tang (譚永鋒) via icu-design <
>> icu...@li...> wrote:
>>
>>
>>
>> Dear ICU team & users,
>>
>>
>> I would like to propose the following for: ICU 74
>>
>> Please provide feedback by: Next Wednesday, July 5, or any time
>> sufficiently in advance of the feature freeze
>>
>> Designated API review: Gary Wade
>>
>> Issue: https://unicode-org.atlassian.net/browse/ICU-22365
>> A draft PR and implementation (not including test yet) could be found at
>> https://github.com/unicode-org/icu/pull/2520 )
>>
>> This is to follow up the C++ API in ICU 64 and provide a C API requested
>> in
>>
>> Issue to be discussed
>>
>> 1. Should the prefix be ulb_ or something else?
>> 2. What should ulb_build build? the value as returned by
>> Locale::getName() or Locale::toLanguageTag()?
>> 3. is it to use file ulocbuilder.{h,cpp}? or should it be something else?
>>
>> Add C API for ULocaleBuilder
>>
>> https://unicode-org.atlassian.net/browse/ICU-22365
>>
>> File : icu4c/source/common/ulocbuilder.h
>>
>> // © 2023 and later: Unicode, Inc. and others.
>> // License & terms of use: http://www.unicode.org/copyright.html
>> #ifndef __ULOCBUILDER_H__
>> #define __ULOCBUILDER_H__
>>
>> #include "unicode/utypes.h"
>>
>>
>> /**
>>  * \file
>>  * \brief C API: Builder API for Locale
>>  */
>>
>> #ifndef U_HIDE_DRAFT_API
>>
>> /**
>>  * Opaque type for a Locale builder.
>>  * @draft ICU 74
>>  */
>> typedef void *ULocaleBuilder;
>>
>> /**
>>  * <code>ULocaleBuilder</code> is used to build stirng of valid
>> <code>locale</code>
>>  * from values configured by the setters.
>>  * The <code>ULocaleBuilder</code> checks if a value configured by a
>>  * setter satisfies the syntax requirements defined by the
>> <code>Locale</code>
>>  * class.  A string of Locale created by a <code>ULocaleBuilder</code> is
>>  * well-formed and can be transformed to a well-formed IETF BCP 47
>> language tag
>>  * without losing information.
>>  *
>>  * <p>The following example shows how to create a <code>locale</code>
>> string
>>  * with the <code>ULocaleBuilder</code>.
>>  * <blockquote>
>>  * <pre>
>>  *     UErrorCode err = U_ZERO_ERROR;
>>  *     char buffer[ULOC_FULLNAME_CAPACITY];
>>  *     ULocaleBuilder builder = ulb_open();
>>  *     ulb_setLanguage(builder, "sr");
>>  *     ulb_setScript(builder, "Latn");
>>  *     ulb_setRegion(builder, "RS");
>>  *     int32_t length = ulb_build(
>>  *         builder, buffer, ULOC_FULLNAME_CAPACITY, &error);
>>  *     ulb_close(builder);
>>  * </pre>
>>  * </blockquote>
>>  *
>>  * <p>ULocaleBuilders can be reused; <code>ulb_clear()</code> resets all
>>  * fields to their default values.
>>  *
>>  * <p>ULocaleBuilder tracks errors in an internal UErrorCode. For all
>> setters,
>>  * except ulb_setLanguageTag and ulb_setLocale, ULocaleBuilder will
>> return immediately
>>  * if the internal UErrorCode is in error state.
>>  * To reset internal state and error code, call clear method.
>>  * The ulb_setLanguageTag and setLocale method will first clear the
>> internal
>>  * UErrorCode, then track the error of the validation of the input
>> parameter
>>  * into the internal UErrorCode.
>>  *
>>  * @draft ICU 74
>>  */
>>
>> /**
>>  * Constructs an empty ULocaleBuilder. The default value of all
>>  * fields, extensions, and private use information is the
>>  * empty string. The created builder should be destoried by calling
>>  * ulb_close();
>>  *
>>  * @draft ICU 74
>>  */
>> U_CAPI ULocaleBuilder U_EXPORT2
>> ulb_open();
>>
>> /**
>>  * Close the builder and destroy it's internal states.
>>  * @param builder the builder
>>  * @draft ICU 74
>>  */
>> U_CAPI void U_EXPORT2
>> ulb_close(ULocaleBuilder builder);
>>
>> /**
>>  * Resets the <code>ULocaleBuilder</code> to match the provided
>>  * <code>locale</code>.  Existing state is discarded.
>>  *
>>  * <p>All fields of the locale must be well-formed.
>>  * <p>This method clears the internal UErrorCode.
>>  *
>>  * @param builder the builder
>>  * @param locale the locale
>>  *
>>  * @draft ICU 74
>>  */
>> U_CAPI void U_EXPORT2
>> ulb_setLocale(ULocaleBuilder builder, const char* locale);
>>
>> /**
>>  * Resets the ULocaleBuilder to match the provided IETF BCP 47 language
>> tag.
>>  * Discards the existing state.
>>  * The empty string causes the builder to be reset, like {@link #clear}.
>>  * Legacy language tags (marked as “Type: grandfathered” in BCP 47)
>>  * are converted to their canonical form before being processed.
>>  * Otherwise, the <code>language tag</code> must be well-formed,
>>  * or else the ulb_build() method will later report an
>> U_ILLEGAL_ARGUMENT_ERROR.
>>  *
>>  * <p>This method clears the internal UErrorCode.
>>  *
>>  * @param builder the builder
>>  * @param tag the language tag, defined as IETF BCP 47 language tag.
>>  * @draft ICU 74
>>  */
>> U_CAPI void U_EXPORT2
>> ulb_setLanguageTag(ULocaleBuilder builder, const char* tag);
>>
>> /**
>>  * Sets the language.  If <code>language</code> is the empty string, the
>>  * language in this <code>ULocaleBuilder</code> is removed. Otherwise, the
>>  * <code>language</code> must be well-formed, or else the ulb_build()
>> method will
>>  * later report an U_ILLEGAL_ARGUMENT_ERROR.
>>  *
>>  * <p>The syntax of language value is defined as
>>  * [unicode_language_subtag](
>> http://www.unicode.org/reports/tr35/tr35.html#unicode_language_subtag).
>>  *
>>  * @param builder the builder
>>  * @param language the language
>>  * @draft ICU 74
>>  */
>> U_CAPI void U_EXPORT2
>> ulb_setLanguage(ULocaleBuilder builder, const char* language);
>>
>> /**
>>  * Sets the script. If <code>script</code> is the empty string, the
>> script in
>>  * this <code>ULocaleBuilder</code> is removed.
>>  * Otherwise, the <code>script</code> must be well-formed, or else the
>> ulb_build()
>>  * method will later report an U_ILLEGAL_ARGUMENT_ERROR.
>>  *
>>  * <p>The script value is a four-letter script code as
>>  * [unicode_script_subtag](
>> http://www.unicode.org/reports/tr35/tr35.html#unicode_script_subtag)
>>  * defined by ISO 15924
>>  *
>>  * @param builder the builder
>>  * @param script the script
>>  * @draft ICU 74
>>  */
>> U_CAPI void U_EXPORT2
>> ulb_setScript(ULocaleBuilder builder, const char* script);
>>
>> /**
>>  * Sets the region.  If region is the empty string, the region in this
>>  * <code>ULocaleBuilder</code> is removed. Otherwise, the
>> <code>region</code>
>>  * must be well-formed, or else the ulb_build() method will later report
>> an
>>  * U_ILLEGAL_ARGUMENT_ERROR.
>>  *
>>  * <p>The region value is defined by
>>  *  [unicode_region_subtag](
>> http://www.unicode.org/reports/tr35/tr35.html#unicode_region_subtag)
>>  * as a two-letter ISO 3166 code or a three-digit UN M.49 area code.
>>  *
>>  * <p>The region value in the <code>Locale</code> created by the
>>  * <code>ULocaleBuilder</code> is always normalized to upper case.
>>  *
>>  * @param builder the builder
>>  * @param region the region
>>  * @draft ICU 74
>>  */
>> U_CAPI void U_EXPORT2
>> ulb_setRegion(ULocaleBuilder builder, const char* region);
>>
>> /**
>>  * Sets the variant.  If variant is the empty string, the variant in this
>>  * <code>LocaleBuilder</code> is removed.  Otherwise, the
>> <code>variant</code>
>>  * must be well-formed, or else the ulb_build() method will later report
>> an
>>  * U_ILLEGAL_ARGUMENT_ERROR.
>>  *
>>  * <p><b>Note:</b> This method checks if <code>variant</code>
>>  * satisfies the
>>  * [unicode_variant_subtag](
>> http://www.unicode.org/reports/tr35/tr35.html#unicode_variant_subtag)
>>  * syntax requirements, and normalizes the value to lowercase letters.
>> However,
>>  * the <code>Locale</code> class does not impose any syntactic
>>  * restriction on variant. To set an ill-formed variant, use a Locale
>> constructor.
>>  * If there are multiple unicode_variant_subtag, the caller must
>> concatenate
>>  * them with '-' as separator (ex: "foobar-fibar").
>>  *
>>  * @param builder the builder
>>  * @param variant the variant
>>  * @draft ICU 74
>>  */
>> U_CAPI void U_EXPORT2
>> ulb_setVariant(ULocaleBuilder builder, const char* variant);
>>
>> /**
>>  * Sets the extension for the given key. If the value is the empty string,
>>  * the extension is removed.  Otherwise, the <code>key</code> and
>>  * <code>value</code> must be well-formed, or else the ulb_build() method
>> will
>>  * later report an U_ILLEGAL_ARGUMENT_ERROR.
>>  *
>>  * <p><b>Note:</b> The key ('u') is used for the Unicode locale extension.
>>  * Setting a value for this key replaces any existing Unicode locale
>> key/type
>>  * pairs with those defined in the extension.
>>  *
>>  * <p><b>Note:</b> The key ('x') is used for the private use code. To be
>>  * well-formed, the value for this key needs only to have subtags of one
>> to
>>  * eight alphanumeric characters, not two to eight as in the general case.
>>  *
>>  * @param builder the builder
>>  * @param key the extension key
>>  * @param value the extension value
>>  * @draft ICU 74
>>  */
>> U_CAPI void U_EXPORT2
>> ulb_setExtension(ULocaleBuilder builder, char key, const char* value);
>>
>> /**
>>  * Sets the Unicode locale keyword type for the given key. If the type
>>  * StringPiece is constructed with a nullptr, the keyword is removed.
>>  * If the type is the empty string, the keyword is set without type
>> subtags.
>>  * Otherwise, the key and type must be well-formed, or else the
>> ulb_build()
>>  * method will later report an U_ILLEGAL_ARGUMENT_ERROR.
>>  *
>>  * <p>Keys and types are converted to lower case.
>>  *
>>  * <p><b>Note</b>:Setting the 'u' extension via {@link #setExtension}
>>  * replaces all Unicode locale keywords with those defined in the
>>  * extension.
>>  *
>>  * @param builder the builder
>>  * @param key the Unicode locale key
>>  * @param type the Unicode locale type
>>  * @return This builder.
>>  * @draft ICU 74
>>  */
>> U_CAPI void U_EXPORT2
>> ulb_setUnicodeLocaleKeyword(ULocaleBuilder builder,
>>         const char* key, const char* type);
>>
>> /**
>>  * Adds a unicode locale attribute, if not already present, otherwise
>>  * has no effect.  The attribute must not be empty string and must be
>>  * well-formed or U_ILLEGAL_ARGUMENT_ERROR will be set to status
>>  * during the ulb_build() call.
>>  *
>>  * @param builder the builder
>>  * @param attribute the attribute
>>  * @draft ICU 74
>>  */
>> U_CAPI void U_EXPORT2
>> ulb_addUnicodeLocaleAttribute(ULocaleBuilder builder, const char*
>> attribute);
>>
>> /**
>>  * Removes a unicode locale attribute, if present, otherwise has no
>>  * effect.  The attribute must not be empty string and must be well-formed
>>  * or U_ILLEGAL_ARGUMENT_ERROR will be set to status during the
>> ulb_build() call.
>>  *
>>  * <p>Attribute comparison for removal is case-insensitive.
>>  *
>>  * @param builder the builder
>>  * @param attribute the attribute
>>  * @draft ICU 74
>>  */
>> U_CAPI void U_EXPORT2
>> ulb_removeUnicodeLocaleAttribute(ULocaleBuilder builder, const char*
>> attribute);
>>
>> /**
>>  * Resets the builder to its initial, empty state.
>>  * <p>This method clears the internal UErrorCode.
>>  *
>>  * @param builder the builder
>>  * @draft ICU 74
>>  */
>> U_CAPI void U_EXPORT2
>> ulb_clear(ULocaleBuilder builder);
>>
>> /**
>>  * Resets the extensions to their initial, empty state.
>>  * Language, script, region and variant are unchanged.
>>  *
>>  * @param builder the builder
>>  * @draft ICU 74
>>  */
>> U_CAPI void U_EXPORT2
>> ulb_clearExtensions(ULocaleBuilder builder);
>>
>> /*
>>  * Build the Locale< stirng from the fields set
>>  * on this builder.
>>  * If any set methods or during the ulb_build() call require memory
>> allocation
>>  * but fail U_MEMORY_ALLOCATION_ERROR will be set to status.
>>  * If any of the fields set by the setters are not well-formed, the status
>>  * will be set to U_ILLEGAL_ARGUMENT_ERROR. The state of the builder will
>>  * not change after the ulb_build() call and the caller is free to keep
>> using
>>  * the same builder to build more locales.
>>  *
>>  * @param builder the builder
>>  * @param err the error code
>>  * @return the length of the locale id in buffer
>>  * @draft ICU 74
>>  */
>> U_CAPI int32_t U_EXPORT2
>> ulb_build(ULocaleBuilder builder, char* buffer, int32_t bufferCapacity,
>> UErrorCode* err);
>>
>> /**
>>  * Sets the UErrorCode if an error occurred while recording sets.
>>  * Preserves older error codes in the outErrorCode.
>>  *
>>  * @param builder the builder
>>  * @param outErrorCode Set to an error code that occurred while setting
>> subtags.
>>  *                  Unchanged if there is no such error or if outErrorCode
>>  *                  already contained an error.
>>  * @return true if U_FAILURE(*outErrorCode)
>>  * @draft ICU 74
>>  */
>> U_CAPI UBool U_EXPORT2
>> ulb_copyErrorTo(ULocaleBuilder builder, UErrorCode *outErrorCode);
>>
>> #endif  /* U_HIDE_DRAFT_API */
>>
>> #endif  // __ULOCBUILDER_H__
>> --
>> Frank Yung-Fong Tang
>> 譚永鋒 / 🌭🍊
>> Sr. Software Engineer
>> _______________________________________________
>> icu-design mailing list
>> icu...@li...
>> To Un/Subscribe: https://lists.sourceforge.net/lists/listinfo/icu-design
>>
>>
>>
>
> --
> Frank Yung-Fong Tang
> 譚永鋒 / 🌭🍊
> Sr. Software Engineer
>
>
>

-- 
Frank Yung-Fong Tang
譚永鋒 / 🌭🍊
Sr. Software Engineer

-- 
Frank Yung-Fong Tang
譚永鋒 / 🌭🍊
Sr. Software Engineer

[icu-design] Fwd: C API for LocaleBuilder

Open Source C/C++/Java libraries from Unicode

[icu-design] Fwd: C API for LocaleBuilder