|
From: Frank T. (譚. <ft...@go...> - 2023-06-30 23:33:39
|
On Fri, 30 Jun 2023 at 16:04, Steven R. Loomis <sr...@gm...> wrote: > Yes, there are separate functions in the C API. ucal, etc however uses. > The performance argument could be mitigated with inline code I suppose. > inline code? Does ICU allow inline in the public C API? Do we currently have any C API inline under common/unicode and i18n/unicode which are accessible by C code? > There’s a cost to each function. For wrapping or exporting it to another > language (some kind of binding) it’s yet another function. > > my ¤0.02 > > -- > Steven R. Loomis > Code Hive Tx, LLC > https://codehivetx.us > > > > On Jun 30, 2023, at 5:53 PM, Frank Tang (譚永鋒) via icu-design < > icu...@li...> wrote: > > > > ---------- Forwarded message --------- > From: Frank Tang (譚永鋒) <ft...@go...> > Date: Fri, 30 Jun 2023 at 15:53 > Subject: Re: [icu-design] C API for LocaleBuilder > To: Rich Gillam <ric...@ap...> > > > > > On Fri, 30 Jun 2023 at 11:00, Rich Gillam <ric...@ap...> > wrote: > >> Frank-- >> >> That would be a different approach to construct a locale. However, I >> would like to point out several things here. >> 1. That approach won't be a "builder" >> >> >> But it would solve the problem we’re being asked to solve. >> > > no, it does not. What the bug asked for is a C builder API and whatever > the API does not follow a "builder" design pattern does not solve the > problem it asked for. A builder needs to be step by step so the decision of > each field could be spread in different times and different places in the > code. A single function with a struct does not allow that, and therefore > not fulfilling the request and requirement. > > I am open to a concrete counter proposal. It is simple to think the issue > could be solved by a simple set() w/ a constant if you only consider the 1) > language, 2) script, 3) region, 4) variant, 5) languageTag, and 6) localeId > as the only set of input. But once you start to consider 7) attributes, 8) > unicode locale keyword/value, and 9) other extensions than "-u-" (for > example "-t-" or anything in other_extension) things become complicated. > The validation and override/replacement logic for each of them will be > different > > > >> >> but what you describe would require the caller to resolve that before >> putting it into the struct. >> >> >> No, it wouldn’t. For building locales, there’s nothing special going on >> while you’re setting up the builder— it’s just a bag of fields. >> > > no, that is not true, it is NOT a bag of fields, it is an order list of > fields, not a bag => the order matter. > in bag sematic , the order does not matter > > Consider the following > > languageTag "en-Latn-GB-u-ca-roc" > language "ja" > script "Thai" > region "CN" > variant "hepburn-heploc" > unicode extension "ca"= "japanese" > language "sl" > variant "bohoric" > unicode extension "ca"= "coptic" > > which will produce sl-Thai-CN-bohoric-u-ca-coptic" > > but if you treat them as a bug and build that the order in > language "ja" > script "Thai" > region "CN" > variant "hepburn-heploc" > unicode extension "ca"= "japanese" > language "sl" > languageTag "en-Latn-GB-u-ca-roc" > variant "bohoric" > unicode extension "ca"= "coptic" > > you will get > which will produce en-Latn-GB-bohoric-u-ca-coptic" > > which result in different value > > > > >> All the magic happens when you run the build() method. >> > > That should be implementation depend and should not be mandated by the > API, there are nothing wrong to progressively build when each setting > method got called. > > >> I’m just saying that maybe instead of adding a new opaque class with a >> bunch of individual getters and setters for all those fields, we just use a >> simple struct (or a hash table, or something). >> > > As I mentioned, what you said only make sense when the underline problem is > *a bag of fields*, but the problem is this problem is not just a* bag of > fields.* The information pass into the builder have overlapping with each > other, in particular the information inside localeID and languageTag is not > multurelly exclusive from language/script/region/variant/attribute/unicode > locale keyword+value and other extension. They over lap and part of them > could be override by another. If all the input are muturally exclusive from > each other, then the problem is much simpler and possible could be solved > by "just use a simple struct" . But that is NOT the case.(as the example I > show above) now. > > > >> The approach you mentioned seems to require the caller to do a lot of >> work to fill that struct. and the caller also needs to learn how to program >> UHashTable while using that API, is that better? >> >> >> Good point. UHashTable is a pain in the butt. Maybe that’s enough right >> there to swing me back around to what you’ve already proposed… >> >> That said, I do kind of like Steven’s suggestion to just have one set() >> method and just have constants for the individual fields... >> > > If we look at the pre-existing locale C API in uloc.h we have different > function instead of a get w/ a constant > uloc_getLanguage(.... > uloc_getScript(.... > uloc_getCountry(.... > uloc_getVariant(.... > uloc_getName(.... > uloc_getBaseName(... > > notice there is no API to get attributes. > > we could also use a uloc_get() with a constant, right? but we didn't years > ago. so... should we follow the same design pattern or to a set w/ constant > > > > We have the following different kind of operation to the builder > > 1. language (w/o script and region), region, script, variant, language tag > (w/ script or region) => set > 2. attribute => add and remove (there could be multiple attributes in a > locale > 3. unicode extension set on a key ("ca" , "japanese") > 4. other extension - set on a extension ('a', "abc-efg") > 5. clear extension (say you start from "en-u-ca-japanese-nu-Thai" and you > want to clear the calendar and change nu to "Arab" to "en-u-nu-Arab") > > a simple set w/ constant method could simplify #1 above, but not powerful > enough to deal with #2 #3 #4 and #5 > Also, I personally dislike the set + enum => internally swtich(enum) > approach because almost all the time the caller know what operation need to > perform in the compile time, but this approach mandate the function > internally switch based on a value which is fixed in the program but cannot > be easily optimized by the compiler. > > Consider the following case > > > setA(context, value) { > context.a = value; > } > > setB(context, value) { > context.b = value; > } > and the caller > > setA(context, 3); > setB(context, 5); > > vs > set(context, field, value) { > switch(field) { > case A: context.a = value; break; > case B: context.b = value; break; > } > } > and the caller > set(context, A, 3); > set(context, B, 5); > > The swtich / case inside the set() simply cannot be optimized by the > compiler > > surely the first approach expose more functions but such exposure allow > the compiler to avoid unnecessary run time hit by the switch/case which > could be resovled in the compile time, not during the run time. > > > >> —Rich >> >> On Jun 30, 2023, at 12:44 AM, Frank Tang (譚永鋒) <ft...@go...> wrote: >> >> Dear Richard >> >> On Thu, 29 Jun 2023 at 18:31, Richard Gillam <ric...@ap...> >> wrote: >> >>> Frank— >>> >>> This looks fine, but also kind of heavy-weight. I wonder if it’d make >>> sense to just have a struct with all the fields and utility methods to >>> convert between a locale ID string and that struct? >>> >> >> That would be a different approach to construct a locale. However, I >> would like to point out several things here. >> 1. That approach won't be a "builder" >> The definition of a Builder design pattern is >> "Builder is a creational design pattern that lets you construct complex >> objects step by step. " >> "Problem >> Imagine a complex object that requires laborious, step-by-step >> initialization of many fields and nested objects. Such initialization code >> is usually buried inside a monstrous constructor with lots of parameters. >> Or even worse: scattered all over the client code." >> >> In other words, the key concept of "builder" itself is "step by step" and >> that allows the builder to get passing to different parts of the system to >> allow different part of the software to add information into the builder, >> without a centralized place to gather all the information together. >> see https://refactoring.guru/design-patterns/builder >> >> 2. The step by step processing also means the timing of information >> putting in matter and the later information overrides the previous >> information. >> for example, I could have a >> ULocaleBuilder builder = ulb_open(); >> ulb_setLanguageTag(builder, "en-Latn-GB-u-ca-islamic"); >> ulb_setRegion(builder, "FR"); >> ulb_setUnicodeExtension(builder, 'x', "abcd"); >> ..... >> ulb_setUnicodeLocaleKeyword(builder, "ca", "japanese"); >> ulb_setRegion(builder, "DE"); >> ulb_setUnicodeLocaleKeyword(builder, "ca", "roc"); >> ulb_build(builder...) >> ulb_close(builder) >> >> to build a locale "en-Latn-DE-u-ca-roc-x-abcd" >> >> but what you describe would require the caller to resolve that before >> putting it into the struct. >> >> I guess the weak point in that approach would be the key-value pairs at >>> the end of the locale ID— an alternative approach might be to use a >>> UHashTable (or whatever it's called) as the bearer of fields (or to provide >>> both the struct and the hash table, depending on whether the caller cares >>> about the key-value fields). The beauty of an approach like this is that >>> you save a few heap allocations and don’t have to add as many new API >>> functions. >>> >> >> The approach you mentioned seems to require the caller to do a lot of >> work to fill that struct. and the caller also needs to learn how to program >> UHashTable while using that API, is that better? >> >> >>> On the other hand, what you have is a thin wrapper around the existing >>> C++ LocaleBuilder class, so you’re not having to implement the >>> locale-mangling code all over again. >>> >>> I don’t know— what does everybody else think? >>> >>> —Rich >>> >>> On Jun 29, 2023, at 5:09 PM, Frank Tang (譚永鋒) via icu-design < >>> icu...@li...> wrote: >>> >>> >>> >>> Dear ICU team & users, >>> >>> >>> I would like to propose the following for: ICU 74 >>> >>> Please provide feedback by: Next Wednesday, July 5, or any time >>> sufficiently in advance of the feature freeze >>> >>> Designated API review: Gary Wade >>> >>> Issue: https://unicode-org.atlassian.net/browse/ICU-22365 >>> A draft PR and implementation (not including test yet) could be found at >>> https://github.com/unicode-org/icu/pull/2520 ) >>> >>> This is to follow up the C++ API in ICU 64 and provide a C API requested >>> in >>> >>> Issue to be discussed >>> >>> 1. Should the prefix be ulb_ or something else? >>> 2. What should ulb_build build? the value as returned by >>> Locale::getName() or Locale::toLanguageTag()? >>> 3. is it to use file ulocbuilder.{h,cpp}? or should it be something else? >>> >>> Add C API for ULocaleBuilder >>> >>> https://unicode-org.atlassian.net/browse/ICU-22365 >>> >>> File : icu4c/source/common/ulocbuilder.h >>> >>> // © 2023 and later: Unicode, Inc. and others. >>> // License & terms of use: http://www.unicode.org/copyright.html >>> #ifndef __ULOCBUILDER_H__ >>> #define __ULOCBUILDER_H__ >>> >>> #include "unicode/utypes.h" >>> >>> >>> /** >>> * \file >>> * \brief C API: Builder API for Locale >>> */ >>> >>> #ifndef U_HIDE_DRAFT_API >>> >>> /** >>> * Opaque type for a Locale builder. >>> * @draft ICU 74 >>> */ >>> typedef void *ULocaleBuilder; >>> >>> /** >>> * <code>ULocaleBuilder</code> is used to build stirng of valid >>> <code>locale</code> >>> * from values configured by the setters. >>> * The <code>ULocaleBuilder</code> checks if a value configured by a >>> * setter satisfies the syntax requirements defined by the >>> <code>Locale</code> >>> * class. A string of Locale created by a <code>ULocaleBuilder</code> is >>> * well-formed and can be transformed to a well-formed IETF BCP 47 >>> language tag >>> * without losing information. >>> * >>> * <p>The following example shows how to create a <code>locale</code> >>> string >>> * with the <code>ULocaleBuilder</code>. >>> * <blockquote> >>> * <pre> >>> * UErrorCode err = U_ZERO_ERROR; >>> * char buffer[ULOC_FULLNAME_CAPACITY]; >>> * ULocaleBuilder builder = ulb_open(); >>> * ulb_setLanguage(builder, "sr"); >>> * ulb_setScript(builder, "Latn"); >>> * ulb_setRegion(builder, "RS"); >>> * int32_t length = ulb_build( >>> * builder, buffer, ULOC_FULLNAME_CAPACITY, &error); >>> * ulb_close(builder); >>> * </pre> >>> * </blockquote> >>> * >>> * <p>ULocaleBuilders can be reused; <code>ulb_clear()</code> resets all >>> * fields to their default values. >>> * >>> * <p>ULocaleBuilder tracks errors in an internal UErrorCode. For all >>> setters, >>> * except ulb_setLanguageTag and ulb_setLocale, ULocaleBuilder will >>> return immediately >>> * if the internal UErrorCode is in error state. >>> * To reset internal state and error code, call clear method. >>> * The ulb_setLanguageTag and setLocale method will first clear the >>> internal >>> * UErrorCode, then track the error of the validation of the input >>> parameter >>> * into the internal UErrorCode. >>> * >>> * @draft ICU 74 >>> */ >>> >>> /** >>> * Constructs an empty ULocaleBuilder. The default value of all >>> * fields, extensions, and private use information is the >>> * empty string. The created builder should be destoried by calling >>> * ulb_close(); >>> * >>> * @draft ICU 74 >>> */ >>> U_CAPI ULocaleBuilder U_EXPORT2 >>> ulb_open(); >>> >>> /** >>> * Close the builder and destroy it's internal states. >>> * @param builder the builder >>> * @draft ICU 74 >>> */ >>> U_CAPI void U_EXPORT2 >>> ulb_close(ULocaleBuilder builder); >>> >>> /** >>> * Resets the <code>ULocaleBuilder</code> to match the provided >>> * <code>locale</code>. Existing state is discarded. >>> * >>> * <p>All fields of the locale must be well-formed. >>> * <p>This method clears the internal UErrorCode. >>> * >>> * @param builder the builder >>> * @param locale the locale >>> * >>> * @draft ICU 74 >>> */ >>> U_CAPI void U_EXPORT2 >>> ulb_setLocale(ULocaleBuilder builder, const char* locale); >>> >>> /** >>> * Resets the ULocaleBuilder to match the provided IETF BCP 47 language >>> tag. >>> * Discards the existing state. >>> * The empty string causes the builder to be reset, like {@link #clear}. >>> * Legacy language tags (marked as “Type: grandfathered” in BCP 47) >>> * are converted to their canonical form before being processed. >>> * Otherwise, the <code>language tag</code> must be well-formed, >>> * or else the ulb_build() method will later report an >>> U_ILLEGAL_ARGUMENT_ERROR. >>> * >>> * <p>This method clears the internal UErrorCode. >>> * >>> * @param builder the builder >>> * @param tag the language tag, defined as IETF BCP 47 language tag. >>> * @draft ICU 74 >>> */ >>> U_CAPI void U_EXPORT2 >>> ulb_setLanguageTag(ULocaleBuilder builder, const char* tag); >>> >>> /** >>> * Sets the language. If <code>language</code> is the empty string, the >>> * language in this <code>ULocaleBuilder</code> is removed. Otherwise, >>> the >>> * <code>language</code> must be well-formed, or else the ulb_build() >>> method will >>> * later report an U_ILLEGAL_ARGUMENT_ERROR. >>> * >>> * <p>The syntax of language value is defined as >>> * [unicode_language_subtag]( >>> http://www.unicode.org/reports/tr35/tr35.html#unicode_language_subtag). >>> * >>> * @param builder the builder >>> * @param language the language >>> * @draft ICU 74 >>> */ >>> U_CAPI void U_EXPORT2 >>> ulb_setLanguage(ULocaleBuilder builder, const char* language); >>> >>> /** >>> * Sets the script. If <code>script</code> is the empty string, the >>> script in >>> * this <code>ULocaleBuilder</code> is removed. >>> * Otherwise, the <code>script</code> must be well-formed, or else the >>> ulb_build() >>> * method will later report an U_ILLEGAL_ARGUMENT_ERROR. >>> * >>> * <p>The script value is a four-letter script code as >>> * [unicode_script_subtag]( >>> http://www.unicode.org/reports/tr35/tr35.html#unicode_script_subtag) >>> * defined by ISO 15924 >>> * >>> * @param builder the builder >>> * @param script the script >>> * @draft ICU 74 >>> */ >>> U_CAPI void U_EXPORT2 >>> ulb_setScript(ULocaleBuilder builder, const char* script); >>> >>> /** >>> * Sets the region. If region is the empty string, the region in this >>> * <code>ULocaleBuilder</code> is removed. Otherwise, the >>> <code>region</code> >>> * must be well-formed, or else the ulb_build() method will later report >>> an >>> * U_ILLEGAL_ARGUMENT_ERROR. >>> * >>> * <p>The region value is defined by >>> * [unicode_region_subtag]( >>> http://www.unicode.org/reports/tr35/tr35.html#unicode_region_subtag) >>> * as a two-letter ISO 3166 code or a three-digit UN M.49 area code. >>> * >>> * <p>The region value in the <code>Locale</code> created by the >>> * <code>ULocaleBuilder</code> is always normalized to upper case. >>> * >>> * @param builder the builder >>> * @param region the region >>> * @draft ICU 74 >>> */ >>> U_CAPI void U_EXPORT2 >>> ulb_setRegion(ULocaleBuilder builder, const char* region); >>> >>> /** >>> * Sets the variant. If variant is the empty string, the variant in this >>> * <code>LocaleBuilder</code> is removed. Otherwise, the >>> <code>variant</code> >>> * must be well-formed, or else the ulb_build() method will later report >>> an >>> * U_ILLEGAL_ARGUMENT_ERROR. >>> * >>> * <p><b>Note:</b> This method checks if <code>variant</code> >>> * satisfies the >>> * [unicode_variant_subtag]( >>> http://www.unicode.org/reports/tr35/tr35.html#unicode_variant_subtag) >>> * syntax requirements, and normalizes the value to lowercase letters. >>> However, >>> * the <code>Locale</code> class does not impose any syntactic >>> * restriction on variant. To set an ill-formed variant, use a Locale >>> constructor. >>> * If there are multiple unicode_variant_subtag, the caller must >>> concatenate >>> * them with '-' as separator (ex: "foobar-fibar"). >>> * >>> * @param builder the builder >>> * @param variant the variant >>> * @draft ICU 74 >>> */ >>> U_CAPI void U_EXPORT2 >>> ulb_setVariant(ULocaleBuilder builder, const char* variant); >>> >>> /** >>> * Sets the extension for the given key. If the value is the empty >>> string, >>> * the extension is removed. Otherwise, the <code>key</code> and >>> * <code>value</code> must be well-formed, or else the ulb_build() >>> method will >>> * later report an U_ILLEGAL_ARGUMENT_ERROR. >>> * >>> * <p><b>Note:</b> The key ('u') is used for the Unicode locale >>> extension. >>> * Setting a value for this key replaces any existing Unicode locale >>> key/type >>> * pairs with those defined in the extension. >>> * >>> * <p><b>Note:</b> The key ('x') is used for the private use code. To be >>> * well-formed, the value for this key needs only to have subtags of one >>> to >>> * eight alphanumeric characters, not two to eight as in the general >>> case. >>> * >>> * @param builder the builder >>> * @param key the extension key >>> * @param value the extension value >>> * @draft ICU 74 >>> */ >>> U_CAPI void U_EXPORT2 >>> ulb_setExtension(ULocaleBuilder builder, char key, const char* value); >>> >>> /** >>> * Sets the Unicode locale keyword type for the given key. If the type >>> * StringPiece is constructed with a nullptr, the keyword is removed. >>> * If the type is the empty string, the keyword is set without type >>> subtags. >>> * Otherwise, the key and type must be well-formed, or else the >>> ulb_build() >>> * method will later report an U_ILLEGAL_ARGUMENT_ERROR. >>> * >>> * <p>Keys and types are converted to lower case. >>> * >>> * <p><b>Note</b>:Setting the 'u' extension via {@link #setExtension} >>> * replaces all Unicode locale keywords with those defined in the >>> * extension. >>> * >>> * @param builder the builder >>> * @param key the Unicode locale key >>> * @param type the Unicode locale type >>> * @return This builder. >>> * @draft ICU 74 >>> */ >>> U_CAPI void U_EXPORT2 >>> ulb_setUnicodeLocaleKeyword(ULocaleBuilder builder, >>> const char* key, const char* type); >>> >>> /** >>> * Adds a unicode locale attribute, if not already present, otherwise >>> * has no effect. The attribute must not be empty string and must be >>> * well-formed or U_ILLEGAL_ARGUMENT_ERROR will be set to status >>> * during the ulb_build() call. >>> * >>> * @param builder the builder >>> * @param attribute the attribute >>> * @draft ICU 74 >>> */ >>> U_CAPI void U_EXPORT2 >>> ulb_addUnicodeLocaleAttribute(ULocaleBuilder builder, const char* >>> attribute); >>> >>> /** >>> * Removes a unicode locale attribute, if present, otherwise has no >>> * effect. The attribute must not be empty string and must be >>> well-formed >>> * or U_ILLEGAL_ARGUMENT_ERROR will be set to status during the >>> ulb_build() call. >>> * >>> * <p>Attribute comparison for removal is case-insensitive. >>> * >>> * @param builder the builder >>> * @param attribute the attribute >>> * @draft ICU 74 >>> */ >>> U_CAPI void U_EXPORT2 >>> ulb_removeUnicodeLocaleAttribute(ULocaleBuilder builder, const char* >>> attribute); >>> >>> /** >>> * Resets the builder to its initial, empty state. >>> * <p>This method clears the internal UErrorCode. >>> * >>> * @param builder the builder >>> * @draft ICU 74 >>> */ >>> U_CAPI void U_EXPORT2 >>> ulb_clear(ULocaleBuilder builder); >>> >>> /** >>> * Resets the extensions to their initial, empty state. >>> * Language, script, region and variant are unchanged. >>> * >>> * @param builder the builder >>> * @draft ICU 74 >>> */ >>> U_CAPI void U_EXPORT2 >>> ulb_clearExtensions(ULocaleBuilder builder); >>> >>> /* >>> * Build the Locale< stirng from the fields set >>> * on this builder. >>> * If any set methods or during the ulb_build() call require memory >>> allocation >>> * but fail U_MEMORY_ALLOCATION_ERROR will be set to status. >>> * If any of the fields set by the setters are not well-formed, the >>> status >>> * will be set to U_ILLEGAL_ARGUMENT_ERROR. The state of the builder will >>> * not change after the ulb_build() call and the caller is free to keep >>> using >>> * the same builder to build more locales. >>> * >>> * @param builder the builder >>> * @param err the error code >>> * @return the length of the locale id in buffer >>> * @draft ICU 74 >>> */ >>> U_CAPI int32_t U_EXPORT2 >>> ulb_build(ULocaleBuilder builder, char* buffer, int32_t bufferCapacity, >>> UErrorCode* err); >>> >>> /** >>> * Sets the UErrorCode if an error occurred while recording sets. >>> * Preserves older error codes in the outErrorCode. >>> * >>> * @param builder the builder >>> * @param outErrorCode Set to an error code that occurred while setting >>> subtags. >>> * Unchanged if there is no such error or if >>> outErrorCode >>> * already contained an error. >>> * @return true if U_FAILURE(*outErrorCode) >>> * @draft ICU 74 >>> */ >>> U_CAPI UBool U_EXPORT2 >>> ulb_copyErrorTo(ULocaleBuilder builder, UErrorCode *outErrorCode); >>> >>> #endif /* U_HIDE_DRAFT_API */ >>> >>> #endif // __ULOCBUILDER_H__ >>> -- >>> Frank Yung-Fong Tang >>> 譚永鋒 / 🌭🍊 >>> Sr. Software Engineer >>> _______________________________________________ >>> icu-design mailing list >>> icu...@li... >>> To Un/Subscribe: https://lists.sourceforge.net/lists/listinfo/icu-design >>> >>> >>> >> >> -- >> Frank Yung-Fong Tang >> 譚永鋒 / 🌭🍊 >> Sr. Software Engineer >> >> >> > > -- > Frank Yung-Fong Tang > 譚永鋒 / 🌭🍊 > Sr. Software Engineer > > > -- > Frank Yung-Fong Tang > 譚永鋒 / 🌭🍊 > Sr. Software Engineer > _______________________________________________ > icu-design mailing list > icu...@li... > To Un/Subscribe: https://lists.sourceforge.net/lists/listinfo/icu-design > > > -- Frank Yung-Fong Tang 譚永鋒 / 🌭🍊 Sr. Software Engineer |