From: Mihai ⦅U⦆ N. <mih...@un...> - 2024-03-18 20:33:54
|
Dear ICU team & users,

*I would like to propose the following for: ICU 75*

Please provide feedback by: next Wednesday, Mar 27, or any time sufficiently in advance of the feature freeze.

*Note* that the changes were either requested by the TC (renaming Mf2* classes to MF*) or needed to implement the updated spec.

*Designated API reviewer:* Markus Scherer
https://unicode-org.atlassian.net/browse/ICU-22690

*Link to detailed document:* *"ICU4J: MessageFormat 2 APIs, LDML 45 Tech Preview, spring 2024"*
https://docs.google.com/document/d/1CzzbO5Es-OVMG6OnOMOgJwyU0zn8VlD_LJX3i9kMsuA/edit?usp=sharing&resourcekey=0-CLzUfOMJA01jv16aEDj65w

Summary:

- Renamed classes from Mf2* to MF* as requested by the TC
  - Renamed Mf2FunctionRegistry to MFFunctionRegistry
  - Renamed Mf2DataModel to MFDataModel
  - Renamed Mf2Parser to MFParser
  - Renamed Mf2Serializer to MFSerializer
- Caused by spec changes:
  - The content of the MFDataModel changed completely. There was no spec before, now there is, and the new implementation follows it.
  - Because the spec changed the selection algorithm, the Selector interface changed
    from: boolean matches(Object value, String key, Map<String, Object> variableOptions)
    to: List<String> matches(Object value, List<String> keys, Map<String, Object> variableOptions)
  - MessageFormatter APIs that took / returned a Mf2DataModel now take / return MFDataModel.Message (2 methods)
  - MFParseException is a standalone class, instead of inner to MFParser
- Implementation details:
  - MFParser: content changed completely. But it is an implementation detail, no public APIs affected.
    One public API: static MFDataModel.Message parse(String pattern)
  - MFSerializer: content changed completely. But it is an implementation detail, no public APIs affected.
    One public API: static String dataModelToString(MFDataModel.Message)

Thank you very much,
Mihai |
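A minimal sketch of what the Selector signature change means for existing per-key matchers. The interfaces `OldSelector` and `NewSelector` are hypothetical stand-ins that copy only the two method signatures quoted in the summary; they are not the ICU4J types. The adapter's behavior (returning the matching keys in the order they were passed in) is an assumption made for illustration, not something the summary specifies.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class SelectorDemo {
    public static void main(String[] args) {
        // Toy matcher: a key matches when it equals the value's string form.
        OldSelector exact = (value, key, options) -> key.equals(String.valueOf(value));
        NewSelector selector = SelectorAdapter.adapt(exact);
        System.out.println(selector.matches(1, List.of("0", "1", "other"), Map.of()));
    }
}

// Hypothetical stand-in for the old shape: one key per call, yes/no answer.
interface OldSelector {
    boolean matches(Object value, String key, Map<String, Object> variableOptions);
}

// Hypothetical stand-in for the new shape: all keys in one call, list answer.
interface NewSelector {
    List<String> matches(Object value, List<String> keys, Map<String, Object> variableOptions);
}

final class SelectorAdapter {
    // Wraps a per-key matcher so that it can answer the list-based call.
    // Assumption: matching keys are returned in the order they were given.
    static NewSelector adapt(OldSelector old) {
        return (value, keys, options) -> {
            List<String> matched = new ArrayList<>();
            for (String key : keys) {
                if (old.matches(value, key, options)) {
                    matched.add(key);
                }
            }
            return matched;
        };
    }
}
```

With the list-based shape the selector sees every candidate key at once, so an implementation is free to rank or filter them together instead of answering one key at a time.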
From: Markus S. <mar...@gm...> - 2024-03-14 17:47:57
|
On Thu, Mar 14, 2024 at 7:16 AM Mark Davis Ⓤ <ma...@un...> wrote:
> Lgtm. I like the modernization for the Java API

Thanks!

This was approved in the TC meeting today with the following changes:

- Return type of u_hasIDType() changed from UBool to bool
- Document for getIDTypes() that the maximum number is limited by the number of enum constants, and by the number of types that can be combined with others.

For the latter, I am writing this:

 * Each code point maps to a <i>set</i> of UIdentifierType values.
 * There is always at least one type.
 * The order of output values is undefined.
 * Each type is output at most once;
 * there cannot be more output values than UIdentifierType constants.
 * In addition, only some of the types can be combined with others,
 * and usually only a small number of types occur together.
 * Future versions might add additional types.
 * See UTS #39 and its data files for details.

Best regards,
markus |
From: Tim C. <tj...@ig...> - 2024-03-14 16:07:17
|
Hello,

As the TC plans to approve the design document for the MessageFormat v2 tech preview for C++ during the coming week, I wanted to provide one more chance for any interested parties to comment on the API changes proposed in the design document:

Doc: https://docs.google.com/document/d/1qi5zeYevPshu-Gr8-nn7Rsq-WFOAx_zNxOVJkRz5cWM/edit?pli=1&resourcekey=0-dUYbqvZh5vTEJRPNoZ4ehA

Ticket: https://unicode-org.atlassian.net/browse/ICU-22261

The beginning of the document summarizes changes made to it since my previous email to this list in February.

Thanks,
Tim |
From: Markus S. <mar...@gm...> - 2024-03-14 05:28:10
|
Dear ICU team & users,

I would like to propose the following API for: *ICU 75*
Please provide feedback by: *next Wednesday, 2024-03-20*
Designated API reviewer: *Mark*
Ticket: https://unicode-org.atlassian.net/browse/ICU-11396
WIP PR: https://github.com/unicode-org/icu/pull/2879

UTS #39 “Unicode Security Mechanisms” defines two properties Identifier_Status & Identifier_Type. These are outside of the Unicode Character Database (UCD) per se, but are formally defined in that spec and in its data files, together with the names of these properties and the names of their values.
See https://www.unicode.org/reports/tr39/#Identifier_Status_and_Type

The Identifier_Status is a simple enumerated property. It only has two values, so could have been a binary property, but it was defined as enumerated in case it became useful to add another value.

The Identifier_Type maps each code point to a *set* of enumerated values (at least one). In this way, it is similar to Script_Extensions. However, while Script_Extensions are sets of values of another property (Script), the Identifier_Type sets contain values which are defined only for this property.

For the API, I propose a set of enum values for each property, and a pair of functions like those for Script_Extensions.

In C, these look very much like existing APIs.

In Java, I propose something slightly new: In the past, we have defined enumerated property values as *public static final int* constants, and UScript.getScriptExtensions() populated a BitSet of those int values. For these new properties, I propose real Java enums, and for UCharacter.getIdentifierTypes() to populate an EnumSet.

The function names follow conventions of existing, similar functions, with "ID" in C and "Identifier" in Java.

These properties will also become available in UnicodeSet patterns.

*C: uchar.h*

/**
 * Enumerated property Identifier_Status.
 * Used for UTS #39 General Security Profile for Identifiers
 * (https://www.unicode.org/reports/tr39/#General_Security_Profile).
 * @draft ICU 75
 */
UCHAR_IDENTIFIER_STATUS=0x1019,

/**
 * Miscellaneous property Identifier_Type.
 * Used for UTS #39 General Security Profile for Identifiers
 * (https://www.unicode.org/reports/tr39/#General_Security_Profile).
 *
 * Corresponds to u_hasIDType() and u_getIDTypes().
 *
 * Each code point maps to a <i>set</i> of UIdentifierType values.
 *
 * @see u_hasIDType
 * @see u_getIDTypes
 * @draft ICU 75
 */
UCHAR_IDENTIFIER_TYPE=0x7001,

/**
 * Identifier Status constants.
 * See https://www.unicode.org/reports/tr39/#Identifier_Status_and_Type.
 *
 * @see UCHAR_IDENTIFIER_STATUS
 * @draft ICU 75
 */
typedef enum UIdentifierStatus {
    /** @draft ICU 75 */
    U_ID_STATUS_RESTRICTED,
    /** @draft ICU 75 */
    U_ID_STATUS_ALLOWED,
} UIdentifierStatus;

/**
 * Identifier Type constants.
 * See https://www.unicode.org/reports/tr39/#Identifier_Status_and_Type.
 *
 * @see UCHAR_IDENTIFIER_TYPE
 * @draft ICU 75
 */
typedef enum UIdentifierType {
    /** @draft ICU 75 */
    U_ID_TYPE_NOT_CHARACTER,
    /** @draft ICU 75 */
    U_ID_TYPE_DEPRECATED,
    /** @draft ICU 75 */
    U_ID_TYPE_DEFAULT_IGNORABLE,
    /** @draft ICU 75 */
    U_ID_TYPE_NOT_NFKC,
    /** @draft ICU 75 */
    U_ID_TYPE_NOT_XID,
    /** @draft ICU 75 */
    U_ID_TYPE_EXCLUSION,
    /** @draft ICU 75 */
    U_ID_TYPE_OBSOLETE,
    /** @draft ICU 75 */
    U_ID_TYPE_TECHNICAL,
    /** @draft ICU 75 */
    U_ID_TYPE_UNCOMMON_USE,
    /** @draft ICU 75 */
    U_ID_TYPE_LIMITED_USE,
    /** @draft ICU 75 */
    U_ID_TYPE_INCLUSION,
    /** @draft ICU 75 */
    U_ID_TYPE_RECOMMENDED,
} UIdentifierType;

/**
 * Does the set of Identifier_Type values of code point c contain the given type?
 *
 * Used for UTS #39 General Security Profile for Identifiers
 * (https://www.unicode.org/reports/tr39/#General_Security_Profile).
 *
 * Each code point maps to a <i>set</i> of UIdentifierType values.
 *
 * @param c code point
 * @param type Identifier_Type to check
 * @return true if type is in Identifier_Type(c)
 * @draft ICU 75
 */
U_CAPI UBool U_EXPORT2
u_hasIDType(UChar32 c, UIdentifierType type);

/**
 * Writes code point c's Identifier_Type as a list of UIdentifierType values
 * to the output types array and returns the number of types.
 *
 * Used for UTS #39 General Security Profile for Identifiers
 * (https://www.unicode.org/reports/tr39/#General_Security_Profile).
 *
 * Each code point maps to a <i>set</i> of UIdentifierType values.
 * There is always at least one type.
 *
 * If there are more than capacity types to be written, then
 * U_BUFFER_OVERFLOW_ERROR is set and the number of types is returned.
 * (Usual ICU buffer handling behavior.)
 *
 * @param c code point
 * @param types output array
 * @param capacity capacity of the array
 * @param pErrorCode Standard ICU error code. Its input value must
 *                   pass the U_SUCCESS() test, or else the function returns
 *                   immediately. Check for U_FAILURE() on output or use with
 *                   function chaining. (See User Guide for details.)
 * @return number of values in c's Identifier_Type,
 *         written to types unless U_BUFFER_OVERFLOW_ERROR indicates insufficient capacity
 * @draft ICU 75
 */
U_CAPI int32_t U_EXPORT2
u_getIDTypes(UChar32 c, UIdentifierType *types, int32_t capacity, UErrorCode *pErrorCode);

*Java: UProperty.java*

/**
 * Enumerated property Identifier_Status.
 * Used for UTS #39 General Security Profile for Identifiers
 * (https://www.unicode.org/reports/tr39/#General_Security_Profile).
 * @draft ICU 75
 */
public static final int IDENTIFIER_STATUS = 0x1019;

/**
 * Miscellaneous property Identifier_Type.
 * Used for UTS #39 General Security Profile for Identifiers
 * (https://www.unicode.org/reports/tr39/#General_Security_Profile).
 *
 * <p>Corresponds to {@link UCharacter#hasIdentifierType(int, UCharacter.IdentifierType)} and
 * {@link UCharacter#getIdentifierTypes(int, java.util.EnumSet)}.
 *
 * <p>Each code point maps to a <i>set</i> of IdentifierType values.
 *
 * @see UCharacter#hasIdentifierType(int, UCharacter.IdentifierType)
 * @see UCharacter#getIdentifierTypes(int, java.util.EnumSet)
 * @draft ICU 75
 */
public static final int IDENTIFIER_TYPE = 0x7001;

*Java: UCharacter.java*

/**
 * Identifier Status constants.
 * See https://www.unicode.org/reports/tr39/#Identifier_Status_and_Type.
 *
 * @see UProperty#IDENTIFIER_STATUS
 * @draft ICU 75
 */
public enum IdentifierStatus {
    /** @draft ICU 75 */
    RESTRICTED,
    /** @draft ICU 75 */
    ALLOWED,
}

/**
 * Identifier Type constants.
 * See https://www.unicode.org/reports/tr39/#Identifier_Status_and_Type.
 *
 * @see UProperty#IDENTIFIER_TYPE
 * @draft ICU 75
 */
public enum IdentifierType {
    /** @draft ICU 75 */
    NOT_CHARACTER,
    /** @draft ICU 75 */
    DEPRECATED,
    /** @draft ICU 75 */
    DEFAULT_IGNORABLE,
    /** @draft ICU 75 */
    NOT_NFKC,
    /** @draft ICU 75 */
    NOT_XID,
    /** @draft ICU 75 */
    EXCLUSION,
    /** @draft ICU 75 */
    OBSOLETE,
    /** @draft ICU 75 */
    TECHNICAL,
    /** @draft ICU 75 */
    UNCOMMON_USE,
    /** @draft ICU 75 */
    LIMITED_USE,
    /** @draft ICU 75 */
    INCLUSION,
    /** @draft ICU 75 */
    RECOMMENDED,
}

/**
 * Does the set of Identifier_Type values of code point c contain the given type?
 *
 * <p>Used for UTS #39 General Security Profile for Identifiers
 * (https://www.unicode.org/reports/tr39/#General_Security_Profile).
 *
 * <p>Each code point maps to a <i>set</i> of IdentifierType values.
 *
 * @param c code point
 * @param type Identifier_Type to check
 * @return true if type is in Identifier_Type(c)
 * @draft ICU 75
 */
public static final boolean hasIdentifierType(int c, IdentifierType type)

/**
 * Writes code point c's Identifier_Type as a set of IdentifierType values and
 * returns the number of types.
 * The set is cleared before c's types are added.
 *
 * <p>Used for UTS #39 General Security Profile for Identifiers
 * (https://www.unicode.org/reports/tr39/#General_Security_Profile).
 *
 * <p>Each code point maps to a <i>set</i> of IdentifierType values.
 * There is always at least one type.
 *
 * @param c code point
 * @param types output set
 * @return number of values in c's Identifier_Type
 * @draft ICU 75
 */
public static final int getIdentifierTypes(int c, EnumSet<IdentifierType> types)

Sincerely,
markus

(PS: I know this is last minute for 75...) |
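The difference between the older "int constants plus BitSet" style and the proposed "enum plus EnumSet" style can be sketched with a toy property. Everything below is hypothetical: `ToyType`, the `TOY_*` constants, and both `getToyTypes` overloads are invented for illustration, and the classification rule is made up rather than taken from UTS #39 data. Only the calling convention (clear the output set, fill it, return the count) mirrors the proposed getIdentifierTypes(). Unlike the real property, this toy rule can produce an empty set.

```java
import java.util.BitSet;
import java.util.EnumSet;

public class EnumSetPropertyDemo {
    // Old style: property values are plain int constants.
    static final int TOY_LETTER = 0;
    static final int TOY_DIGIT = 1;
    static final int TOY_ASCII = 2;

    // New style: the same values as a real enum.
    enum ToyType { LETTER, DIGIT, ASCII }

    // Old-style getter: clears and fills a BitSet of int constants.
    static int getToyTypes(int c, BitSet types) {
        types.clear();
        if (Character.isLetter(c)) types.set(TOY_LETTER);
        if (Character.isDigit(c)) types.set(TOY_DIGIT);
        if (c < 0x80) types.set(TOY_ASCII);
        return types.cardinality();
    }

    // New-style getter: clears and fills an EnumSet, returns the count.
    static int getToyTypes(int c, EnumSet<ToyType> types) {
        types.clear();
        if (Character.isLetter(c)) types.add(ToyType.LETTER);
        if (Character.isDigit(c)) types.add(ToyType.DIGIT);
        if (c < 0x80) types.add(ToyType.ASCII);
        return types.size();
    }

    public static void main(String[] args) {
        BitSet bits = new BitSet();
        int oldCount = getToyTypes('a', bits);
        // The BitSet only prints bit indexes; the caller has to map them back.
        System.out.println(oldCount + " " + bits);

        EnumSet<ToyType> types = EnumSet.noneOf(ToyType.class);
        int newCount = getToyTypes('a', types);
        // The EnumSet is type-safe and prints the constant names.
        System.out.println(newCount + " " + types);
    }
}
```

Because the output set is caller-owned and cleared on each call, one EnumSet can be reused across many code points without reallocating.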
From: Elango C. <el...@un...> - 2024-02-07 00:17:35
|
Hi Tim,

Would you be able to allow one more week of time to get comments. I think by the Feb 15 meeting, we should be able to get followup comments in.

-- Elango

On Tue, Feb 6, 2024 at 2:02 PM Tim Chevalier <tj...@ig...> wrote:
> Hello,
>
> I'm making one more request for comments before the TC meeting two days
> from now, as I've received only a small number of comments between the
> previous meeting and this one.
>
> I added a summary of recent changes at the beginning of the design
> document.
> <https://docs.google.com/document/d/1qi5zeYevPshu-Gr8-nn7Rsq-WFOAx_zNxOVJkRz5cWM/edit?pli=1&resourcekey=0-dUYbqvZh5vTEJRPNoZ4ehA>
>
> Thanks,
>
> Tim |
From: Tim C. <tj...@ig...> - 2024-02-06 23:09:16
|
On 2/6/24 15:05, Elango Cheran wrote:
> Hi Tim,
> Would you be able to allow one more week of time to get comments. I
> think by the Feb 15 meeting, we should be able to get followup
> comments in.

Sounds good to me. I'm happy to join the meeting on Feb. 15 instead of Feb. 8.

Thanks,
Tim |
From: Tim C. <tj...@ig...> - 2024-02-06 22:03:05
|
Hello,

I'm making one more request for comments before the TC meeting two days from now, as I've received only a small number of comments between the previous meeting and this one.

I added a summary of recent changes at the beginning of the design document.
<https://docs.google.com/document/d/1qi5zeYevPshu-Gr8-nn7Rsq-WFOAx_zNxOVJkRz5cWM/edit?pli=1&resourcekey=0-dUYbqvZh5vTEJRPNoZ4ehA>

Thanks,
Tim

On 1/26/24 22:32, Tim Chevalier wrote:
> Hello all,
>
> This is another request for comments on the design document
> <https://docs.google.com/document/d/1qi5zeYevPshu-Gr8-nn7Rsq-WFOAx_zNxOVJkRz5cWM/edit?pli=1&resourcekey=0-dUYbqvZh5vTEJRPNoZ4ehA>,
> as it will be revisited in the TC meeting two weeks from now (February 8).
>
> Thanks,
>
> Tim |
From: Tim C. <tj...@ig...> - 2024-01-27 07:02:57
|
Hello all,

This is another request for comments on the design document
<https://docs.google.com/document/d/1qi5zeYevPshu-Gr8-nn7Rsq-WFOAx_zNxOVJkRz5cWM/edit?pli=1&resourcekey=0-dUYbqvZh5vTEJRPNoZ4ehA>,
as it will be revisited in the TC meeting two weeks from now (February 8).

Thanks,
Tim

On 1/18/24 13:57, Tim Chevalier wrote:
> On 12/5/23 20:28, Tim Chevalier wrote:
>> Ticket: https://unicode-org.atlassian.net/browse/ICU-22261
>>
>> Doc: https://docs.google.com/document/d/1qi5zeYevPshu-Gr8-nn7Rsq-WFOAx_zNxOVJkRz5cWM/edit?pli=1&resourcekey=0-dUYbqvZh5vTEJRPNoZ4ehA
>
> Hello all,
>
> I would like to request further comments on the design doc (linked
> above) for the MessageFormat v2 tech preview for C++.
>
> I have updated the document to be consistent with the code in the pull
> request (linked at the beginning) and with feedback received so far.
> It should reflect the current proposed state of the MessageFormat v2
> API. The major open questions (using the STL and sequestering code in
> headers; the representation of formattable values and relatedly, the
> type signatures for custom functions) have been resolved.
>
> As this topic will be revisited in the TC meeting in three weeks
> (February 8), feedback between now and then would be valuable.
>
> Thanks!
>
> Tim |
From: Markus S. <mar...@gm...> - 2024-01-22 19:23:23
|
On Mon, Jan 22, 2024 at 10:13 AM Stefan Kandic <ste...@da...> wrote:
> When I mentioned no extra memory allocations I was referring to strings
> with latin characters only. In those cases we would avoid the copy that
> happens in String’s constructor, plus getting a random character would be
> trivial.

Do you mean ASCII? For non-ASCII Latin characters, the UTF-8 vs. UTF-16 indexes are not 1:1, so you might need some additional storage to efficiently handle that.

> Regarding the API, it would obviously be great to have this directly in
> ICU, and if you decide to go in that direction I’d be happy to help out.

I suggest you submit a Jira feature request ticket.

markus |
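The ASCII point above can be made concrete with a small sketch. `AsciiView` is a hypothetical helper, not an ICU or JDK class: it wraps a byte array without copying and is only valid when every byte is below 0x80, because only then does byte index i correspond to UTF-16 index i. The demo also shows a non-ASCII Latin string whose UTF-8 byte count differs from its UTF-16 length, which is the case where the 1:1 mapping breaks. Whether such a view could be handed to the collator depends on the API question discussed in this thread.

```java
import java.nio.charset.StandardCharsets;

public class AsciiViewDemo {
    public static void main(String[] args) {
        byte[] ascii = "resume".getBytes(StandardCharsets.UTF_8);
        byte[] latin = "r\u00e9sum\u00e9".getBytes(StandardCharsets.UTF_8);
        // ASCII: one byte per UTF-16 code unit, so indexes line up.
        System.out.println(AsciiView.isAscii(ascii) + " " + ascii.length);
        // Non-ASCII Latin: 6 UTF-16 code units, but more UTF-8 bytes.
        System.out.println(AsciiView.isAscii(latin) + " " + latin.length);
        CharSequence view = AsciiView.of(ascii);
        System.out.println(view.charAt(2) + " " + view.length());
    }
}

// Zero-copy CharSequence over ASCII-only bytes (hypothetical helper).
// The caller must not modify the array while the view is in use.
final class AsciiView implements CharSequence {
    private final byte[] bytes;
    private final int start;
    private final int end;

    private AsciiView(byte[] bytes, int start, int end) {
        this.bytes = bytes;
        this.start = start;
        this.end = end;
    }

    static boolean isAscii(byte[] utf8) {
        for (byte b : utf8) {
            if (b < 0) return false; // high bit set: part of a multi-byte sequence
        }
        return true;
    }

    static AsciiView of(byte[] utf8) {
        if (!isAscii(utf8)) throw new IllegalArgumentException("not ASCII");
        return new AsciiView(utf8, 0, utf8.length);
    }

    @Override public int length() {
        return end - start;
    }

    @Override public char charAt(int index) {
        if (index < 0 || index >= length()) throw new IndexOutOfBoundsException("index " + index);
        return (char) bytes[start + index];
    }

    @Override public CharSequence subSequence(int from, int to) {
        if (from < 0 || to > length() || from > to) throw new IndexOutOfBoundsException();
        return new AsciiView(bytes, start + from, start + to); // shares the array
    }

    @Override public String toString() {
        return new String(bytes, start, end - start, StandardCharsets.US_ASCII);
    }
}
```

The up-front `isAscii` scan is linear in the input, so the saving is the avoided allocation and copy, not the avoided pass over the bytes.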
From: Stefan K. <ste...@da...> - 2024-01-22 18:13:45
|
When I mentioned no extra memory allocations I was referring to strings with latin characters only. In those cases we would avoid the copy that happens in String’s constructor, plus getting a random character would be trivial.

Regarding the API, it would obviously be great to have this directly in ICU, and if you decide to go in that direction I’d be happy to help out.

However, we would also be completely fine with just opening up the doCompare method which is now for internal use only, and handling this on the client and not the library side.

Regards,
Stefan

From: Markus Scherer <mar...@gm...>
Date: Monday, 22 January 2024 at 18:31
To: Stefan Kandic <ste...@da...>
Cc: icu...@li... <icu...@li...>, ale...@da... <ale...@da...>, iva...@da... <iva...@da...>
Subject: Re: [icu-design] Collation on UTF8 Strings in Java

On Mon, Jan 22, 2024 at 8:26 AM Stefan Kandic <ste...@da...> wrote:
> We could perform collation directly on the byte arrays without any extra
> memory allocation. AFAIK creating a String in java will always copy the
> given byte array and this way we could avoid that.

I think you would want to cache some amount of context around the last-used charAt() index, which would also likely require some memory allocation beyond the CharSequence subclass itself.

> Do you think it would be possible to make this a part of the public API?

Possible in principle. We would have a discussion in the team, but it shouldn't be a huge amount of code.
Not sure how many users are interested in UTF-8 string handling in the JVM.

markus |
From: Stefan K. <ste...@da...> - 2024-01-22 18:00:58
|
Thanks for the reply Markus.

This idea of using a custom CharSequence could be interesting. Besides buffering the UTF-16 version, this would work very well for strings with only latin characters. We could perform collation directly on the byte arrays without any extra memory allocation. AFAIK creating a String in java will always copy the given byte array and this way we could avoid that.

Do you think it would be possible to make this a part of the public API?

Regards,
Stefan

From: Markus Scherer <mar...@gm...>
Date: Friday, 12 January 2024 at 21:27
To: icu...@li... <icu...@li...>
Cc: Stefan Kandic <ste...@da...>
Subject: Re: [icu-design] Collation on UTF8 Strings in Java

Hi Stefan,

On Thu, Jan 11, 2024 at 10:58 AM Stefan Kandic via icu-design <icu...@li...> wrote:
> The issue is that Spark is written on JVM and uses UTF8 strings (as well
> as parquet for example) which means that it has to create a new UTF16
> string for each collation comparison. This is obviously a big performance
> issue.

It seems odd for anything working on top of the JVM to not work with UTF-16 strings. The whole Java language and library, and by extension Kotlin etc., work with UTF-16...

> My question for you is if it is possible somehow to directly support UTF8
> strings (UTF8 encoded byte arrays) in the java API. I wouldn't have an
> issue of trying to implement it myself if it isn't too complicated as I'm
> not really familiar with the icu codebase.

Possible, yes. Easy, no.
ICU collation tries to read only as far into strings as it needs to, but that is also tangled up with incrementally performing normalization when needed, and that gets tricky.

> Also, I noticed that RuleBasedCollator uses CharSequences and not Strings,
> which means that icu4j is not fixed to java's String implementation and
> could technically support any implementation of CharSequence. Obviously,
> this is not trivial for a variable encoding like UTF8 as we can't have
> charAt(index) operation in constant time.

CharSequence is a UTF-16 "string view" with random access. I suppose you could implement it with something that converts at least up to the requested index and buffers the UTF-16 version up to that point.

If the strings are short, or the differences between strings occur early, then you might not gain much performance compared with just converting up front.

Best regards,
markus |
From: Markus S. <mar...@gm...> - 2024-01-22 17:31:30
|
On Mon, Jan 22, 2024 at 8:26 AM Stefan Kandic <ste...@da...> wrote:
> We could perform collation directly on the byte arrays without any extra
> memory allocation. AFAIK creating a String in java will always copy the
> given byte array and this way we could avoid that.

I think you would want to cache some amount of context around the last-used charAt() index, which would also likely require some memory allocation beyond the CharSequence subclass itself.

> Do you think it would be possible to make this a part of the public API?

Possible in principle. We would have a discussion in the team, but it shouldn't be a huge amount of code.
Not sure how many users are interested in UTF-8 string handling in the JVM.

markus |
From: Tim C. <tj...@ig...> - 2024-01-18 21:58:01
|
On 12/5/23 20:28, Tim Chevalier wrote:
> Ticket: https://unicode-org.atlassian.net/browse/ICU-22261
>
> Doc: https://docs.google.com/document/d/1qi5zeYevPshu-Gr8-nn7Rsq-WFOAx_zNxOVJkRz5cWM/edit?pli=1&resourcekey=0-dUYbqvZh5vTEJRPNoZ4ehA

Hello all,

I would like to request further comments on the design doc (linked above) for the MessageFormat v2 tech preview for C++.

I have updated the document to be consistent with the code in the pull request (linked at the beginning) and with feedback received so far. It should reflect the current proposed state of the MessageFormat v2 API. The major open questions (using the STL and sequestering code in headers; the representation of formattable values and relatedly, the type signatures for custom functions) have been resolved.

As this topic will be revisited in the TC meeting in three weeks (February 8), feedback between now and then would be valuable.

Thanks!

Tim |
From: Markus S. <mar...@gm...> - 2024-01-12 20:27:24
|
Hi Stefan,

On Thu, Jan 11, 2024 at 10:58 AM Stefan Kandic via icu-design <icu...@li...> wrote:
> The issue is that Spark is written on JVM and uses UTF8 strings (as well
> as parquet for example) which means that it has to create a new UTF16
> string for each collation comparison. This is obviously a big performance
> issue.

It seems odd for anything working on top of the JVM to not work with UTF-16 strings. The whole Java language and library, and by extension Kotlin etc., work with UTF-16...

> My question for you is if it is possible somehow to directly support UTF8
> strings (UTF8 encoded byte arrays) in the java API. I wouldn't have an
> issue of trying to implement it myself if it isn't too complicated as I'm
> not really familiar with the icu codebase.

Possible, yes. Easy, no.
ICU collation tries to read only as far into strings as it needs to, but that is also tangled up with incrementally performing normalization when needed, and that gets tricky.

> Also, I noticed that RuleBasedCollator uses CharSequences and not Strings,
> which means that icu4j is not fixed to java's String implementation and
> could technically support any implementation of CharSequence. Obviously,
> this is not trivial for a variable encoding like UTF8 as we can't have
> charAt(index) operation in constant time.

CharSequence is a UTF-16 "string view" with random access. I suppose you could implement it with something that converts at least up to the requested index and buffers the UTF-16 version up to that point.

If the strings are short, or the differences between strings occur early, then you might not gain much performance compared with just converting up front.

Best regards,
markus |
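The "convert up to the requested index and buffer the UTF-16 version" idea can be sketched as follows. `LazyUtf8` is a hypothetical class written for illustration, not part of ICU4J. It assumes the input is well-formed UTF-8 and does no validation, and `decodedChars()` exists only so the laziness is observable. Note that `length()` still needs one pass over all bytes, although it only counts lead bytes and allocates nothing.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class LazyUtf8Demo {
    public static void main(String[] args) {
        // "naive" with a diaeresis, a space, and one supplementary code point.
        byte[] utf8 = "na\u00efve \ud83d\ude00".getBytes(StandardCharsets.UTF_8);
        LazyUtf8 s = new LazyUtf8(utf8);
        System.out.println(s.charAt(2) == '\u00ef'); // decodes only the first three code points
        System.out.println(s.decodedChars());
        System.out.println(s.length());
    }
}

// CharSequence over UTF-8 bytes that decodes to UTF-16 on demand (hypothetical).
// Assumption: the bytes are well-formed UTF-8; malformed input is not handled.
final class LazyUtf8 implements CharSequence {
    private final byte[] utf8;
    private char[] buf = new char[16]; // UTF-16 code units decoded so far
    private int bufLen = 0;            // number of valid chars in buf
    private int bytePos = 0;           // next undecoded byte
    private int length = -1;           // UTF-16 length, computed on first use

    LazyUtf8(byte[] utf8) {
        this.utf8 = utf8;
    }

    int decodedChars() {
        return bufLen;
    }

    // Decodes one code point starting at bytePos and appends it to buf.
    private void decodeNext() {
        int b0 = utf8[bytePos++] & 0xFF;
        int cp;
        if (b0 < 0x80) {
            cp = b0;
        } else if (b0 < 0xE0) {
            cp = ((b0 & 0x1F) << 6) | (utf8[bytePos++] & 0x3F);
        } else if (b0 < 0xF0) {
            cp = ((b0 & 0x0F) << 12) | ((utf8[bytePos++] & 0x3F) << 6) | (utf8[bytePos++] & 0x3F);
        } else {
            cp = ((b0 & 0x07) << 18) | ((utf8[bytePos++] & 0x3F) << 12)
                    | ((utf8[bytePos++] & 0x3F) << 6) | (utf8[bytePos++] & 0x3F);
        }
        if (bufLen + 2 > buf.length) {
            buf = Arrays.copyOf(buf, buf.length * 2); // room for a surrogate pair
        }
        bufLen += Character.toChars(cp, buf, bufLen);
    }

    @Override public char charAt(int index) {
        if (index < 0) throw new IndexOutOfBoundsException("index " + index);
        while (bufLen <= index) {
            if (bytePos >= utf8.length) throw new IndexOutOfBoundsException("index " + index);
            decodeNext();
        }
        return buf[index];
    }

    @Override public int length() {
        if (length < 0) {
            int n = 0;
            for (byte b : utf8) {
                int u = b & 0xFF;
                // Count lead bytes only; a 4-byte sequence becomes a surrogate pair.
                if ((u & 0xC0) != 0x80) n += (u >= 0xF0) ? 2 : 1;
            }
            length = n;
        }
        return length;
    }

    @Override public CharSequence subSequence(int start, int end) {
        return toString().subSequence(start, end); // simple fallback; this allocates
    }

    @Override public String toString() {
        return new String(utf8, StandardCharsets.UTF_8);
    }
}
```

As the message above points out, this only pays off when comparisons stop early on long strings; for short strings the buffer ends up holding the whole UTF-16 form anyway.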
From: Stefan K. <ste...@da...> - 2024-01-11 18:57:51
|
Hi folks, My name is Stefan and I work at Databricks where we're thinking of adding collation support for Spark. The issue is that Spark is written on JVM and uses UTF8 strings (as well as parquet for example) which means that it has to create a new UTF16 string for each collation comparison. This is obviously a big performance issue. I read in the docs that ICU doesn't support UTF8 for java and only for native c/c++; so I tried using a JNI to call the native API directly. However, probably due to JNI's overhead this was even slower than just using the java API. My question for you is if it is possible somehow to directly support UTF8 strings (UTF8 encoded byte arrays) in the java API. I wouldn't have an issue of trying to implement it myself if it isn't too complicated as I'm not really familiar with the icu codebase. Also, I noticed that RuleBasedCollator uses CharSequences and not Strings, which means that icu4j is not fixed to java's String implementation and could technically support any implementation of CharSequence. Obviously, this is not trivial for a variable encoding like UTF8 as we can't have charAt(index) operation in constant time. I'm far from a collation expert so please forgive my ignorance. Any feedback would be greatly appreciated :) Regards, Stefan |
From: Fredrik R. <ro...@go...> - 2023-12-11 20:12:18
|
On Mon, Dec 11, 2023 at 9:09 PM Markus Scherer <mar...@gm...> wrote: > In 2024, starting with ICU 75, we are now planning to require C++17 and C11, I've now sent this PR for review to update to C++17: https://github.com/unicode-org/icu/pull/2737 -- Fredrik Roubert ro...@go... |
From: Markus S. <mar...@gm...> - 2023-12-11 20:09:16
|
Dear ICU users, Thank you for the discussion nearly a year ago. During 2023, we stuck with C++11 and C99. In 2024, starting with ICU 75, we are now planning to require C++17 and C11, and to take advantage of some of the new features. Best regards, Markus Scherer On Thu, Jan 12, 2023 at 11:23 AM Markus Scherer <mar...@gm...> wrote: > Dear ICU users, > > We are considering moving ICU from requiring just C++11 to requiring C++17. > Please let us know if this would be a problem for you (and for how long). > > C++17 added some useful features. Among the most relevant for ICU is the > UTF-16 version of std::string_view (similar to our StringPiece but for > char16_t), which would be more reliable for aliasing strings than wrapping > them into a UnicodeString. > > https://en.wikipedia.org/wiki/C%2B%2B14 > https://en.wikipedia.org/wiki/C%2B%2B17 > > https://github.com/AnthonyCalandra/modern-cpp-features#c17-language-features > > FYI: > We upgraded to C++11 in 2017-04 <https://icu.unicode.org/download/59>, > some 6 years after the standard was released. > If we now upgrade to C++17, it will be another 6 years after that version. > > Moving to C++17 should be a smaller jump than moving to C++11 was. C++11 > was a huge change, and it looks like compilers were either never updated to > fully support C++11, or they did get updated and are keeping up with modern > versions of C++. > > Best regards, > Markus Scherer > |
From: <co...@co...> - 2023-12-11 01:43:05
|
👍 On Thursday, December 07, 2023 12:57 PM PST, Rich Gillam via icu-design <icu...@li...> wrote: > 👍 > > > On Dec 7, 2023, at 11:31 AM, Markus Scherer <mar...@gm...> wrote: > > > > This is new API for: ICU 75 > > > > Ticket: https://unicode-org.atlassian.net/browse/ICU-22571 > > > > We are going to add API constants for the new-to-ICU script code "Aran". It is one of several ISO 15924 script codes <https://www.unicode.org/iso15924/iso15924-codes.html> that are aliases or variants of other scripts. (Other examples: Latf=Latin in Fraktur; Syrn=Syriac, eastern variant) > > > > We are treating this addition like API constants for stable Unicode property values: The new API constants will be "born @stable" (they are not @draft first for a while) so that they are immediately usable without problems. > > > > For "normal" script codes for encoded scripts, we wait until Unicode encodes the scripts, so that the names of the scripts are settled and we can create a stable, mnemonic API constant. > > > > Alias/variant script codes do not correspond to their own scripts, so we can add them as needed. > > We found that currently "Aran" is the only alias/variant script code that was missing from ICU. > > > > New constants: > > > > C/C++ > > > > unicode/uscript.h > > > > typedef enum UScriptCode { > > /** @stable ICU 75 */ > > USCRIPT_ARABIC_NASTALIQ = 200, /* Aran */ > > > > Java > > > > public final class UScript { > > /** @stable ICU 75 */ > > public static final int ARABIC_NASTALIQ = 200; /* Aran */ > > > > Sincerely, > > markus > > _______________________________________________ > > icu-design mailing list > > icu...@li... > > To Un/Subscribe: https://lists.sourceforge.net/lists/listinfo/icu-design > |
From: Rich G. <ric...@ap...> - 2023-12-07 21:58:09
|
👍 > On Dec 7, 2023, at 11:31 AM, Markus Scherer <mar...@gm...> wrote: > > This is new API for: ICU 75 > > Ticket: https://unicode-org.atlassian.net/browse/ICU-22571 > > We are going to add API constants for the new-to-ICU script code "Aran". It is one of several ISO 15924 script codes <https://www.unicode.org/iso15924/iso15924-codes.html> that are aliases or variants of other scripts. (Other examples: Latf=Latin in Fraktur; Syrn=Syriac, eastern variant) > > We are treating this addition like API constants for stable Unicode property values: The new API constants will be "born @stable" (they are not @draft first for a while) so that they are immediately usable without problems. > > For "normal" script codes for encoded scripts, we wait until Unicode encodes the scripts, so that the names of the scripts are settled and we can create a stable, mnemonic API constant. > > Alias/variant script codes do not correspond to their own scripts, so we can add them as needed. > We found that currently "Aran" is the only alias/variant script code that was missing from ICU. > > New constants: > > C/C++ > > unicode/uscript.h > > typedef enum UScriptCode { > /** @stable ICU 75 */ > USCRIPT_ARABIC_NASTALIQ = 200, /* Aran */ > > Java > > public final class UScript { > /** @stable ICU 75 */ > public static final int ARABIC_NASTALIQ = 200; /* Aran */ > > Sincerely, > markus > _______________________________________________ > icu-design mailing list > icu...@li... > To Un/Subscribe: https://lists.sourceforge.net/lists/listinfo/icu-design |
From: Markus S. <mar...@gm...> - 2023-12-07 19:32:27
|
This is new API for: ICU 75 Ticket: https://unicode-org.atlassian.net/browse/ICU-22571 We are going to add API constants for the new-to-ICU script code "Aran". It is one of several ISO 15924 script codes <https://www.unicode.org/iso15924/iso15924-codes.html> that are aliases or variants of other scripts. (Other examples: Latf=Latin in Fraktur; Syrn=Syriac, eastern variant) We are treating this addition like API constants for stable Unicode property values: The new API constants will be "born @stable" (they are not @draft first for a while) so that they are immediately usable without problems. For "normal" script codes for encoded scripts, we wait until Unicode encodes the scripts, so that the names of the scripts are settled and we can create a stable, mnemonic API constant. Alias/variant script codes do not correspond to their own scripts, so we can add them as needed. We found that currently "Aran" is the only alias/variant script code that was missing from ICU. New constants: *C/C++* *unicode/uscript.h* typedef enum UScriptCode { /** @stable ICU 75 */ *USCRIPT_ARABIC_NASTALIQ* = 200, /* Aran */ *Java* public final class *UScript* { /** @stable ICU 75 */ public static final int *ARABIC_NASTALIQ* = 200; /* Aran */ Sincerely, markus |
From: Tim C. <tj...@ig...> - 2023-12-06 04:53:22
|
Dear ICU team & users, I'd like to request additional feedback on the design doc proposing a MessageFormat v2 tech preview for C++ in ICU4C. I'm also hoping that this could be added to the agenda for the TC meeting this Thursday, December 7. I'm sorry for the short notice, but I had thought that I was already on the agenda, and once I talked with Elango, realized that I wasn't. I have addressed all comments or will be doing so shortly. I have made minor changes to the proposed API in response to feedback. I have also added a summary of the open technical questions that need to be resolved based on previous comments. Thanks to Mihai, Markus, George, Eemeli, Elango, and Richard for their previous comments. Ticket: https://unicode-org.atlassian.net/browse/ICU-22261 Doc: https://docs.google.com/document/d/1qi5zeYevPshu-Gr8-nn7Rsq-WFOAx_zNxOVJkRz5cWM/edit?usp=sharing&resourcekey=0-dUYbqvZh5vTEJRPNoZ4ehA <https://docs.google.com/document/d/1qi5zeYevPshu-Gr8-nn7Rsq-WFOAx_zNxOVJkRz5cWM/edit?usp=sharing&resourcekey=0-dUYbqvZh5vTEJRPNoZ4ehA> Previous email: https://sourceforge.net/p/icu/mailman/message/37888758/ Thanks, Tim Chevalier |
From: David M. <dm...@mi...> - 2023-11-09 18:03:08
|
Hello all, We would appreciate if folks have time to take another look at the collation folding API proposal in the coming week. We are planning on discussing this API in detail in next week's (2023-11-16) TC meeting; please review and offer comments in advance, thank you! Daniel and David -----Original Message----- From: David Matson Sent: Monday, 25 September 2023 15:59 To: Rich Gillam <ric...@ap...> Cc: icu-design <icu...@li...> Subject: RE: [EXTERNAL] Re: [icu-design] ICU4C API proposal: Add C APIs to support Collation Folding Sure; done. Thanks, David From: Rich Gillam <ric...@ap...> Sent: Monday, 25 September 2023 14:21 To: David Matson <dm...@mi...> Cc: icu-design <icu...@li...> Subject: [EXTERNAL] Re: [icu-design] ICU4C API proposal: Add C APIs to support Collation Folding Would you mind also posting this to https://docs.google.com/document/d/1FyenmKoyi45wp8bfBF4B4thuxiYmQb_R6i8__94IsfM/edit#heading=h.9xjrp1cltuqc ? It’s a lot easier to comment on in Google Docs. —Rich On Sep 22, 2023, at 3:42 PM, David Matson via icu-design <mailto:icu...@li...> wrote: (Daniel originally sent the message below on 12 September 2023, but it appears that it did not go through. Attempting to send again, this time in plain text format.) Dear ICU team & users, We would like to propose new C APIs for Collation Folding for ICU 75. Please provide your feedback by Thursday, September 21. As discussed in the ICU-TC meeting on 2023-Aug-31, we would like to work on implementing collation folding APIs: https://unicode-org.atlassian.net/browse/ICU-22422 As the next step, we'd like to do a design review on the API specifics. The following is a proposal for a draft C API for ICU4C: In the (new header) file: icu/source/i18n/unicode/ucolf.h /** * Constructs a UCollationFolding instance for the given locale and strength. * * @param locale The locale to use for collation rules. Only locales that have data * for string search are supported. 
An implicit "-u-co-search" is * appended to the locale name before looking up the data. * * Special values for locales can be passed in - if NULL is passed * for the locale, the default locale will be used. If an empty * string ("") or "root" is passed, the root collator will be used. * * @param strength The collation strength; only UCOL_PRIMARY, UCOL_SECONDARY, * UCOL_TERTIARY, and UCOL_DEFAULT are supported. UCOL_DEFAULT * means UCOL_PRIMARY. Other collation strengths are not supported. * * @param status The error code, set if an error occurred while creating the * UCollationFolding instance. The values U_USING_FALLBACK_WARNING and * U_USING_DEFAULT_WARNING are reported to indicate whether specialized * string search data for the given input locale (-u-co-search) exists. * * @return The newly created UCollationFolding instance. * @see UCollationStrength * @draft ICU 75 */ U_CAPI UCollationFolding* U_EXPORT2 ucolf_open(const char* locale, UCollationStrength strength, UErrorCode* status); /** * Close a UCollationFolding instance, releasing the memory used. * Once closed, it should not be used. * * @param ucolf The UCollationFolding instance to close. * @draft ICU 75 */ U_CAPI void U_EXPORT2 ucolf_close(UCollationFolding* ucolf); /** * Return the equivalence class string that the source string folds to, for the * input UCollationFolding instance. * * @param ucolf The UCollationFolding instance to use. * @param source The source string. * @param sourceLength The length of the source string, or -1 if NULL-terminated. * @param destination A pointer to a buffer to receive the NULL-terminated output. If * the output fits into destination but cannot be NULL-terminated * (length == destinationCapacity) then the error code is set to * U_STRING_NOT_TERMINATED_WARNING. If the output doesn't fit into * destination then the error code is set to U_BUFFER_OVERFLOW_ERROR. * @param destinationCapacity The maximum size of the destination buffer. 
* @return The actual buffer size needed for the destination string. If greater * than destinationCapacity, the returned destination string will be * truncated and an error code will be returned. * @draft ICU 75 */ U_CAPI int32_t U_EXPORT2 ucolf_fold(const UCollationFolding* ucolf, const UChar* source, int32_t sourceLength, UChar* destination, int32_t destinationCapacity, UErrorCode* status); #if U_SHOW_CPLUSPLUS_API U_NAMESPACE_BEGIN /** * \class LocalUCollationFoldingPointer * "Smart pointer" class, closes a UCollationFolding via ucolf_close(). * For most methods see the LocalPointerBase base class. * * @see LocalPointerBase * @see LocalPointer * @draft ICU 75 */ U_DEFINE_LOCAL_OPEN_POINTER(LocalUCollationFoldingPointer, UCollationFolding, ucolf_close); U_NAMESPACE_END #endif Some notes: 1. This proposal uses "ucolf" as a prefix for the "Unicode collation folding" family of functions. 2. As discussed in the TC meeting, building the map at runtime is cost-prohibitive, so the API to create a UCollationFolding object takes the locale as input rather than a UCollator*. As a result, the data tables needed to do collation folding can be built ahead of time and included with ICU (assuming the size is reasonable, which seems likely given our initial investigation). 3. As also discussed in the TC meeting, we wanted to support at least primary and secondary strengths (supporting identical would be useless/a no-op). Based on investigating implementation, we believe we can also support tertiary without much extra work, but not quaternary, so the proposal is to support strengths of primary, secondary, and tertiary and report an error if any other strength is requested. 4. We propose not to support tailorings at this time. 5. As discussed in the TC meeting, we only intend to support -u-co-search locales, since they are the right match for this API. 
We propose that callers specify the locale string without this suffix for simplicity (for example, just passing a locale string such as "en_US" or "de_DE", which would map based on the corresponding -u-co-search data following normal locale search rules; in these cases, und-u-co-search and de-u-co-search, respectively). The status (including U_USING_FALLBACK_WARNING and U_USING_DEFAULT_WARNING) would be reported based on whether specialized data for <input locale>-u-co-search exists.

Pseudo-code for how this API would work (without checking for errors):

    UErrorCode ignore = U_ZERO_ERROR;
    UCollationFolding* folding = ucolf_open("de_DE", UCOL_PRIMARY, &ignore);
    UChar buffer[100];
    int32_t resultSize = ucolf_fold(folding, u"Käse", -1, buffer, 100, &ignore);
    // Note: austrdup used to convert UChar* to char*
    printf("%s\n", austrdup(buffer));
    ucolf_close(folding);

Prints: kaese

Note: "käse" could be an alternative output here, in the same "equivalence class," but doing so likely means that "oboe" would map to "obö" which might be worse. We'll leave determining exactly what the target mapping should be for a separate discussion from the API shape design review here.

With loc of und/root, primary strength, and input of "Käse", prints: kase
With loc of "en_US" (or root), primary strength, and input of "Résumé", prints: resume
With loc of "fr_FR" (or root), secondary strength, and input of "Résumé", prints: résumé

Please let us know what you think. Thanks, David, Daniel, and Jeff |
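To make the idea of folding to an equivalence-class representative concrete, here is a rough stand-in in plain Java. It is not the proposed ucolf API and does not use ICU collation data: it only approximates root-locale behaviour by decomposing, optionally stripping combining marks, and lowercasing. The class and method names (`RoughFolding`, `foldApprox`) are hypothetical, and locale-specific results such as "kaese" for de_DE are out of its reach.

```java
import java.text.Normalizer;
import java.util.Locale;

/** Hypothetical illustration only: a crude approximation of folding, not ICU collation folding. */
final class RoughFolding {
    /**
     * @param s           input string
     * @param keepAccents true roughly mimics secondary strength (accents kept, case ignored);
     *                    false roughly mimics primary strength (accents and case ignored)
     */
    static String foldApprox(String s, boolean keepAccents) {
        // Decompose so that accents become separate combining marks.
        String nfd = Normalizer.normalize(s, Normalizer.Form.NFD);
        if (!keepAccents) {
            // Drop nonspacing marks (general category Mn), e.g. U+0301 and U+0308.
            nfd = nfd.replaceAll("\\p{Mn}+", "");
        }
        // Lowercase without locale-specific rules, then recompose.
        return Normalizer.normalize(nfd.toLowerCase(Locale.ROOT), Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        System.out.println(foldApprox("R\u00E9sum\u00E9", false));
        System.out.println(foldApprox("K\u00E4se", false));
        System.out.println(foldApprox("R\u00E9sum\u00E9", true));
    }
}
```

With these inputs the sketch yields "resume", "kase", and "résumé", which matches the root-locale examples in the proposal. Real collation folding would derive the mapping from collation elements instead of general categories, so the two will diverge for many characters.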
From: Frank T. (譚. <ft...@go...> - 2023-11-01 23:07:43
|
Here is the right diff

icu4j/main/core/src/main/java/com/ibm/icu/text/RuleBasedBreakIterator.java
@@ -1081,6 +1081,18 @@ private static int CISetIndex32(CharacterIterator ci, int index) {
         return ci.getIndex();
     }

+    /**
+     * Register a new external break engine. The external break engine will be adopted.
+     * Because ICU may choose to cache break engines internally, this must
+     * be called at application startup, prior to any calls to
+     * object methods of RuleBasedBreakIterator to avoid undefined behavior.
+     * @param engine the ExternalBreakEngine instance to be adopted
+     * @internal ICU 75 technology preview
+     */
+    public static void registerExternalBreakEngine(ExternalBreakEngine engine) {
+    }
+
     /** DictionaryCache stores the boundaries obtained from a run of dictionary characters.
      * Dictionary boundaries are moved first to this cache, then from here
      * to the main BreakCache, where they may inter-leave with non-dictionary
@@ -1885,7 +1897,43 @@ void dumpCache() {
     };


+    interface ExternalBreakEngine {
+        /**
+         * <p>Indicate whether this engine handles a particular character when
+         * the RuleBasedBreakIterator is used for a particular locale. This method is used
+         * by the RuleBasedBreakIterator to find a break engine.</p>
+         * @param c A character which begins a run that the engine might handle.
+         * @param locale The locale.
+         * @return true if this engine handles the particular character for that locale.
+         * @internal ICU 75 technology preview
+         */
+        boolean isFor(char c, String locale);
+        /**
+         * <p>Indicate whether this engine handles a particular character. This method is
+         * used by the RuleBasedBreakIterator after it has already found a break engine to see which
+         * characters after the first one can be handled by this break engine.</p>
+         * @param c A character that the engine might handle.
+         * @return true if this engine handles the particular character.
+         * @internal ICU 75 technology preview
+         */
+        boolean handles(char c);
+
+        /**
+         * <p>Divide up a range of text handled by this break engine.</p>
+         *
+         * @param text A CharacterIterator representing the text
+         * @param rangeStart The start of the range of known characters
+         * @param rangeEnd The end of the range of known characters
+         * @param foundBreaks Output list of Integer break positions.
+         * @return The number of breaks found
+         * @internal ICU 75 technology preview
+         */
+        int fillBreaks(CharacterIterator text,
+                       int rangeStart,
+                       int rangeEnd,
+                       List<Integer> foundBreaks);
+    }
 }

On Wed, 1 Nov 2023 at 15:59, Frank Tang (譚永鋒) <ft...@go...> wrote: > sorry I attached the wrong diff, will update the real one soon. > > On Wed, 1 Nov 2023 at 15:55, Frank Tang (譚永鋒) <ft...@go...> wrote: > >> Dear ICU team & users, >> >> I would like to propose the following for: ICU 75 >> Please provide feedback by: Next Wednesday, Nov 8, or any time >> sufficiently in advance of the feature freeze >> Designated API review: Rich Gillam >> Issue: https://unicode-org.atlassian.net/browse/ICU-22564 >> >> During the ICU74 development cycle, I proposed and implemented the " >> ICU API Proposal to register an External Break Engine" for C++ under >> ICU-22342 and landed into https://github.com/unicode-org/icu/pull/2418 >> >> This is the Java counterpart for that under a new ticket ICU-22564 >> >> icu4j/main/core/src/main/java/com/ibm/icu/text/RuleBasedBreakIterator.java >> @@ -1081,6 +1081,18 @@ private static int CISetIndex32(CharacterIterator >> ci, int index) { >> return ci.getIndex(); >> } >> >> + /** >> + * Register a new external break engine. The external break engine >> will be adopted. >> + * Because ICU may choose to cache break engine internally, this must >> + * be called at application startup, prior to any calls to >> + * object methods of RuleBasedBreakIterator to avoid undefined >> behavior.
>> + * @param toAdopt the ExternalBreakEngine instance to be adopted >> + * @internal ICU 75 technology preview >> + */ >> + public static void registerExternalBreakEngine(ExternalBreakEngine >> engine) { >> + } >> + >> /** DictionaryCache stores the boundaries obtained from a run of >> dictionary characters. >> * Dictionary boundaries are moved first to this >> cache, then from here >> * to the main BreakCache, where they may >> inter-leave with non-dictionary >> @@ -1885,7 +1897,46 @@ void dumpCache() { >> }; >> >> >> + interface ExternalBreakEngine { >> + /** >> + * <p>Indicate whether this engine handles a particular >> character when >> + * the RuleBasedBreakIterator is used for a particular locale. >> This method is used >> + * by the RuleBasedBreakIterator to find a break engine.</p> >> + * @param c A character which begins a run that the engine might >> handle. >> + * @param locale The locale. >> + * @return true if this engine handles the particular character >> for that locale. >> + * @internal ICU 75 technology preview >> + */ >> + boolean isFor(char c, String locale); >> >> + /** >> + * <p>Indicate whether this engine handles a particular >> character.This method is >> + * used by the RuleBasedBreakIterator after it already find a >> break engine to see which >> + * characters after the first one can be handled by this break >> engine.</p> >> + * @param c A character that the engine might handle. >> + * @return true if this engine handles the particular character. 
>> + * @internal ICU 75 technology preview >> + */ >> + boolean handles(char c); >> + >> + /** >> + * <p>Divide up a range of text handled by this break engine.</p> >> + * >> + * @param text A UText representing the text >> + * @param start The start of the range of known characters >> + * @param end The end of the range of known characters >> + * @param foundBreaks Output of C array of int32_t break >> positions, or >> + * nullptr >> + * @param foundBreaksCapacity The capacity of foundBreaks >> + * @param status Information on any errors encountered. >> + * @return The number of breaks found >> + * @internal ICU 75 technology preview >> + */ >> + int fillBreaks(CharacterIterator text, >> + int rangeStart, >> + int rangeEnd, >> + List<Integer> foundBreaks); >> + } >> >> } >> >> >> >> >> -- >> Frank Yung-Fong Tang >> 譚永鋒 / 🌭🍊 >> Sr. Software Engineer >> > > > -- > Frank Yung-Fong Tang > 譚永鋒 / 🌭🍊 > Sr. Software Engineer > -- Frank Yung-Fong Tang 譚永鋒 / 🌭🍊 Sr. Software Engineer |
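As a rough illustration of how the corrected interface might be implemented, the sketch below declares a local stand-in with the same three method signatures and a toy engine on top of it. Everything beyond those signatures is an assumption made for the example: the `CamelCaseEngine` class, its rule of breaking before each uppercase ASCII letter, the choice to report only interior break positions, and restoring the iterator index before returning. The proposed `registerExternalBreakEngine` call is left out because it is an ICU 75 technology preview that may change.

```java
import java.text.CharacterIterator;
import java.text.StringCharacterIterator;
import java.util.ArrayList;
import java.util.List;

/** Local stand-in that mirrors the proposed interface; this is not the ICU4J type. */
interface ExternalBreakEngine {
    boolean isFor(char c, String locale);
    boolean handles(char c);
    int fillBreaks(CharacterIterator text, int rangeStart, int rangeEnd, List<Integer> foundBreaks);
}

/** Hypothetical toy engine: claims ASCII letters and breaks before each uppercase letter. */
final class CamelCaseEngine implements ExternalBreakEngine {
    @Override
    public boolean isFor(char c, String locale) {
        return handles(c);                       // this toy engine ignores the locale
    }

    @Override
    public boolean handles(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    @Override
    public int fillBreaks(CharacterIterator text, int rangeStart, int rangeEnd,
                          List<Integer> foundBreaks) {
        int saved = text.getIndex();             // assumption: leave the iterator where it was
        int found = 0;
        char c = text.setIndex(rangeStart);
        for (int pos = rangeStart; pos < rangeEnd; pos++, c = text.next()) {
            // Assumption: only interior positions are reported, never rangeStart itself.
            if (pos > rangeStart && Character.isUpperCase(c)) {
                foundBreaks.add(pos);
                found++;
            }
        }
        text.setIndex(saved);
        return found;
    }

    public static void main(String[] args) {
        List<Integer> breaks = new ArrayList<>();
        CharacterIterator it = new StringCharacterIterator("parseHttpRequest");
        int n = new CamelCaseEngine().fillBreaks(it, 0, 16, breaks);
        System.out.println(n + " breaks at " + breaks);
    }
}
```

Passing a `List<Integer>` for the output, as the corrected diff does, avoids the capacity and status parameters that the C++ counterpart needs.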
From: Frank T. (譚. <ft...@go...> - 2023-11-01 23:04:35
|
Sorry, I attached the wrong diff; I will post the correct one soon. On Wed, 1 Nov 2023 at 15:55, Frank Tang (譚永鋒) <ft...@go...> wrote: > Dear ICU team & users, > > I would like to propose the following for: ICU 75 > Please provide feedback by: Next Wednesday, Nov 8, or any time > sufficiently in advance of the feature freeze > Designated API review: Rich Gillam > Issue: https://unicode-org.atlassian.net/browse/ICU-22564 > > During the ICU74 development cycle, I proposed and implemented the " > ICU API Proposal to register an External Break Engine" for C++ under > ICU-22342 and landed into https://github.com/unicode-org/icu/pull/2418 > > This is the Java counterpart for that under a new ticket ICU-22564 > > icu4j/main/core/src/main/java/com/ibm/icu/text/RuleBasedBreakIterator.java > @@ -1081,6 +1081,18 @@ private static int CISetIndex32(CharacterIterator > ci, int index) { > return ci.getIndex(); > } > > + /** > + * Register a new external break engine. The external break engine > will be adopted. > + * Because ICU may choose to cache break engine internally, this must > + * be called at application startup, prior to any calls to > + * object methods of RuleBasedBreakIterator to avoid undefined > behavior. > + * @param toAdopt the ExternalBreakEngine instance to be adopted > + * @internal ICU 75 technology preview > + */ > + public static void registerExternalBreakEngine(ExternalBreakEngine > engine) { > + } > + > /** DictionaryCache stores the boundaries obtained from a run of > dictionary characters. > * Dictionary boundaries are moved first to this > cache, then from here > * to the main BreakCache, where they may inter-leave > with non-dictionary > @@ -1885,7 +1897,46 @@ void dumpCache() { > }; > > > + interface ExternalBreakEngine { > + /** > + * <p>Indicate whether this engine handles a particular character > when > + * the RuleBasedBreakIterator is used for a particular locale. 
> This method is used > + * by the RuleBasedBreakIterator to find a break engine.</p> > + * @param c A character which begins a run that the engine might > handle. > + * @param locale The locale. > + * @return true if this engine handles the particular character > for that locale. > + * @internal ICU 75 technology preview > + */ > + boolean isFor(char c, String locale); > > + /** > + * <p>Indicate whether this engine handles a particular > character.This method is > + * used by the RuleBasedBreakIterator after it already find a > break engine to see which > + * characters after the first one can be handled by this break > engine.</p> > + * @param c A character that the engine might handle. > + * @return true if this engine handles the particular character. > + * @internal ICU 75 technology preview > + */ > + boolean handles(char c); > + > + /** > + * <p>Divide up a range of text handled by this break engine.</p> > + * > + * @param text A UText representing the text > + * @param start The start of the range of known characters > + * @param end The end of the range of known characters > + * @param foundBreaks Output of C array of int32_t break > positions, or > + * nullptr > + * @param foundBreaksCapacity The capacity of foundBreaks > + * @param status Information on any errors encountered. > + * @return The number of breaks found > + * @internal ICU 75 technology preview > + */ > + int fillBreaks(CharacterIterator text, > + int rangeStart, > + int rangeEnd, > + List<Integer> foundBreaks); > + } > > } > > > > > -- > Frank Yung-Fong Tang > 譚永鋒 / 🌭🍊 > Sr. Software Engineer > -- Frank Yung-Fong Tang 譚永鋒 / 🌭🍊 Sr. Software Engineer |
From: Frank T. (譚. <ft...@go...> - 2023-11-01 23:01:44
|
Dear ICU team & users, I would like to propose the following for: ICU 75 Please provide feedback by: Next Wednesday, Nov 8, or any time sufficiently in advance of the feature freeze Designated API review: Rich Gillam Issue: https://unicode-org.atlassian.net/browse/ICU-22564 During the ICU74 development cycle, I proposed and implemented the " ICU API Proposal to register an External Break Engine" for C++ under ICU-22342 and landed into https://github.com/unicode-org/icu/pull/2418 This is the Java counterpart for that under a new ticket ICU-22564 icu4j/main/core/src/main/java/com/ibm/icu/text/RuleBasedBreakIterator.java @@ -1081,6 +1081,18 @@ private static int CISetIndex32(CharacterIterator ci, int index) { return ci.getIndex(); } + /** + * Register a new external break engine. The external break engine will be adopted. + * Because ICU may choose to cache break engine internally, this must + * be called at application startup, prior to any calls to + * object methods of RuleBasedBreakIterator to avoid undefined behavior. + * @param toAdopt the ExternalBreakEngine instance to be adopted + * @internal ICU 75 technology preview + */ + public static void registerExternalBreakEngine(ExternalBreakEngine engine) { + } + /** DictionaryCache stores the boundaries obtained from a run of dictionary characters. * Dictionary boundaries are moved first to this cache, then from here * to the main BreakCache, where they may inter-leave with non-dictionary @@ -1885,7 +1897,46 @@ void dumpCache() { }; + interface ExternalBreakEngine { + /** + * <p>Indicate whether this engine handles a particular character when + * the RuleBasedBreakIterator is used for a particular locale. This method is used + * by the RuleBasedBreakIterator to find a break engine.</p> + * @param c A character which begins a run that the engine might handle. + * @param locale The locale. 
+ * @return true if this engine handles the particular character for that locale. + * @internal ICU 75 technology preview + */ + boolean isFor(char c, String locale); + /** + * <p>Indicate whether this engine handles a particular character.This method is + * used by the RuleBasedBreakIterator after it already find a break engine to see which + * characters after the first one can be handled by this break engine.</p> + * @param c A character that the engine might handle. + * @return true if this engine handles the particular character. + * @internal ICU 75 technology preview + */ + boolean handles(char c); + + /** + * <p>Divide up a range of text handled by this break engine.</p> + * + * @param text A UText representing the text + * @param start The start of the range of known characters + * @param end The end of the range of known characters + * @param foundBreaks Output of C array of int32_t break positions, or + * nullptr + * @param foundBreaksCapacity The capacity of foundBreaks + * @param status Information on any errors encountered. + * @return The number of breaks found + * @internal ICU 75 technology preview + */ + int fillBreaks(CharacterIterator text, + int rangeStart, + int rangeEnd, + List<Integer> foundBreaks); + } } -- Frank Yung-Fong Tang 譚永鋒 / 🌭🍊 Sr. Software Engineer |