|
From: Uroš B. <uro...@da...> - 2024-02-26 11:39:35
|
Hi, I have a question regarding the use of CollationKey <https://unicode-org.github.io/icu-docs/apidoc/dev/icu4j/com/ibm/icu/text/CollationKey.html> to check whether one string "contains" the other (i.e. right string is found anywhere in the left string, accounting for any specified rule-based collation using ICU4J). With this, my use case in Java would be something like: *contains(String left, String right, String collation)*. Suppose that *collation* here is a parameter indicating the collation at hand (for example: "Latin1_General_CS_AI"), and is used to get the appropriate instance of *com.ibm.icu.text.Collator* (exact routing for this collation is handled elsewhere in the codebase). Problem description Due to the nature of this operation, using *Collator.compare(String, String)* proves inefficient for this problem, because it would require allocating O(N) substrings of *left *before calling *compare(left.substring(), right)*. Suppose N here is the length of the *left* string. Example: *contains*("Abć", "a", "Latin1_General_CS_AI"); // returns false - calls: *collator.compare("A", "a")* // returns false ("A" here is "Abć".substring(0,1)) - calls: *collator.compare("b", "a")* // returns false ("b" here is "Abć".substring(1,2)) - calls: *collator.compare("ć", "a")* // returns false ("ć" here is "Abć".substring(2,3)) Here, this approach allocates *3 new strings* in order to do the comparisons. Using CollationKey As I understood, *com.ibm.icu.text.CollationKey* is the way to go for repeated comparison of strings. Here, I would like to compare strings in a way that only requires generating one key for *left* (let's call it *leftKey*) and one key for *right* (let's call it *rightKey*), and then comparing these arrays in-place, byte per byte. However, it doesn't seem that this operation is supported out-of-the-box with *CollationKey*. While one can easily use two collation keys for equality comparison and collation-aware ordering, I'm not sure if this holds for substring operations as well? Given a collation key for "Abć", is there a constant-time way to obtain collation keys for "A", "b", and "ć"? Ideally, I would want to only traverse the "Abć" collation key (*leftKey*) as a plain byte array, and do in-place comparison with the "ć" collation key (*rightKey*) as a plain byte array. However, it doesn't seem straightforward given the structure of the collation key (suffixes, etc.) public boolean contains(String left, String right, String collation) { > Collator collator = ...(collation); > // get collation keys > CollationKey leftKey = collator.getCollationKey(left); > CollationKey rightKey = collator.getCollationKey(right); > // get byte arrays > byte[] lBytes = leftKey.toByteArray(); > byte[] rBytes = rightKey.toByteArray(); > // in-place comparison > for (int i = 0; i <= lBytes.length - rBytes.length; i++) { > if (compareKeys(lBytes, rBytes, i)) { > return true; > } > } > return false; > } Suppose there's a simple helper function such as: > private boolean compareKeys(byte[] lBytes, byte[] rBytes, int offset) { > int len = rBytes.length; > // compare lBytes[i, i+len] to rBytes[0, len] in-place, byte by byte... > } Could you please provide any support regarding how to implement this solution so that it fully takes into account the collation key byte array structure? As of now, this simple comparison doesn't work because there are some suffixes in both *leftKey* and *rightKey*, so exact comparison is not possible, but I'm wondering if there is a way to go around this. I appreciate your support, and please let me know if any additional details could be helpful from my side. Many thanks! Best, Uroš P.S. Alternative It turns out that making use of *Collator.compare(Object, **Object**)* instead of *Collator.compare(String, **String**)* doesn't prove to be any better either, because it does *toString()* anyway, regressing the performance in a similar fashion. Ideally, an implementation such as *Collator.compare(Character, **Character**)* could do the trick, however only under the condition that it would *not allocate* a new *String* for the two arguments. This would allow traversing *left* and *right* strings and comparing individual characters just by using *String.charAt* (with no extra *String* allocation whatsoever). However, I don't believe there is currently anything like *Collator.compare(**Character**, **Character**)* that works exactly like this. So for now, I'm trying to implement this functionality using *CollationKey*. |