[icu-support] CollationKey for efficient collation-aware in-place substring comparison

SourceForge Headquarters 1320 Columbia Street Suite 310 San Diego, CA 92101 +1 (858) 422-6466

Hi,

I have a question regarding the use of CollationKey
<https://unicode-org.github.io/icu-docs/apidoc/dev/icu4j/com/ibm/icu/text/CollationKey.html>
to check whether one string "contains" the other (i.e. right string is
found anywhere in the left string, accounting for any specified rule-based
collation using ICU4J). With this, my use case in Java would be something
like: *contains(String left, String right, String collation)*. Suppose that
*collation* here is a parameter indicating the collation at hand (for
example: "Latin1_General_CS_AI"), and is used to get the appropriate
instance of *com.ibm.icu.text.Collator* (exact routing for this collation
is handled elsewhere in the codebase).

Problem description

Due to the nature of this operation, using *Collator.compare(String,
String)* proves inefficient for this problem, because it would require
allocating O(N) substrings of *left *before calling *compare(left.substring(),
right)*. Suppose N here is the length of the *left* string.

Example: *contains*("Abć", "a", "Latin1_General_CS_AI"); // returns false
  - calls: *collator.compare("A", "a")* // returns false ("A" here is
"Abć".substring(0,1))
  - calls: *collator.compare("b", "a")* // returns false ("b" here is
"Abć".substring(1,2))
  - calls: *collator.compare("ć", "a")* // returns false ("ć" here is
"Abć".substring(2,3))
Here, this approach allocates *3 new strings* in order to do the
comparisons.

Using CollationKey

As I understood, *com.ibm.icu.text.CollationKey* is the way to go for
repeated comparison of strings. Here, I would like to compare strings in a
way that only requires generating one key for *left* (let's call it
*leftKey*) and one key for *right* (let's call it *rightKey*), and then
comparing these arrays in-place, byte per byte.

However, it doesn't seem that this operation is supported out-of-the-box
with *CollationKey*. While one can easily use two collation keys
for equality comparison and collation-aware ordering, I'm not sure if this
holds for substring operations as well? Given a collation key for "Abć", is
there a constant-time way to obtain collation keys for "A", "b", and "ć"?
Ideally, I would want to only traverse the "Abć" collation key (*leftKey*)
as a plain byte array, and do in-place comparison with the "ć" collation
key (*rightKey*) as a plain byte array. However, it doesn't seem
straightforward given the structure of the collation key (suffixes, etc.)

public boolean contains(String left, String right, String collation) {
>   Collator collator = ...(collation);
>   // get collation keys
>   CollationKey leftKey = collator.getCollationKey(left);
>   CollationKey rightKey = collator.getCollationKey(right);
>   // get byte arrays
>   byte[] lBytes = leftKey.toByteArray();
>   byte[] rBytes = rightKey.toByteArray();
>   // in-place comparison
>   for (int i = 0; i <= lBytes.length - rBytes.length; i++) {
>     if (compareKeys(lBytes, rBytes, i)) {
>       return true;
>     }
>   }
>   return false;
> }

Suppose there's a simple helper function such as:

> private boolean compareKeys(byte[] lBytes, byte[] rBytes, int offset) {
>   int len = rBytes.length;
>   // compare lBytes[i, i+len] to rBytes[0, len] in-place, byte by byte...
> }

Could you please provide any support regarding how to implement this
solution so that it fully takes into account the collation key byte array
structure? As of now, this simple comparison doesn't work because there are
some suffixes in both *leftKey* and *rightKey*, so exact comparison is not
possible, but I'm wondering if there is a way to go around this.

I appreciate your support, and please let me know if any additional details
could be helpful from my side. Many thanks!

Best,
Uroš

P.S.

Alternative

It turns out that making use of *Collator.compare(Object, **Object**)* instead
of *Collator.compare(String, **String**)* doesn't prove to be any better
either, because it does *toString()* anyway, regressing the performance in
a similar fashion. Ideally, an implementation such as
*Collator.compare(Character, **Character**)* could do the trick, however
only under the condition that it would *not allocate* a new *String* for
the two arguments. This would allow traversing *left* and *right* strings
and comparing individual characters just by using *String.charAt* (with no
extra *String* allocation whatsoever).

However, I don't believe there is currently anything like
*Collator.compare(**Character**, **Character**)* that works exactly like
this. So for now, I'm trying to implement this functionality using
*CollationKey*.

[icu-support] CollationKey for efficient collation-aware in-place substring comparison

Open Source C/C++/Java libraries from Unicode

[icu-support] CollationKey for efficient collation-aware in-place substring comparison