Re: [icu-support] Piecemeal string comparison

Let me restate what I think you are saying.

You want to compare two perhaps very large strings, according to the
conventions for a given language, to see whether or not they are equal. You
don't care about the ordering (which one is less) if they are not equal.

If you are willing to live with some restrictions on the collation
parameters, you can handle this by using a CollationElementIterator. Here is
a sample code fragment for something that gets processed collation elements.
You can then just compare the ones you get for each string until you get a
difference or run out (CollationElementIterator.NULLORDER).

Note: Vladimir is looking at adding something like this to ICU in the
future.
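
As background for the strength masks used in the fragment, here is a small
sketch of the 32-bit collation element layout that those masks imply: 16 bits
of primary weight, then 8 bits of secondary, then 8 bits of tertiary. The
class and method names (CeLayoutDemo, maskForLevel, sameAtLevel) are made up
for illustration and are not part of ICU, and the sample values are arbitrary
numbers, not real collation elements.

```java
final class CeLayoutDemo {
    /** Primary weight: the top 16 bits of the collation element. */
    static int primary(int ce) {
        return ce >>> 16;
    }

    /** Secondary weight: bits 8 through 15. */
    static int secondary(int ce) {
        return (ce >>> 8) & 0xFF;
    }

    /** Tertiary weight: the low 8 bits. */
    static int tertiary(int ce) {
        return ce & 0xFF;
    }

    /** Mask for level 1 (primary) or 2 (secondary); any other level keeps all bits. */
    static int maskForLevel(int level) {
        switch (level) {
            case 1:
                return 0xFFFF0000;
            case 2:
                return 0xFFFFFF00;
            default:
                return 0xFFFFFFFF;
        }
    }

    /** True if the two elements agree on every weight up to the given level. */
    static boolean sameAtLevel(int ce1, int ce2, int level) {
        int mask = maskForLevel(level);
        return (ce1 & mask) == (ce2 & mask);
    }
}
```

Masking this way throws away the weights below the collator's strength, so
two elements that differ only in, say, the tertiary byte become identical
once a secondary-strength mask is applied.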

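  // Imports needed at the top of the enclosing file (ICU4J):
  //   import com.ibm.icu.text.CollationElementIterator;
  //   import com.ibm.icu.text.Collator;
  //   import com.ibm.icu.text.RuleBasedCollator;

  /**
   * This really ought to be just methods on CollationElementIterator.
   */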
  static class CollationElementIterator2 {
    private CollationElementIterator keyIterator;
    private int strengthMask;
    private int variableTop;
    private int offsetBefore;
    private int offsetAfter;

    public int getOffsetBefore() {
      return offsetBefore;
    }
    public int getOffsetAfter() {
      return offsetAfter;
    }
    public CollationElementIterator2 reset() {
      keyIterator.reset();
      return this;
    }
    public CollationElementIterator2 setOffset(int offset) {
      keyIterator.setOffset(offset);
      return this;
    }
    public CollationElementIterator2 setText(String source) {
      keyIterator.setText(source);
      return this;
    }
    CollationElementIterator2(RuleBasedCollator collator) {
      // gather some information that we will need later
      strengthMask = 0xFFFF0000;
      variableTop = !collator.isAlternateHandlingShifted()
          ? -1
          : collator.getVariableTop() | 0xFFFF;
      // this needs to be fixed a bit for case-level, etc.
      switch (collator.getStrength()) {
        case Collator.PRIMARY:
          strengthMask = 0xFFFF0000;
          break;
        case Collator.SECONDARY:
          strengthMask = 0xFFFFFF00;
          break;
        default:
          strengthMask = 0xFFFFFFFF;
          break;
      }
      keyIterator = collator.getCollationElementIterator("");
    }
    /**
     * This should be a method on CollationElementIterator. Returns the next
     * non-zero collation element, setting offsetBefore and offsetAfter. Should
     * also process shifted and strength, masking as needed. If a collation
     * element has a continuation, then offsetAfter = offsetBefore; for
     * example, if [CE1, CE2] form a single collation element for the
     * characters between native indexes 5 and 8 (CE2 being a continuation),
     * then the result of two calls to nextProcessed would be [CE1, 5, 5] then
     * [CE2, 5, 8].<p>
     * previousProcessed would do similar things, backwards.
     */
    int nextProcessed() {
      while (true) {
        offsetBefore = keyIterator.getOffset();
        int collationElement = keyIterator.next();
        if (collationElement != CollationElementIterator.NULLORDER) {

          // note: the collation element iterator ought to give us processed
          // values, but it doesn't, so we have to simulate that.
          collationElement &= strengthMask; // mask to only the strengths we have
          // check for shifted.
          // TODO This is not exactly right, and we also need to eject any
          // following combining marks, so fix later.
          if (collationElement < variableTop && collationElement > 0xFFFF) {
            continue;
          }
          if (collationElement == 0) {
            continue;
          }

        }
        offsetAfter = keyIterator.getOffset();
        return collationElement;
      }
    }

    int previousProcessed() {
      while (true) {
        offsetAfter = keyIterator.getOffset();
        int collationElement = keyIterator.previous();
        if (collationElement != CollationElementIterator.NULLORDER) {

          // note: the collation element iterator ought to give us processed
          // values, but it doesn't, so we have to simulate that.
          collationElement &= strengthMask; // mask to only the strengths we have
          // check for shifted.
          // TODO This is not exactly right, and we also need to eject any
          // following combining marks, so fix later.
          if (collationElement < variableTop && collationElement > 0xFFFF) {
            continue;
          }
          if (collationElement == 0) {
            continue;
          }

        }
        offsetBefore = keyIterator.getOffset();
        return collationElement;
      }
    }
  }
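
To make the "compare until you get a difference or run out" step concrete,
here is a minimal sketch of the comparison loop. So that it runs without
ICU4J on the classpath, it uses the JDK's java.text classes as a stand-in;
that substitution is an assumption of this example, not what the fragment
above uses. java.text has no alternate (shifted) handling, so that part is
left out. The names CollationEquality, englishCollator and collationEquals
are hypothetical.

```java
import java.text.CollationElementIterator;
import java.text.Collator;
import java.text.RuleBasedCollator;
import java.util.Locale;

final class CollationEquality {
    /** Illustrative helper: an English JDK collator at the given strength. */
    static RuleBasedCollator englishCollator(int strength) {
        // Assumption: the JDK hands back a RuleBasedCollator for this locale,
        // as the java.text documentation examples also assume.
        RuleBasedCollator collator =
            (RuleBasedCollator) Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(strength);
        return collator;
    }

    /** Mask keeping only the weights that matter at the collator's strength. */
    static int strengthMask(Collator collator) {
        switch (collator.getStrength()) {
            case Collator.PRIMARY:
                return 0xFFFF0000;
            case Collator.SECONDARY:
                return 0xFFFFFF00;
            default:
                return 0xFFFFFFFF;
        }
    }

    /** Next element that is non-zero after masking, or NULLORDER at the end. */
    static int nextProcessed(CollationElementIterator it, int mask) {
        while (true) {
            int ce = it.next();
            if (ce == CollationElementIterator.NULLORDER) {
                return ce;
            }
            ce &= mask;
            if (ce != 0) {
                return ce;
            }
            // Element is ignorable at this strength: skip it.
        }
    }

    /** True if both strings yield the same sequence of processed elements. */
    static boolean collationEquals(RuleBasedCollator collator, String a, String b) {
        int mask = strengthMask(collator);
        CollationElementIterator ia = collator.getCollationElementIterator(a);
        CollationElementIterator ib = collator.getCollationElementIterator(b);
        while (true) {
            int ca = nextProcessed(ia, mask);
            int cb = nextProcessed(ib, mask);
            if (ca != cb) {
                return false; // early out at the first difference
            }
            if (ca == CollationElementIterator.NULLORDER) {
                return true; // both ran out at the same time
            }
        }
    }
}
```

With a primary-strength collator, collationEquals(collator, "abc", "ABC")
should report equality, while at the default tertiary strength it should not.
The loop stops at the first differing element, which is the early out the
original question asks about; feeding it text piecemeal instead of from a
String is not shown here.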

On 6/21/07, Doug Doole <do...@ca...> wrote:
>
>
> I need to use ICU collators to compare potentially huge documents (100s of
> GB is not impossible) to determine if they are equal. Inequality isn't
> important - I just need to know if they are equal.
>
> Obviously, materializing the entire document into memory isn't an option.
> We can also assume that 99.99999...% of the time there will be a base
> character difference early in the pair of documents. So is there a
> practical way to do this?
>
> I was pondering implementing a subclass of UnicodeString that could pull
> the string into memory piecemeal and then feed the codeunits to Collator.
> However, if Collator is simply going to build up the entire sort key before
> doing any comparison then this doesn't buy me much. If Collator does have
> an early out, does it matter if I call Collator::compare() vs.
> Collator::equal()? Would I have to implement the entire UnicodeString
> interface or could I leave most of the member functions as dummies? (And
> which member functions would I need to focus on?)
>
> Any advice would be greatly appreciated.
> - Doug

-- 
Mark
