From: Mark D. <mar...@ic...> - 2007-06-21 23:15:45
Let me restate what I think you are saying. You want to compare two possibly very large strings, according to the conventions of a given language, to see whether or not they are equal. You don't care about the ordering (which one is less) if they are not equal.

If you are willing to live with some restrictions on the collation parameters, you can handle this by using a CollationElementIterator. Here is a sample code fragment for something that returns processed collation elements. You can then just compare the ones you get for each string until you hit a difference or run out (CollationElementIterator.NULLORDER).

Note: Vladimir is looking at adding something like this to ICU in the future.

// Imports the fragment needs (ICU4J); the class itself is meant to be nested in your own class.
import com.ibm.icu.text.CollationElementIterator;
import com.ibm.icu.text.Collator;
import com.ibm.icu.text.RuleBasedCollator;

/**
 * This really ought to be just methods on CollationElementIterator.
 */
static class CollationElementIterator2 {
    private CollationElementIterator keyIterator;
    private int strengthMask;
    private int variableTop;
    private int offsetBefore;
    private int offsetAfter;

    public int getOffsetBefore() {
        return offsetBefore;
    }

    public int getOffsetAfter() {
        return offsetAfter;
    }

    public CollationElementIterator2 reset() {
        keyIterator.reset();
        return this;
    }

    public CollationElementIterator2 setOffset(int offset) {
        keyIterator.setOffset(offset);
        return this;
    }

    public CollationElementIterator2 setText(String source) {
        keyIterator.setText(source);
        return this;
    }

    CollationElementIterator2(RuleBasedCollator collator) {
        // gather some information that we will need later
        strengthMask = 0xFFFF0000;
        variableTop = !collator.isAlternateHandlingShifted()
                ? -1
                : collator.getVariableTop() | 0xFFFF;
        // this needs to be fixed a bit for case-level, etc.
        switch (collator.getStrength()) {
        case Collator.PRIMARY:
            strengthMask = 0xFFFF0000;
            break;
        case Collator.SECONDARY:
            strengthMask = 0xFFFFFF00;
            break;
        default:
            strengthMask = 0xFFFFFFFF;
            break;
        }
        keyIterator = collator.getCollationElementIterator("");
    }

    /**
     * This should be a method on CollationElementIterator. Returns next
     * non-zero collation element, setting offsetBefore, offsetAfter. Should also
     * process shifted and strength, masking as needed. If a collation element
     * has a continuation, then offsetAfter = offsetBefore; for example, if
     * [CE1,CE2] form a single collation element for the characters between
     * native indexes 5 and 8 (CE2 being a continuation), then the result of two
     * calls to nextProcessed would be [CE1, 5, 5] then [CE2, 5, 8].<p>
     * previousProcessed would do similar things, backwards.
     */
    int nextProcessed() {
        while (true) {
            offsetBefore = keyIterator.getOffset();
            int collationElement = keyIterator.next();
            if (collationElement != CollationElementIterator.NULLORDER) {
                // note: the collation element iterator ought to give us processed values, but it doesn't
                // so we have to simulate that.
                collationElement &= strengthMask; // mask to only the strengths we have
                // check for shifted.
                // TODO This is not exactly right, and we also need to eject any following combining marks,
                // so fix later.
                if (collationElement < variableTop && collationElement > 0xFFFF) {
                    continue;
                }
                if (collationElement == 0) {
                    continue;
                }
            }
            // NULLORDER (end of text) falls through and is returned unchanged
            offsetAfter = keyIterator.getOffset();
            return collationElement;
        }
    }

    int previousProcessed() {
        while (true) {
            offsetAfter = keyIterator.getOffset();
            int collationElement = keyIterator.previous();
            if (collationElement != CollationElementIterator.NULLORDER) {
                // note: the collation element iterator ought to give us processed values, but it doesn't
                // so we have to simulate that.
                collationElement &= strengthMask; // mask to only the strengths we have
                // check for shifted.
                // TODO This is not exactly right, and we also need to eject any following combining marks,
                // so fix later.
                if (collationElement < variableTop && collationElement > 0xFFFF) {
                    continue;
                }
                if (collationElement == 0) {
                    continue;
                }
            }
            // NULLORDER (start of text) falls through and is returned unchanged
            offsetBefore = keyIterator.getOffset();
            return collationElement;
        }
    }
}

On 6/21/07, Doug Doole <do...@ca...> wrote:
>
> I need to use ICU collators to compare potentially huge documents (100s of
> GB is not impossible) to determine if they are equal. Inequality isn't
> important - I just need to know if they are equal.
>
> Obviously, materializing the entire document into memory isn't an option.
> We can also assume that 99.99999...% of the time there will be a base
> character difference early in the pair of documents. So is there a
> practical way to do this?
>
> I was pondering implementing a subclass of UnicodeString that could pull
> the string into memory piecemeal and then feed the codeunits to Collator.
> However, if Collator is simply going to build up the entire sort key before
> doing any comparison then this doesn't buy me much. If Collator does have
> an early out, does it matter if I call Collator::compare() vs.
> Collator::equal()? Would I have to implement the entire UnicodeString
> interface or could I leave most of the member functions as dummies? (And
> which member functions would I need to focus on?)
>
> Any advice would be greatly appreciated.
> - Doug

--
Mark
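
For illustration only, the comparison loop described in the reply (walk the collation elements of both strings until they differ or one runs out) might look roughly like the sketch below. This is not ICU code and not the fragment above: to stay self-contained it swaps in the JDK's java.text.RuleBasedCollator and java.text.CollationElementIterator, compares primary weights only, and ignores shifted/variable handling. The names CollationEquality, nextPrimary, primaryEqual, and sampleCollator are hypothetical, as is the tiny rule string.

```java
import java.text.CollationElementIterator;
import java.text.ParseException;
import java.text.RuleBasedCollator;

// Hypothetical helper class; a sketch using the JDK collator, not ICU4J.
class CollationEquality {

    // Assumed toy tailoring: a/A, b/B, c/C differ only at the tertiary level.
    static RuleBasedCollator sampleCollator() {
        try {
            return new RuleBasedCollator("< a, A < b, B < c, C");
        } catch (ParseException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns the next non-zero primary weight, or NULLORDER at end of text.
    // Elements whose primary weight is zero are skipped as ignorable.
    static int nextPrimary(CollationElementIterator it) {
        while (true) {
            int ce = it.next();
            if (ce == CollationElementIterator.NULLORDER) {
                return ce;
            }
            int primary = CollationElementIterator.primaryOrder(ce);
            if (primary != 0) {
                return primary;
            }
        }
    }

    // Compares element by element and stops at the first difference,
    // so no sort key is ever built for either string.
    static boolean primaryEqual(RuleBasedCollator collator, String a, String b) {
        CollationElementIterator ia = collator.getCollationElementIterator(a);
        CollationElementIterator ib = collator.getCollationElementIterator(b);
        while (true) {
            int ca = nextPrimary(ia);
            int cb = nextPrimary(ib);
            if (ca != cb) {
                return false; // early out on first differing element
            }
            if (ca == CollationElementIterator.NULLORDER) {
                return true; // both ran out at the same time
            }
        }
    }
}
```

With the toy rules above, primaryEqual(sampleCollator(), "abc", "ABC") treats the two strings as equal because they differ only in case, while "abc" versus "acb" stops at the second element. For very large documents the same loop would need to be fed text in pieces rather than from whole String objects; that part is not shown here.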