Re: [Exist-open] Mysterious Behavior with Unicode Searching

Hi,

David Mundie wrote:

 > Here the search pattern is "u092c u093f u0932" (or "bil" in
 > transliteration).

Including the famous DEVANAGARI VOWEL SIGN I ;-)

 > The problem is that SimpleTokenizer uses isLetter as a test for word
 > matching, but the isLetter function returns false for Marks, which is
 > what Unicode says Devanagari vowel signs are. The result is that
 > SimpleTokenizer treats this search pattern as two tokens ("b" and "l")
 > rather than one.

Correct.
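
To make the split concrete, here is a minimal sketch of a tokenizer that accepts only characters passing Character.isLetter, as described above. The class and its tokenize method are hypothetical illustrations, not eXist's actual SimpleTokenizer code:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; it only mimics the isLetter-based behaviour
// described in this thread and is not eXist's SimpleTokenizer.
class IsLetterSplitDemo {

    // A token is a maximal run of chars accepted by Character.isLetter.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (Character.isLetter(ch)) {
                current.append(ch);
            } else if (current.length() > 0) {
                // A non-letter (here: the vowel sign) ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String bil = "\u092c\u093f\u0932"; // BA, VOWEL SIGN I, LA
        System.out.println("isLetter(U+093F) = " + Character.isLetter('\u093f'));
        System.out.println("token count = " + tokenize(bil).size());
    }
}
```

Since U+093F is a Mark rather than a Letter, the run is cut in the middle and the pattern comes out as two one-character tokens.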

 > To work around this, I wrote a crude is_mark function that only handles
 > Devanagari vowels:
 >
 > private final boolean is_mark(char ch) {
 > return (ch > '\u093d' && ch < '\u094c');
 > }

Mmmh... according to:
http://www.fileformat.info/info/unicode/char/093F/index.htm

Character.getType() returns 8, which is COMBINING_SPACING_MARK
(http://java.sun.com/j2se/1.4.2/docs/api/constant-values.html#java.lang).

So... I wonder if the tokenizer shouldn't be made more generic and
consider Character.getType() rather than the is* methods...
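
Something along these lines, perhaps. This is only a rough sketch of the idea: the class name and the isMark/isWordChar helpers are made up for illustration, and I am assuming the tokenizer would simply accept any of the three Unicode Mark categories in addition to letters:

```java
// Hypothetical helper; names are illustrative and not taken from eXist.
class MarkAwareCharTest {

    // True for any Unicode Mark category (Mn, Mc, Me), not only the
    // Devanagari vowel signs covered by a hard-coded range.
    static boolean isMark(char ch) {
        switch (Character.getType(ch)) {
            case Character.NON_SPACING_MARK:
            case Character.COMBINING_SPACING_MARK:
            case Character.ENCLOSING_MARK:
                return true;
            default:
                return false;
        }
    }

    // Assumed word-character test: letters plus the marks attached to them.
    static boolean isWordChar(char ch) {
        return Character.isLetter(ch) || isMark(ch);
    }

    public static void main(String[] args) {
        String bil = "\u092c\u093f\u0932"; // BA, VOWEL SIGN I, LA
        for (int i = 0; i < bil.length(); i++) {
            char ch = bil.charAt(i);
            System.out.println(Integer.toHexString(ch)
                    + " type=" + Character.getType(ch)
                    + " wordChar=" + isWordChar(ch));
        }
    }
}
```

With such a test, all three characters of the pattern would count as word characters, and signs outside the hard-coded range (the virama U+094D, for instance) would be covered as well. Whether that is the right rule for every script is a separate question.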

 > However, at that point things get very mysterious.

It is, indeed :-) I would also be extremely interested in an explanation
on this topic.

> Any insight would be consumed voraciously.

Charming :-)

Cheers,

-- 
Pierrick Brihaye, informaticien
Service régional de l'Inventaire
DRAC Bretagne
mailto:pie...@cu...
+33 (0)2 99 29 67 78
