From: Pierrick B. <pie...@cu...> - 2004-05-26 10:23:25
Hi,

David Mundie wrote:

> Here the search pattern is "u092c u093f u0932" (or "bil" in
> transliteration).

Including the famous DEVANAGARI VOWEL SIGN I ;-)

> The problem is that SimpleTokenizer uses isLetter as a test for word
> matching, but the isLetter function returns false for Marks, which is
> what Unicode says Devanagari vowel signs are. The result is that
> SimpleTokenizer treats this search pattern as two tokens ("b" and "l")
> rather than one.

Correct.

> To work around this, I wrote a crude is_mark function that only handles
> Devanagari vowels:
>
>   private final boolean is_mark(char ch) {
>     return (ch > '\u093d' && ch < '\u094c');
>   }

Mmmh... according to:
http://www.fileformat.info/info/unicode/char/093F/index.htm
Character.getType() returns 8, which is COMBINING_SPACING_MARK
(http://java.sun.com/j2se/1.4.2/docs/api/constant-values.html#java.lang).

So... I wonder if the tokenizer shouldn't be made more generic and
consider Character.getType() rather than the is* methods...

> However, at that point things get very mysterious.

It is, indeed :-) I would also be extremely interested in an explanation
of this topic.

> Any insight would be consumed voraciously.

Charming :-)

Cheers,

-- 
Pierrick Brihaye, informaticien
Service régional de l'Inventaire
DRAC Bretagne
mailto:pie...@cu...
+33 (0)2 99 29 67 78
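
To make the Character.getType() idea above concrete, here is a minimal sketch of a character test that accepts the Unicode mark categories alongside letters and digits. The class and method names (MarkAwareTokenizerSketch, isTokenChar, tokenize) are hypothetical and are not taken from SimpleTokenizer; the sketch only assumes that a tokenizer splits on whatever this test rejects.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the real SimpleTokenizer: it only illustrates
// testing Character.getType() instead of relying on isLetter() alone.
final class MarkAwareTokenizerSketch {

    // True for letters, digits, and the three Unicode mark categories
    // (Mn, Me, Mc), so combining vowel signs stay inside their word.
    static boolean isTokenChar(char ch) {
        if (Character.isLetterOrDigit(ch)) {
            return true;
        }
        switch (Character.getType(ch)) {
            case Character.NON_SPACING_MARK:
            case Character.ENCLOSING_MARK:
            case Character.COMBINING_SPACING_MARK:
                return true;
            default:
                return false;
        }
    }

    // Splits text into maximal runs of characters accepted by isTokenChar.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (isTokenChar(ch)) {
                current.append(ch);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The "bil" search pattern from the quoted message.
        System.out.println(tokenize("\u092c\u093f\u0932").size()); // prints 1
    }
}
```

Unlike a range check on Devanagari code points, this test is script-independent, but it works on single char values only, so characters outside the Basic Multilingual Plane would need the int code point variants of these methods.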