2005-11-20 10:09:24 UTC
Hello,
I just found out about tsearch2, and it is very nice: thanks to Oleg and Teodor. This question is strictly about tsearch2 and PostgreSQL rather than OpenFTS, so please pardon me if this is not the right forum, and thanks in advance for pointing me to the right one.
I am confused about the status of UTF-8 in tsearch2. I read
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/tsearch2_german_utf8.html by Markus Wollny (and some other articles, in Czech for instance), but it left me with more questions than answers:
- This document says that the ispell ".dict" file should be encoded in UTF-8; however, the linked German dictionary appears to be in ISO Latin 1.
- The affix file (.aff) apparently cannot be encoded in UTF-8: when it is, the regexes are not parsed correctly. Should it be encoded in ISO Latin 1 but include UTF-8 character descriptions? If so, how?
- What about the Snowball stemmers? I've only found ISO Latin code, and the German howto by Markus doesn't say much about this.
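
For anyone who wants to check a dictionary's encoding themselves, here is a minimal Python sketch of one way to tell the two encodings apart. The helper name `looks_like_utf8` and the sample word are only illustrative; nothing here comes from tsearch2 itself.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The same word encoded both ways: ISO Latin 1 uses one byte per
# accented character, UTF-8 uses two.
latin1_sample = "Größe".encode("latin-1")
utf8_sample = "Größe".encode("utf-8")

print(looks_like_utf8(latin1_sample))  # False
print(looks_like_utf8(utf8_sample))    # True
```

For a real file, the bytes would come from something like `open(path, "rb").read()`, where `path` is a placeholder. A file containing only ASCII passes this check under either encoding, so the result is only meaningful when the file actually contains accented characters.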
I'm using PostgreSQL 8.1 and trying to index a UTF-8 database of French data. In my tests, lexize and ts_debug always strip the accented characters.
Am I missing something? Would I be better off re-encoding my database in ISO Latin?
Thanks in advance for any hint.
marco