RE: [Exist-open] full-text search problem

Thanks for your suggestions. I also had a look at Apache's Lucene to see
how they solved it, and I found a grammar-based tokenizer
(StandardTokenizer) implemented with JavaCC. It uses a number of rules to
distinguish between various types of numbers, words with apostrophes,
abbreviations and so on, but you need a kind of lookahead parser to do this.
I think it would be great to have these rules configurable by the user,
which would not be possible using JavaCC or ANTLR. Perhaps one possibility
would be to allow users to provide their own tokenizer class (as in Lucene).
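
To make the idea of a user-supplied tokenizer class a bit more concrete,
here is a rough sketch only. The names UserTokenizer, LookaheadTokenizer
and TokenizerLoader are made up for illustration and are not part of eXist
or Lucene. The two rules (apostrophes inside words, separators inside
numbers) are simplified stand-ins for the kind of rule StandardTokenizer
covers, and they use one character of lookahead:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical plug-in point; NOT an existing eXist or Lucene interface.
interface UserTokenizer {
    List<String> tokenize(String text);
}

// Illustrative tokenizer: keeps "don't" and "3.14" in one piece by
// peeking one character ahead before deciding to end a token.
class LookaheadTokenizer implements UserTokenizer {
    @Override
    public List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                current.append(c);
            } else if (current.length() > 0
                    && i + 1 < text.length()
                    && joins(current.charAt(current.length() - 1), c, text.charAt(i + 1))) {
                // Lookahead says the separator belongs inside the token.
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Assumed, simplified rules: an apostrophe between two letters, or a
    // '.' or ',' between two digits, does not end the token.
    private static boolean joins(char prev, char c, char next) {
        if (c == '\'') {
            return Character.isLetter(prev) && Character.isLetter(next);
        }
        if (c == '.' || c == ',') {
            return Character.isDigit(prev) && Character.isDigit(next);
        }
        return false;
    }
}

// Hypothetical loader: instantiates a tokenizer from a class name, for
// example one read from a configuration file.
class TokenizerLoader {
    static UserTokenizer load(String className) throws ReflectiveOperationException {
        return Class.forName(className)
                .asSubclass(UserTokenizer.class)
                .getDeclaredConstructor()
                .newInstance();
    }
}
```

With something like this, the class name could come from the configuration
and anyone who needs different rules would just drop in their own
implementation of the interface.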

Unfortunately I'm already quite booked up for the rest of the weekend, so
if anyone is willing to work on the problem and replace
org.exist.storage.WordTokenizer with a more reasonable solution, you're
very welcome to do so.

Wolfgang
