From: Wolfgang M. <sb...@we...> - 2002-02-15 17:55:28
Thanks for your suggestions. I also had a look at Apache's Lucene to see how they solved it, and I found a grammar-based tokenizer (StandardTokenizer) implemented with JavaCC. It uses a number of rules to distinguish between various types of numbers, words with apostrophes, abbreviations and so on, but you need a kind of lookahead parser to do this.

I think it would be great to have these rules configurable by the user, which would not be possible with JavaCC or ANTLR, since both turn the grammar into Java code at build time. Perhaps one possibility would be to allow users to provide their own tokenizer class (as in Lucene).

Unfortunately I'm already fully booked for the rest of the weekend, so if anyone is willing to work on the problem and replace org.exist.storage.WordTokenizer with a more reasonable solution, you are very welcome to do so.

Wolfgang
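To make the "user-provided tokenizer class" idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the names TextTokenizer, RuleTokenizer and Tokenizers are hypothetical and are neither eXist nor Lucene APIs, and the regex rules are simple stand-ins, not the StandardTokenizer grammar. The rule table is plain data, so it could be read from a configuration file at run time, and the loader picks the tokenizer implementation by class name.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical plug-in contract; not an existing eXist or Lucene interface. */
interface TextTokenizer {
    /** Returns tokens formatted as "TYPE:text", in document order. */
    List<String> tokenize(String text);
}

/** Tokenizer driven by an ordered table of regex rules (earlier rules win). */
class RuleTokenizer implements TextTokenizer {
    private final Map<String, Pattern> rules = new LinkedHashMap<>();

    /** Illustrative default rules; a real rule set would need more care. */
    RuleTokenizer() {
        rules.put("NUM", Pattern.compile("\\d+(?:[.,]\\d+)*"));
        rules.put("ACRONYM", Pattern.compile("(?:\\p{L}\\.){2,}"));
        rules.put("APOSTROPHE", Pattern.compile("\\p{L}+(?:'\\p{L}+)+"));
        rules.put("WORD", Pattern.compile("\\p{L}+"));
    }

    /** User-supplied rules: token type -> regex, in priority order. */
    RuleTokenizer(LinkedHashMap<String, String> userRules) {
        for (Map.Entry<String, String> e : userRules.entrySet()) {
            rules.put(e.getKey(), Pattern.compile(e.getValue()));
        }
    }

    @Override
    public List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            String type = null;
            int end = pos;
            // Try each rule at the current position; the first match wins.
            for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
                Matcher m = rule.getValue().matcher(text).region(pos, text.length());
                if (m.lookingAt() && m.end() > pos) {
                    type = rule.getKey();
                    end = m.end();
                    break;
                }
            }
            if (type == null) {
                pos++; // no rule applies here: treat the character as a separator
            } else {
                tokens.add(type + ":" + text.substring(pos, end));
                pos = end;
            }
        }
        return tokens;
    }
}

/** Loads a tokenizer by class name, e.g. a name taken from a config file. */
final class Tokenizers {
    private Tokenizers() {}

    static TextTokenizer load(String className) {
        try {
            return Class.forName(className)
                    .asSubclass(TextTokenizer.class)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot load tokenizer " + className, e);
        }
    }
}
```

With this shape, the indexer would only depend on the TextTokenizer interface, and anyone could swap in their own class or rule table without touching a generated parser. The trade-off is that ordered regexes are less expressive than a real grammar, so this is a starting point rather than a replacement for the JavaCC approach.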