From: Neal R. <ne...@ri...> - 2002-12-10 00:18:22
> - Given the flag to disable stemming, what is the
> disadvantage of simply making it a three-value flag:
> index unstemmed words, index stems, index both?

Sure. Geoff wants a default of 'unstemmed words' (the current
method), which I agree with.

> - The format you describe sounds like a "half-inverted"
> file -- listing locations *within* a document by word, but
> listing *document* locations by document. Is that
> correct?

In the proposed index, only the word+document pair is the 'key';
the remaining parts are in the 'value'. I'm not sure what you mean
by 'document locations' here.. please clarify!

> - You said that the approach currently taken by fuzzy
> endings is uncharted waters. I assume you are talking
> about the approach of simply creating a disjunction of
> the derived words. What is hard to "get right" about
> that? In terms of the documents returned, it sounds the
> same as what you have proposed. In terms of
> implementation, it sounds like what 'fuzzy endings' does
> now, except for fixing the stemming.

Our 'fuzzy endings' algorithm is in the class of "morphological
analysis" algorithms. These algorithms are frequently studied, and
there are many good packages. Most of them cost big money, are very
complex and very language-specific, and took many years of research..
and they are not open source. Morphological analysis is pretty
cutting-edge in NLP, and still mostly unsolved.

What is hard to 'get right' is the general problem of generating
correct variants of a given word. The stemming algorithms are quite
complex, with many rules for generating stems. My gut feeling is that
the number of rules in the 'fuzzy endings' algorithm needs to be on
par with, or exceed, that of the same-language stemmer.

1. Stemming is a known quantity with known performance.
2. We have 10+ languages available NOW.

Morphological analysis is a promising approach, as it could generate
better endings and avoid the situation you detail below.
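To make the two ideas above concrete, here is a minimal sketch of a
three-value stemming flag plus a "half-inverted" posting structure
keyed by (word, document), with the word's positions inside the
document as the value. All names are hypothetical, and the trivial
stemmer is a stand-in for a real one (e.g. a Porter stemmer), not
the actual indexer code:

```python
from enum import Enum
from collections import defaultdict

class StemMode(Enum):
    UNSTEMMED = 1   # current default: index words as written
    STEMS = 2       # index only the stemmed form
    BOTH = 3        # index word and stem side by side

def trivial_stem(word):
    # Stand-in for a real stemmer; just strips a plural 's'.
    return word[:-1] if word.endswith('s') else word

def index_document(index, doc_id, words, mode=StemMode.UNSTEMMED):
    # Postings are keyed by (term, doc_id); the value is the list
    # of positions of that term within the document.
    for pos, word in enumerate(words):
        terms = set()
        if mode in (StemMode.UNSTEMMED, StemMode.BOTH):
            terms.add(word)
        if mode in (StemMode.STEMS, StemMode.BOTH):
            terms.add(trivial_stem(word))
        for term in terms:
            index[(term, doc_id)].append(pos)

index = defaultdict(list)
index_document(index, 1, ['cats', 'chase', 'cats'], StemMode.BOTH)
# Both the surface form and the stem now carry the same positions:
# ('cats', 1) -> [0, 2] and ('cat', 1) -> [0, 2]
```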
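To illustrate why variant generation is hard to 'get right', here is
a toy suffix-swap rule set in the 'fuzzy endings' style (the rules
are invented for illustration, not the real algorithm). With only a
few rules it already overgenerates nonsense forms, and it produces
'merciless' from 'mercy' even though that reverses the meaning:

```python
# Toy suffix-swap rules: strip the key, append each replacement.
# Real morphological analyzers need far more rules plus exception
# lists; these few already misbehave.
SUFFIX_RULES = {
    'y': ['ies', 'iful', 'iless'],
    'e': ['ing', 'ed', 'es'],
    '':  ['s', 'ing', 'ed'],
}

def fuzzy_variants(word):
    variants = {word}
    for suffix, replacements in SUFFIX_RULES.items():
        if word.endswith(suffix):
            stem = word[:len(word) - len(suffix)] if suffix else word
            for rep in replacements:
                variants.add(stem + rep)
    return variants

# fuzzy_variants('mercy') includes the useful 'merciful' and
# 'mercies', but also the meaning-reversing 'merciless' and the
# overgenerated junk 'mercys', 'mercying', 'mercyed'.
```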
It would take a lot of work to make this algorithm match or exceed the
generalization ability of the stemmers. I would actually encourage
whoever has worked on the algorithm to consider writing an academic
paper on it for publication. There would need to be a comparative study
of it vs. stemming.. but if you can show that it outperforms stemming
in IR precision/recall, great! SEE BELOW FOR SOME REFERENCES!

The AI researcher in me wants to explore the fuzzy-endings algorithm..
the conservative software engineer side wants to go with proven IR &
NLP techniques first.

> - With stemming in general, what is done about negating
> affixes? If I searched for 'mercy', I wouldn't want
> results about 'merciless' (although I would want results
> about 'merciful').

That is part of the downside of stemming. The hope is that indexing
both unstemmed and stemmed words, and combining them in the score,
would get the correct result most of the time.

Thanks!

REFERENCES:

David A. Hull. Stemming Algorithms: A Case Study for Detailed
Evaluation (1996). http://citeseer.nj.nec.com/hull96stemming.html

David A. Hull, Gregory Grefenstette. A Detailed Analysis of English
Stemming Algorithms (1996).
http://citeseer.nj.nec.com/hull96detailed.html

Wessel Kraaij. Viewing Stemming as Recall Enhancement (1996).
http://citeseer.nj.nec.com/kraaij96viewing.html

Wessel Kraaij, Rene Pohlmann. Using Linguistic Knowledge in Information
Retrieval (1996). http://citeseer.nj.nec.com/kraaij96using.html

There is also this frequently cited paper, which I can't find on the
web:

Harman, Donna (1991). How Effective is Suffixing? Journal of the
American Society for Information Science, 42(1), 7-15.

Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485