OS: Ubuntu 10.04 x64
I find that the kstem implementation in KrovetzStemmer.cpp does not seem to work correctly in many cases, for example "abacus" and "abandoned".
If I understand the source code correctly (sorry, I am not a C++ programmer), the process for the two cases should be as follows.
For "abacus":
1. Line 1062: check whether "abacus" is in the dictionary; if it is, exit the while loop without checking for plurals and so on.
2. At Line 1390, "abacus" is listed as one of the headwords, so it is added to the dictionary (Line 22949-22951). The program should therefore exit the while loop and return "abacus" as the stem. Instead, I find that it goes on to check for plurals at Line 1063.
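To make the behaviour I expect for "abacus" concrete, here is a minimal sketch of the control flow as I understand it. It is not the code from KrovetzStemmer.cpp: toyDict, stripPlural and expectedStem are hypothetical names, the dictionary is a two-word stand-in for the real headword table, and the plural rule is deliberately naive.

```cpp
#include <string>
#include <unordered_set>

// Hypothetical stand-in for the stemmer's headword dictionary (assumption).
static const std::unordered_set<std::string> toyDict = {"abacus", "abandon"};

// Naive plural rule, reached only when the word is not a headword.
std::string stripPlural(const std::string& word) {
    if (word.size() > 1 && word.back() == 's')
        return word.substr(0, word.size() - 1);
    return word;
}

// Expected flow: a dictionary hit returns the word unchanged,
// so no suffix rule ever sees it.
std::string expectedStem(const std::string& word) {
    if (toyDict.count(word) > 0)
        return word;  // headword: skip plural handling entirely
    return stripPlural(word);
}
```

With this flow, expectedStem("abacus") gives back "abacus" because the lookup succeeds first, while a word that is not in the toy dictionary, such as "tables", loses its final "s". The "abacu" I actually get is what would happen if the lookup failed and the plural rule ran anyway.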
For "abandoned":
1. Before Line 1065, the word should still be "abandoned".
2. Then, in past_tense(), it enters the loop at Line 272-324.
3. First, it tries removing "d" and checks whether "abandone" is in the dictionary and is not one of the exceptions (Line 273-279).
4. Then, it tries removing "ed" and checks whether "abandon" is in the dictionary (Line 281-285). Because "abandon" is one of the headwords (Line 1392) and has been added to the dictionary, it should stop checking and return at 279. Instead, I find that the program keeps running until it finally returns "abandone" by default (Line 317-323).
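The steps above can be sketched roughly as follows. This is only my reading of the logic, not the real past_tense(): toyHeadwords, endsWith and expectedPastTense are hypothetical names, the dictionary holds just two words, and the exception list and any other rules in the real loop are left out.

```cpp
#include <string>
#include <unordered_set>

// Hypothetical stand-in for the headword dictionary (assumption, not the real table).
static const std::unordered_set<std::string> toyHeadwords = {"abandon", "abacus"};

// True if word ends with suffix and is strictly longer than it.
bool endsWith(const std::string& word, const std::string& suffix) {
    return word.size() > suffix.size() &&
           word.compare(word.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Expected past-tense handling for words ending in "ed".
std::string expectedPastTense(const std::string& word) {
    if (!endsWith(word, "ed"))
        return word;
    // Step 3: drop only "d" and accept the result if it is a headword.
    std::string minusD = word.substr(0, word.size() - 1);
    if (toyHeadwords.count(minusD) > 0)
        return minusD;
    // Step 4: drop "ed" and accept the result if it is a headword.
    std::string minusEd = word.substr(0, word.size() - 2);
    if (toyHeadwords.count(minusEd) > 0)
        return minusEd;
    // Default when no lookup succeeds: keep the trailing "e".
    return minusD;
}
```

Under these assumptions expectedPastTense("abandoned") returns "abandon" at step 4. The default branch is only reached when both lookups fail, which is the path that yields "abandone" in my runs, so it looks as if the lookup for "abandon" is not succeeding.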
I guess this problem may come from getdep(char*) (Line 104-116).
I am not sure whether this is caused by an error in my own C++ program that outputs the results (I've attached the program) or by the compilation process. I did follow the readme and copied Makefile.app to compile the program.
I've also tested this using an alternative method: I created a document containing the two words "abacus" and "abandoned", indexed it with IndriBuildIndex (using the krovetz stemmer), and then read the indexed document from Java. The two words are still indexed as "abacu" and "abandone".
Or did I misunderstand parts of the source code?