#169 Indri Krovetz stemmer not working correctly

Indri (92)

OS: ubuntu 10.04 x64
Indri: 5.2

The kstem implementation in KrovetzStemmer.cpp does not seem to work correctly in many cases. For example, "abacus" and "abandoned" are stemmed to "abacu" and "abandone".
If I understand the source code correctly (sorry, I am not a C++ programmer), the two cases should be processed as follows:

For "abacus":
1. Line 1062: check whether "abacus" is in the dictionary; if it is, leave the while loop without checking for plurals and so on.
2. At Line 1390, "abacus" is listed as one of the headwords, which are added to the dictionary (Lines 22949-22951). So the lookup should succeed, the while loop should exit, and "abacus" should be returned as the stem. Instead, the program goes on to check for plurals at Line 1063.

For "abandoned":
1. Before Line 1065, the word should still be "abandoned";
2. Then, in past_tense(), it enters the block at Lines 272-324;
3. First, it tries removing "d" and checks whether "abandone" is in the dictionary and is not one of the exceptions (Lines 273-279);
4. Then, it tries removing "ed" and checks whether "abandon" is in the dictionary (Lines 281-285). Because "abandon" is one of the headwords (Line 1392) and has been added to the dictionary, it should stop checking and return at 279. Instead, the program keeps running until it finally returns "abandone" by default (Lines 317-323).

I guess this problem may come from getdep(char*) (Lines 104-116).

I am not sure whether this is caused by an error in my C++ program that outputs the results (I've attached the program) or by the compilation process. But I did follow the README and copied Makefile.app to compile the program.

I also tested this another way: I created a document containing the two words "abacus" and "abandoned", indexed it with IndriBuildIndex (using the krovetz stemmer), and then read the indexed document from Java. The two words are still indexed as "abacu" and "abandone".

Or did I misunderstand parts of the source code?


  • David Fisher - 2012-01-09

    G++ 4.4+ requires std::hash<std::string> rather than std::hash<const char *> for the unordered map. This is a behavior change from G++ 4.3.x.

    To ship in the 06/2012 release. Update include/indri/KrovetzStemmer.hpp from subversion to get the fix before then.

    Last edit: David Fisher 2013-11-20
