
#188 Repository::processTerm breaks idempotency of k-stem

Milestone: v5.x
Status: wont-fix
Labels: Indri (93)
Priority: 5
Updated: 2012-10-19
Created: 2012-05-17
Creator: Anonymous
Private: No

If a word is k-stemmed into a term that is in the stopword list, then processTerm(word) != processTerm(processTerm(stemmedword)). This causes unexpected behaviour in one specific case:

If a document contains the word "having", but "have" is in the index's stopword list, the following code will yield unexpected results. (Excuse the mangled syntax.)

// get the document vector for docno 1 (which has the word "having" in it)
std::vector<lemur::api::DOCID_T> docIDs( 1, 1 );
std::vector<indri::api::DocumentVector*> vectors = env.documentVectors( docIDs );
indri::api::DocumentVector* vec = vectors[0];

// iterate through stems in doc vector and print the document frequency of each
for( size_t i = 0; i < vec->positions().size(); i++ ) {
    std::cout << env.documentCount( vec->stems()[i] ) << std::endl;
}

In the above code, when you reach the stem "have" (the stemmed version of "having"), 0 is printed, because processTerm treats "have" as a stopword. Perhaps processTerm should first check whether there is an inverted list for the term before filtering out stopwords, or not stop the term at all?
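To make the failure mode concrete, here is a minimal sketch of a stop-then-stem pipeline. It is a toy model only: toyProcessTerm, kStopwords, and kStems are hypothetical stand-ins, not Indri's Repository::processTerm, Krovetz stemmer, or stopper, and it assumes the stopword check is applied to the input before stemming, as the behaviour described above suggests.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Toy stand-ins (hypothetical): NOT Indri's real stopper or Krovetz stemmer.
static const std::set<std::string> kStopwords = {"have"};
static const std::map<std::string, std::string> kStems = {{"having", "have"}};

// Hypothetical model of a stop-then-stem pipeline: returns "" when the
// input is a stopword, otherwise the stemmed form of the input.
std::string toyProcessTerm(const std::string& term) {
    if (kStopwords.count(term)) return "";          // stopped
    std::map<std::string, std::string>::const_iterator it = kStems.find(term);
    return it == kStems.end() ? term : it->second;  // stemmed (or unchanged)
}

// Checks the non-idempotent case from the report: "having" survives the
// stopper and stems to "have", but feeding "have" back in gets it stopped.
void checkNonIdempotent() {
    const std::string once = toyProcessTerm("having");
    const std::string twice = toyProcessTerm(once);
    assert(once == "have");
    assert(twice.empty());
    assert(once != twice);
}
```

In this model the indexed stem is "have", yet looking that stem up through the same pipeline yields an empty term, which is consistent with the count of 0 reported above.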

Attached is a small document which can be used in conjunction with the parameter file below to create an index that exhibits this behaviour. Another way to see the behaviour without writing code is to run "dumpindex bad_kstem_index tp have" (in contrast to "dumpindex bad_kstem_index tp having").

I am running the most recent version of indri, 5.2.

<parameters>
  <index>/home/yubink/bad_kstem</index>
  <corpus>
    <path>/home/yubink/bad_kstem.txt</path>
    <class>trectext</class>
  </corpus>
  <field> <name>TEXT</name> </field>
  <stemmer><name>Krovetz</name></stemmer>
  <stopper>
    <word>have</word>
  </stopper>
</parameters>

Discussion

  • Anonymous

    Anonymous - 2012-05-17

    Sorry, processTerm(word) != processTerm(processTerm(stemmedword)) should be processTerm(word) != processTerm(processTerm(word)).

     

    Last edit: Anonymous 2013-09-08
  • David Fisher

    David Fisher - 2012-05-17

    1) In general processTerm(t) != processTerm(processTerm(t)), nor should the two be equal, both for the reason you cite above and because the stemmer may well stem a stem to a new stem.

    2) When using stemmed terms, such as those obtained from the DocumentVector, you should use the stemmed-count APIs (documentStemCount, stemFieldCount, and stemCount), which are intended for use with terms that have already been stemmed and stopped. For example:

    INT64 indri::api::QueryEnvironment::documentStemCount( const std::string& term )

    Your code above should be:

    // get the document vector for docno 1 (which has the word "having" in it)
    std::vector<lemur::api::DOCID_T> docIDs( 1, 1 );
    std::vector<indri::api::DocumentVector*> vectors = env.documentVectors( docIDs );
    indri::api::DocumentVector* vec = vectors[0];

    // iterate through stems in doc vector and print the document frequency of each
    for( size_t i = 1; i < vec->stems().size(); i++ ) { // skip the OOV term
        // the stems are already stemmed and stopped, so use the stem-count API
        std::cout << env.documentStemCount( vec->stems()[i] ) << std::endl;
    }

    Note that in the general case, the size of positions will not be equal to the size of stems.
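    The difference between the two sizes can be sketched with a toy structure. This is an illustration under assumptions, not Indri's DocumentVector: ToyDocumentVector, sampleVector, and uniqueStems are hypothetical names, and the layout (one entry per unique stem with index 0 reserved for the out-of-vocabulary term, plus one position per token that indexes into the stem list) is assumed from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a document vector (not Indri's class).
// stems:     each unique stem once; index 0 is the OOV placeholder.
// positions: one entry per token in the document, an index into stems.
struct ToyDocumentVector {
    std::vector<std::string> stems;
    std::vector<int> positions;
};

// Toy document with four tokens: two map to "have", one to "dog",
// and one is out of vocabulary (index 0).
ToyDocumentVector sampleVector() {
    ToyDocumentVector vec;
    vec.stems.push_back("[OOV]");
    vec.stems.push_back("have");
    vec.stems.push_back("dog");
    const int tokens[] = {1, 2, 1, 0};
    vec.positions.assign(tokens, tokens + 4);
    return vec;
}

// Returns the unique stems, skipping the OOV placeholder at index 0.
// Looping to positions.size() instead would run past the end of stems.
std::vector<std::string> uniqueStems(const ToyDocumentVector& vec) {
    for (std::size_t p = 0; p < vec.positions.size(); ++p) {
        // every position must refer to a valid entry in stems
        assert(vec.positions[p] >= 0);
        assert(static_cast<std::size_t>(vec.positions[p]) < vec.stems.size());
    }
    std::vector<std::string> out;
    for (std::size_t i = 1; i < vec.stems.size(); ++i) {
        out.push_back(vec.stems[i]);
    }
    return out;
}
```

    Here the toy document has four positions but only three stem entries, so the loop bound has to come from the stem list.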

     
  • David Fisher

    David Fisher - 2012-10-19
    • status: pending --> wont-fix
     
