#188 Repository::processTerm breaks idempotency of k-stem

Milestone: v5.x

Status: wont-fix

Owner: David Fisher

Labels: Indri

Priority: 5

Updated: 2012-10-19

Created: 2012-05-17

Creator: Anonymous

Private: No

If a word is k-stemmed into a term that is in the stopword list, then processTerm(word) != processTerm(processTerm(stemmedword)). This leads to unexpected results in one specific case:

If a document contains the word "having", but "have" is in the index's stopword list, the following code yields unexpected results.

// get the document vector for docno 1 (which has the word "having" in it)
std::vector<lemur::api::DOCID_T> docIDs(1, 1);
std::vector<indri::api::DocumentVector*> vectors = env.documentVectors(docIDs);
indri::api::DocumentVector* vec = vectors[0];

// iterate through stems in doc vector and print the document frequency of each
for (size_t i = 0; i < vec->positions().size(); i++) {
    std::cout << env.documentCount(vec->stems()[i]) << std::endl;
}

In the above code, when you reach the stem "have" (the stemmed version of "having"), 0 is output, because processTerm treats "have" as a stopword. Perhaps processTerm should first check whether there is an inverted list for the term before filtering out stopwords, or not stop the term at all?
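
To make the failure mode concrete, here is a minimal standalone sketch that does not use Indri at all. It assumes a pipeline that stops first and then stems, which is consistent with the behaviour reported above but is not taken from the Repository source. The names toyStem, toyProcessTerm, kStopwords and kToyStems are hypothetical stand-ins for the Krovetz stemmer, processTerm, and the index's stopword list.

```cpp
#include <map>
#include <set>
#include <string>

// Hypothetical stand-ins for the index's stopword list and the Krovetz stemmer.
static const std::set<std::string> kStopwords = {"have"};
static const std::map<std::string, std::string> kToyStems = {{"having", "have"}};

// Toy stemmer: looks the word up in a fixed table, otherwise returns it unchanged.
std::string toyStem(const std::string& word) {
    std::map<std::string, std::string>::const_iterator it = kToyStems.find(word);
    return it == kToyStems.end() ? word : it->second;
}

// Toy processTerm (an assumption, not Indri's code): drop stopwords first,
// then stem whatever survives. A stopped term is returned as an empty string.
std::string toyProcessTerm(const std::string& word) {
    if (kStopwords.count(word) > 0) {
        return "";
    }
    return toyStem(word);
}
```

With these definitions, toyProcessTerm("having") gives "have", while toyProcessTerm(toyProcessTerm("having")) gives an empty string, because the second pass sees "have" and stops it. That is the same asymmetry as looking up the already-stemmed "have" through an API that runs the term through processing again.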

Attached is a small document that can be used with the parameter file below to create an index that exhibits this behaviour. Another way to see the behaviour without writing code is to run "dumpindex bad_kstem_index tp have" (in contrast to "dumpindex bad_kstem_index tp having").

I am running the most recent version of indri, 5.2.

<parameters>
<index>/home/yubink/bad_kstem</index>
<corpus>
<path>/home/yubink/bad_kstem.txt</path>
<class>trectext</class>
</corpus>
<field> <name>TEXT</name> </field>
<stemmer><name>Krovetz</name></stemmer>
<stopper>
<word>have</word>
</stopper>
</parameters>

Discussion

Anonymous - 2012-05-17

bad_kstem.txt

Anonymous - 2012-05-17

Sorry, processTerm(word) != processTerm(processTerm(stemmedword)) should be processTerm(word) != processTerm(processTerm(word)).

Last edit: Anonymous 2013-09-08

David Fisher - 2012-05-17

1) In general, processTerm(t) != processTerm(processTerm(t)), nor should the two be equal, both for the reason you cite above and because the stemmer may well stem a stem to a new stem.

2) When using stemmed terms, such as those obtained from a DocumentVector, you should use the stemmed-count APIs documentStemCount, stemFieldCount, and stemCount, which are intended for use with terms that have already been stemmed and stopped. For example:

INT64 indri::api::QueryEnvironment::documentStemCount( const std::string& term )

Your code above should be:

// get the document vector for docno 1 (which has the word "having" in it)
std::vector<lemur::api::DOCID_T> docIDs(1, 1);
std::vector<indri::api::DocumentVector*> vectors = env.documentVectors(docIDs);
indri::api::DocumentVector* vec = vectors[0];

// iterate through stems in doc vector and print the document frequency of each
for (size_t i = 1; i < vec->stems().size(); i++) { // skip the OOV term at index 0
    std::cout << env.documentStemCount(vec->stems()[i]) << std::endl;
}

Note that in the general case, the size of positions will not be equal to the size of stems.
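
As a rough illustration of why the two sizes differ, the sketch below models the layout with a toy struct. ToyDocumentVector, tokenStems and makeExample are hypothetical names, not part of the Indri API. The layout is an assumption based on the comment above: stems holds each unique stem once, with an out-of-vocabulary placeholder at index 0, and positions holds one entry per token, each an index into stems.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy model of a document vector (hypothetical, not the Indri class).
struct ToyDocumentVector {
    std::vector<std::string> stems;  // unique stems; index 0 is the OOV placeholder
    std::vector<int> positions;      // one entry per token: an index into stems
};

// Rebuild the per-token stem sequence by following each position into stems.
std::vector<std::string> tokenStems(const ToyDocumentVector& vec) {
    std::vector<std::string> out;
    out.reserve(vec.positions.size());
    for (std::size_t i = 0; i < vec.positions.size(); ++i) {
        out.push_back(vec.stems[vec.positions[i]]);
    }
    return out;
}

// Example document with five tokens but only three distinct stems:
// "having fun having cake having" -> have fun have cake have
ToyDocumentVector makeExample() {
    ToyDocumentVector vec;
    vec.stems = {"[OOV]", "have", "fun", "cake"};
    vec.positions = {1, 2, 1, 3, 1};
    return vec;
}
```

In this example stems has 4 entries and positions has 5, so a loop bounded by positions().size() but indexing stems()[i] directly would run past the end of stems. Looping over stems from index 1 visits each distinct stem once, while going through positions visits every token occurrence.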

David Fisher - 2012-10-19

status: pending --> wont-fix