From: Neal R. <ne...@ri...> - 2002-12-10 00:18:22
> - Given the flag to disable stemming, what is the
> disadvantage of simply making it a three-value flag:
> index unstemmed words, index stems, index both?
Sure. Geoff wants a default of 'unstemmed words' (the current
method), which I agree with.
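To make the flag concrete, something like this would do it (names are hypothetical, just to illustrate the three modes):

```python
from enum import Enum

class StemMode(Enum):
    """Hypothetical three-value indexing flag."""
    UNSTEMMED = "unstemmed"  # current behavior, proposed default
    STEMMED = "stemmed"      # index stems only
    BOTH = "both"            # index unstemmed words and stems

# Default stays on the current method, per Geoff's preference.
DEFAULT_MODE = StemMode.UNSTEMMED
```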
> - The format you describe sounds like a "half-inverted"
> file -- listing locations *within* a document by word, but
> listing *document* locations by document. Is that
> correct?
In the proposed index, only word+document form the 'key'; the
remaining fields go in the 'value'. I'm not sure what you mean by
'document locations' here.. please clarify!
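A rough sketch of the key/value split I have in mind (the field names are hypothetical, just to show the shape):

```python
# Sketch of the proposed postings layout:
# key = (word, doc_id); everything else about the occurrences is the 'value'.
index = {}

def add_occurrence(word, doc_id, position, flags=0):
    """Record one occurrence of `word` at `position` in document `doc_id`."""
    key = (word, doc_id)
    entry = index.setdefault(key, {"positions": [], "flags": 0})
    entry["positions"].append(position)
    entry["flags"] |= flags

add_occurrence("stemming", 42, 7)
add_occurrence("stemming", 42, 19)
# All occurrences of 'stemming' in doc 42 live under one key.
```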
> - You said that the approach currently taken by fuzzy
> endings is uncharted waters. I assume you are talking
> about the approach of simply creating a disjunction of
> the derived words. What is hard to "get right" about
> that? In terms of the documents returned, it sounds the
> same as what you have proposed. In terms of
> implementation, it sounds like what 'fuzzy endings' does
> now, except for fixing the stemming.
Our 'fuzzy endings' algorithm is in the class of "morphological
analysis" algorithms. These algorithms are frequently studied, and there
are many good packages, but most of them cost big money, are very complex
and very language-specific, took many years of research.. and are not
open source.
Morphological analysis is pretty cutting edge in NLP, and still mostly
unsolved.
What is hard to 'get right' is the general problem of generating
correct variants of a given word. The stemming algorithms are quite
complex, with many rules for generating stems. My gut feeling is
that the number of rules in the 'fuzzy endings' algorithm would need
to be on par with, or exceed, that of the same-language stemmer.
Two points in stemming's favor:
1. Stemming is a known quantity with known performance
2. We have 10+ languages available NOW
Morphological analysis is a promising approach, as it's possible to
generate better endings and avoid the situation you detail below. It
would take a lot of work, though, to make this algorithm match or exceed
the generalization ability of the stemmers.
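For contrast, here's a toy illustration of the two approaches (the rules below are placeholders; a real stemmer like Porter's has far more, applied in careful order):

```python
# Toy suffix-stripping stemmer: a handful of rules standing in for the
# dozens a real stemmer applies. Conflates variants at index AND query time.
RULES = [("iness", "y"), ("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")]

def toy_stem(word):
    for suffix, repl in RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)] + repl
    return word

# Toy 'fuzzy endings' generation: expands a query word into a disjunction
# of candidate variants instead of collapsing them to a stem.
ENDINGS = ["", "s", "ed", "ing"]  # placeholder generation rules

def expand(word):
    return {word + e for e in ENDINGS}

# Stemming: variants collapse to one index term.
# Expansion: one query term fans out to many variants.
```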
I would actually encourage whoever has worked on the
algorithm to consider writing an academic paper for publication on it.
There would need to be a comparative study done on it vs. stemming.. but if
you can show that it outperforms stemming in IR precision/recall,
great!
SEE BELOW FOR SOME REFERENCES!
The AI researcher in me wants to explore the fuzzy-endings algorithm.. the
conservative software engineer side wants to go with proven IR &
NLP techniques first.
> - With stemming in general, what is done about negating
> affixes? If I searched for 'mercy', I wouldn't want
> results about 'merciless' (although I would want results
> about 'merciful').
That is part of the downside of stemming. The hope is that indexing both
unstemmed and stemmed words, and combining them in the score, would get
the correct result most of the time.
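A minimal sketch of the kind of score blending I mean (the weights are made up): exact hits count for more than stem hits, so a 'mercy' document outranks a 'merciless' one even when the stems collide.

```python
# Hypothetical blend: exact (unstemmed) matches weighted above stem matches.
EXACT_WEIGHT = 1.0
STEM_WEIGHT = 0.4  # made-up weight

def blended_score(exact_hits, stem_hits):
    return EXACT_WEIGHT * exact_hits + STEM_WEIGHT * stem_hits

# Document containing 'mercy' twice: 2 exact hits, 2 stem hits.
doc_with_mercy = blended_score(exact_hits=2, stem_hits=2)
# Document containing only 'merciless': 0 exact hits, 1 stem hit.
doc_with_merciless = blended_score(exact_hits=0, stem_hits=1)
assert doc_with_mercy > doc_with_merciless
```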
Thanks!
REFERENCES:
David A. Hull
Stemming Algorithms A Case Study for Detailed Evaluation (1996)
http://citeseer.nj.nec.com/hull96stemming.html
David A. Hull, Gregory Grefenstette
A Detailed Analysis of English Stemming Algorithms (1996)
http://citeseer.nj.nec.com/hull96detailed.html
Wessel Kraaij
Viewing Stemming as Recall Enhancement (1996)
http://citeseer.nj.nec.com/kraaij96viewing.html
Wessel Kraaij, Rene Pohlmann
Using Linguistic Knowledge in Information Retrieval (1996)
http://citeseer.nj.nec.com/kraaij96using.html
There is also this frequently cited paper which I can't find on the web.
Harman, Donna (1991). How Effective is Suffixing? Journal of the American
Society for Information Science, 42(1), 7-15.
Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485