Re: [htdig-dev] stemming

> Usually, for this purpose, the stem part of words should be enough, as
> it is more powerful in representing the meaning of a word.

  Yes, but there is a tradeoff...

  Using stemming negatively impacts precision while improving
generalization.

  In the information retrieval research community there is
disagreement about the utility of stemming.

Here's the basic feeling:

  If the document index is LARGE, don't use stemming because it is
important to be precise so that you are not 'drinking from a firehose' by
getting back too many results.

  If the document index is SMALL, use stemming because the increased
word generalization helps avoid queries with no results.

  Most large internet search engines don't use stemming for this reason...
you would get back MANY more results with less precision.

> Do you have any exact statistics showing the difference in storage of
> the two indexes? As Lachlan say, storing both could be a bit extravagant
> and, again, in my opinion could lead out of the tracks. Also, the

  Currently our index stores one row per word-document-location.  As
discussed before, this is inefficient and we've got a plan to change it.
Moving to stemming without changing this would result in NO efficiency
improvement in WordDB.

  Once we change it to [word-document]/[loc1,loc2,...locn], we'll have
a significant WordDB efficiency savings.
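
  A rough sketch of that layout change, in Python rather than HtDig's C++.
The tuple shapes and the name group_locations are made up for illustration;
this is not the actual WordDB key format:

```python
from collections import defaultdict

def group_locations(rows):
    """Collapse (word, doc_id, location) rows into one row per (word, doc_id).

    Hypothetical helper: the real WordDB row layout is not modelled here.
    """
    grouped = defaultdict(list)
    for word, doc_id, location in rows:
        grouped[(word, doc_id)].append(location)
    # One entry per word-document pair, with its locations sorted.
    return {key: sorted(locs) for key, locs in grouped.items()}

# Four rows in the current one-row-per-location layout...
flat_rows = [
    ("traveling", 20, 24),
    ("traveling", 20, 36),
    ("traveling", 20, 110),
    ("traveled", 20, 200),
]
# ...become two rows in the grouped layout.
compact = group_locations(flat_rows)
```

  The saving comes from storing the word-document key once instead of once
per location.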

  Even after this change we will still have a row+ PER UNIQUE WORD.

  Stemming would provide an efficiency improvement by having one row PER
STEM.

  According to the literature, if you go with a stemmed index exclusively,
the index efficiency goes up by ABOUT 20-30%.  This estimate is very data
and language dependent.

  I research and implement this kind of stuff at work...  I'd be happy to
post links to a couple research papers if people are interested.

> problem is less important in an english language; I don't know any other
> languages except italian and latin languages, but they are for sure more
> complex, having different affix rules and lots more of different tenses.

  Here's the link to Martin Porter's (BSD Licensed) stemmers:

  http://snowball.tartarus.org/

  There are 10 languages supported.  Some languages are very difficult
to stem, such as Finnish... but in general stemming is beneficial for word
generalization.

-------

  I agree with Geoff in that we don't want to go with stemming
exclusively.

  Here's a proposal for 'intelligent stemming' in HtDig:

  1.  Fix index efficiency.
  2.  Add a configuration switch to disable stemming ;-)
  3.  Implement the stemming algorithm to ADD additional rows to the index
      with stemmed versions of the words (with a row flag to signify
      this).
  4.  During result ranking we rank the results with an algorithm like
      this:

      If num documents is LARGE
         unstemmed rows are 80%, stemmed rows are 20% of the 'score'

      If num documents is MEDIUM
         unstemmed rows are 60%, stemmed rows are 40% of the 'score'

      If num documents is SMALL
         unstemmed rows are 30%, stemmed rows are 70% of the 'score'

   These percentages are gut-feeling ball-park numbers based on my
experience and research on the topic.  The meaning of Large/Medium/Small
needs to be decided.  It also leans toward preferring unstemmed words
because of their higher 'precision'.
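
   The weighting above could be sketched like this in Python.  The two
thresholds are placeholders I invented so the code runs, since the meaning of
Large/Medium/Small is still undecided, and the function names are hypothetical:

```python
# Placeholder thresholds: NOT part of the proposal, which leaves
# Large/Medium/Small undefined.
LARGE_MIN_DOCS = 100_000
MEDIUM_MIN_DOCS = 10_000

def stem_weights(num_documents):
    """Return (unstemmed_weight, stemmed_weight) for the index size."""
    if num_documents >= LARGE_MIN_DOCS:
        return 0.8, 0.2   # LARGE: favour exact word matches
    if num_documents >= MEDIUM_MIN_DOCS:
        return 0.6, 0.4   # MEDIUM
    return 0.3, 0.7       # SMALL: favour stem matches to avoid empty results

def combined_score(unstemmed_score, stemmed_score, num_documents):
    """Blend the scores from unstemmed rows and stem-flagged rows."""
    w_exact, w_stem = stem_weights(num_documents)
    return w_exact * unstemmed_score + w_stem * stemmed_score
```

   How the two input scores are computed is left open here; the sketch only
shows the size-dependent blend.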

   This system does add duplicate rows in a sense to the index.

Here's an example, where five word forms all share the stem 'travel':

traveling -> travel
travel -> travel
travels -> travel
traveler -> travel
traveled -> travel

   Document    Word          StemFlag  Locations

   20          traveling     0         24  36 110
   20          travel        0         52  98 220
   20          travels       0         10  75 340
   20          traveler      0         13  180
   20          traveled      0         200
   20          travel        1         10 13 24 36 52 75 98 110 180 200 220 340

The last row is a kind of duplicate, and this impacts efficiency
negatively, but does get us some increased word generalization.
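
As a sketch of how that extra row could be built, here is a small Python
version.  The STEMS lookup is copied from the example above and stands in for
a real stemmer call; the row tuples and add_stem_rows are hypothetical, not
HtDig code:

```python
# Stem lookup taken from the example; a real build would call a stemmer
# (e.g. one of the Snowball stemmers) instead of a hard-coded table.
STEMS = {
    "traveling": "travel",
    "travel": "travel",
    "travels": "travel",
    "traveler": "travel",
    "traveled": "travel",
}

def add_stem_rows(rows):
    """rows: list of (doc_id, word, stem_flag, locations) with stem_flag 0.

    Returns the original rows followed by one stem_flag=1 row per
    (doc_id, stem), holding the merged, sorted locations.
    """
    merged = {}
    for doc_id, word, _flag, locations in rows:
        stem = STEMS.get(word, word)  # unknown words act as their own stem
        merged.setdefault((doc_id, stem), set()).update(locations)
    stem_rows = [
        (doc_id, stem, 1, sorted(locs))
        for (doc_id, stem), locs in sorted(merged.items())
    ]
    return rows + stem_rows

rows = [
    (20, "traveling", 0, [24, 36, 110]),
    (20, "travel", 0, [52, 98, 220]),
    (20, "travels", 0, [10, 75, 340]),
    (20, "traveler", 0, [13, 180]),
    (20, "traveled", 0, [200]),
]
indexed = add_stem_rows(rows)
```

The last element of indexed matches the StemFlag=1 row in the table above.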

-------

The other idea thrown around is to improve the 'fuzzy endings' algorithm.
I agree that this needs doing, and Porter's stemmers will give us
many ideas on how to do this.

Note however that this is an unproven and less studied technique in NLP &
IR circles, so we would be blazing some new ground... which tends to be a
slow process to get correct.

If we do both we can leave it up to users to 'tune' HtDig to their liking.
Flexibility is good.

I also don't support doing anything about stemming until we fix the index
(which I'm working on).  It will negatively impact the size too much for
large indexes.

FEEDBACK PLEASE!!

Thanks!

Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485