From: Neal R. <ne...@ri...> - 2002-12-06 20:12:32
> Usually, for this purpose, the stem part of words should be enough, as
> it is more powerful in representing the meaning of a word.

Yes, but there is a tradeoff... Using stemming negatively impacts
precision while improving generalization.

In the information retrieval research community there is disagreement
about the utility of stemming. Here's the basic feeling:

If the document index is LARGE, don't use stemming, because it is
important to be precise so that you are not 'drinking from a firehose'
by getting back too many results.

If the document index is SMALL, use stemming, because the increased
word generalization helps avoid queries with no results.

Most large internet search engines don't use stemming for this
reason... you would get back MANY more results with less precision.

> Do you have any exact statistics showing the difference in storage of
> the two indexes? As Lachlan say, storing both could be a bit extravagant
> and, again, in my opinion could lead out of the tracks. Also, the

Currently our index stores one row per word-document-location. As
discussed before, this is inefficient and we've got a plan to change
it. Moving to stemming without changing this would result in NO
efficiency improvement in WordDB.

At the point we change it to be
    [word-document]/[loc1,loc2,...locn]
we'll have a significant WordDB efficiency savings.

Even after this change we will still have at least one row PER UNIQUE
WORD. Stemming would provide an efficiency improvement by having one
row PER STEM.

According to the literature, if you go with a stemmed index
exclusively, the index efficiency goes up by ABOUT 20-30%. This
estimate is very data and language dependent.

I research and implement this kind of stuff at work... I'd be happy to
post links to a couple of research papers if people are interested.
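
To make the three row layouts concrete, here is a minimal Python sketch
that counts rows for each one. It is only an illustration: the tuple
layout, the `group_rows` helper and the `TOY_STEMS` lookup are
assumptions made up for this example, not HtDig's actual WordDB schema
or a real stemmer. The sample postings are the 'travel' words used in
the example later in this message.

```python
from collections import defaultdict

# Sample postings as (word, document_id, location), borrowed from the
# 'travel' example in this message. NOT HtDig's real WordDB schema.
POSTINGS = [
    ("traveling", 20, 24), ("traveling", 20, 36), ("traveling", 20, 110),
    ("travel", 20, 52), ("travel", 20, 98), ("travel", 20, 220),
    ("travels", 20, 10), ("travels", 20, 75), ("travels", 20, 340),
    ("traveler", 20, 13), ("traveler", 20, 180),
    ("traveled", 20, 200),
]

# Toy lookup standing in for a real stemmer (an assumption, not
# Porter's algorithm).
TOY_STEMS = {
    w: "travel"
    for w in ("traveling", "travel", "travels", "traveler", "traveled")
}


def group_rows(postings, key=lambda word: word):
    """Group locations into one row per (key(word), document)."""
    rows = defaultdict(list)
    for word, doc, loc in postings:
        rows[(key(word), doc)].append(loc)
    return {k: sorted(locs) for k, locs in rows.items()}


# Current layout: one row per word-document-location.
rows_per_location = len(POSTINGS)
# Planned layout: [word-document]/[loc1,loc2,...locn].
rows_per_word = len(group_rows(POSTINGS))
# Stem-only index: one row per stem-document.
rows_per_stem = len(group_rows(POSTINGS, TOY_STEMS.get))

print(rows_per_location, rows_per_word, rows_per_stem)
```

On this toy data the counts are 12, 5 and 1 rows. The reduction is
exaggerated because every sample word shares one stem; it says nothing
about the 20-30% figure, which depends on real data and language.
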
> problem is less important in an english language; I don't know any other
> languages except italian and latin languages, but they are for sure more
> complex, having different affix rules and lots more of different tenses.

Here's the link to Martin Porter's (BSD Licensed) stemmers:
http://snowball.tartarus.org/

There are 10 languages supported. Some languages are very difficult to
stem, such as Finnish... but in general stemming is beneficial for word
generalization.

-------

I agree with Geoff in that we don't want to go with stemming
exclusively.

Here's a proposal for 'intelligent stemming' in HtDig:

1. Fix index efficiency.
2. Add a configuration switch to disable stemming ;-)
3. Implement the stemming algorithm to ADD additional rows to the index
   with stemmed versions of the words (with a row flag to signify this).
4. During result ranking we rank the results with an algorithm like this:

   If num documents is LARGE
      unstemmed rows are 80%, stemmed rows are 20% of the 'score'
   If num documents is MEDIUM
      unstemmed rows are 60%, stemmed rows are 40% of the 'score'
   If num documents is SMALL
      unstemmed rows are 30%, stemmed rows are 70% of the 'score'

These percentages are gut-feeling ball-park numbers based on my
experience and research on the topic. The meaning of Large/Medium/Small
needs to be decided. The weighting also leans toward preferring
unstemmed words because of their higher 'precision'.

This system does add duplicate rows, in a sense, to the index. Here's
an example:

traveling -> travel
travel    -> travel
travels   -> travel
traveler  -> travel
traveled  -> travel

Document  Word       StemFlag  Locations
20        traveling  0         24 36 110
20        travel     0         52 98 220
20        travels    0         10 75 340
20        traveler   0         13 180
20        traveled   0         200
20        travel     1         10 13 24 36 52 75 98 110 180 200 220 340

The last row is a kind of duplicate, and this impacts efficiency
negatively, but does get us some increased word generalization.

-------

The other idea thrown around is to improve the 'fuzzy endings' algorithm.
I agree that this needs doing, and Porter's stemmers will give us many
ideas on how to do this. Note however that this is an unproven and less
studied technique in NLP & IR circles, so we would be blazing some new
ground... which tends to be a slow process to get correct.

If we do both we can leave it up to users to 'tune' HtDig to their
liking. Flexibility is good.

I also don't support doing anything about stemming until we fix the
index (which I'm working on). It will negatively impact the size too
much for large indexes.

FEEDBACK PLEASE!!

Thanks!

Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485
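
P.S. A rough Python sketch of steps 3 and 4 of the 'intelligent
stemming' proposal above, in case it helps the discussion. Everything
here is hypothetical: the row tuples, `add_stem_rows`, `stem_weights`
and `blended_score` are names invented for illustration, and the
`small_max` / `large_min` thresholds are placeholders, since the
meaning of Large/Medium/Small still needs to be decided. The stemmer is
a toy lookup, not Porter's algorithm.

```python
# Rows as (document, word, stem_flag, locations), copied from the
# example table in the proposal. Not HtDig's real row format.
ROWS = [
    (20, "traveling", 0, [24, 36, 110]),
    (20, "travel", 0, [52, 98, 220]),
    (20, "travels", 0, [10, 75, 340]),
    (20, "traveler", 0, [13, 180]),
    (20, "traveled", 0, [200]),
]

# Toy stand-in for a real stemmer (assumption for this sketch only).
TOY_STEMS = {word: "travel" for _, word, _, _ in ROWS}


def add_stem_rows(rows, stem_of):
    """Step 3: keep the unstemmed rows and ADD one flagged row per
    (document, stem) holding the merged, sorted locations."""
    merged = {}
    for doc, word, flag, locs in rows:
        if flag == 0:  # only unstemmed rows feed the stem rows
            merged.setdefault((doc, stem_of(word)), []).extend(locs)
    stem_rows = [
        (doc, stem, 1, sorted(locs)) for (doc, stem), locs in merged.items()
    ]
    return list(rows) + stem_rows


def stem_weights(num_documents, small_max=10_000, large_min=1_000_000):
    """Step 4: return (unstemmed_weight, stemmed_weight).
    The thresholds are placeholder guesses, not agreed values."""
    if num_documents >= large_min:   # LARGE
        return 0.8, 0.2
    if num_documents > small_max:    # MEDIUM
        return 0.6, 0.4
    return 0.3, 0.7                  # SMALL


def blended_score(unstemmed_score, stemmed_score, num_documents):
    """Combine the two partial scores using the size-dependent weights."""
    w_unstemmed, w_stemmed = stem_weights(num_documents)
    return w_unstemmed * unstemmed_score + w_stemmed * stemmed_score


indexed = add_stem_rows(ROWS, TOY_STEMS.get)
print(indexed[-1])  # the added StemFlag=1 row for 'travel'
```

The added row carries the same twelve locations as the last row of the
example table. A real implementation would also need to decide how the
per-row scores themselves are computed, which this sketch leaves open.
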