From: Gabriele B. <g.b...@co...> - 2002-12-09 07:50:03
On Fri, 2002-12-06 at 21:09, Neal Richter wrote:
> According to the literature, if you go with a stemmed index exclusively,
> the index efficiency goes up by ABOUT 20-30%. This estimate is very data
> and language dependent.
>
> I research and implement this kind of stuff at work... I'd be happy to
> post links to a couple research papers if people are interested.

Yes, please do, Neal. I am really interested, and what you have said so far
is interesting as well! :-)

> Here's a proposal for 'intelligent stemming' in HtDig:
>
> 1. Fix index efficiency.

Yep.

> 2. Add a configuration switch to disable stemming ;-)

Good.

> 3. Implement the stemming algorithm to ADD additional rows to the index
>    with stemmed versions of the words (with a row flag to signify
>    this).

Perfect.

> 4. During result ranking we rank the results with an algorithm like
>    this:
>
>    If num documents is LARGE
>       unstemmed rows are 80%, stemmed rows are 20% of the 'score'
>
>    If num documents is MEDIUM
>       unstemmed rows are 60%, stemmed rows are 40% of the 'score'
>
>    If num documents is SMALL
>       unstemmed rows are 30%, stemmed rows are 70% of the 'score'

I like it, though I think it would be good to give users a way to set those
values themselves, for example by choosing a more general or a more
specific index.

> I also don't support doing anything about stemming until we fix the index
> (which I'm working on). It will negatively impact the size too much for
> large indexes.

I agree ... Baby steps. :-)

Thanks for your message. Could you please point us to some references or
resources to read? I'd love that!

Ciao and thanks,
-Gabriele
-- 
Gabriele Bartolini - Web Programmer
Comune di Prato - Prato - Tuscany - Italy
g.b...@co... | http://www.comune.prato.it
> find bin/laden -name osama -exec rm {} ;
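
For illustration, the weighting in step 4 of the quoted proposal, together with the suggestion to let users override the values, might look roughly like the sketch below. It is written in Python rather than HtDig's own code, and it is not taken from HtDig. The thread never says what counts as LARGE, MEDIUM or SMALL, so the two cut-offs are made-up placeholders, and every name (`size_class`, `blended_score`, `DEFAULT_WEIGHTS`) is hypothetical.

```python
# Hypothetical cut-offs: the thread does not define LARGE, MEDIUM or SMALL.
LARGE_MIN_DOCS = 100_000
MEDIUM_MIN_DOCS = 10_000

# (unstemmed weight, stemmed weight) per collection size, as in the proposal.
DEFAULT_WEIGHTS = {
    "large": (0.8, 0.2),
    "medium": (0.6, 0.4),
    "small": (0.3, 0.7),
}


def size_class(num_docs):
    """Map a document count to 'large', 'medium' or 'small'."""
    if num_docs >= LARGE_MIN_DOCS:
        return "large"
    if num_docs >= MEDIUM_MIN_DOCS:
        return "medium"
    return "small"


def blended_score(unstemmed_score, stemmed_score, num_docs, weights=None):
    """Combine the scores from unstemmed and stemmed index rows.

    `weights` lets a user supply their own table in place of the defaults.
    """
    table = DEFAULT_WEIGHTS if weights is None else weights
    w_unstemmed, w_stemmed = table[size_class(num_docs)]
    return w_unstemmed * unstemmed_score + w_stemmed * stemmed_score
```

Passing a custom `weights` table, for example one that gives stemmed rows more influence on a large collection, covers the user-configurable case without changing the ranking code itself.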