Re: [htdig-dev] stemming

On Fri, 2002-12-06 at 21:09, Neal Richter wrote:
>   According to the literature, if you go with a stemmed index exclusively,
> the index efficiency goes up by ABOUT 20-30%.  This estimate is very data
> and language dependent.
>
>   I research and implement this kind of stuff at work...  I'd be happy to
> post links to a couple research papers if people are interested.

Yes, please do Neal. I am really interested, and what you said so far is
interesting as well! :-)

>   Here's a proposal for 'intelligent stemming' in HtDig:
>
>   1.  Fix index efficiency.

Yep

>   2.  Add a configuration switch to disable stemming ;-)

Good.

>   3.  Implement the stemming algorithm to ADD additional rows to the index
> 	with stemmed versions of the words (with a row flag to signify
> 	this).

Perfect
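
To make sure I read point 3 correctly, here is a rough Python sketch of what I understand by "additional rows with a row flag". It is only an illustration, not ht://Dig code: IndexRow, toy_stem and index_rows are names I made up, and toy_stem is a naive suffix stripper standing in for a real stemming algorithm. The stemming argument plays the role of the switch from point 2.

```python
from typing import List, NamedTuple


class IndexRow(NamedTuple):
    """One hypothetical index row; `stemmed` is the row flag from point 3."""
    word: str
    doc_id: int
    stemmed: bool


def toy_stem(word: str) -> str:
    """Naive stand-in for a real stemmer (assumption, English suffixes only)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters of the word remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_rows(doc_id: int, words: List[str], stemming: bool = True) -> List[IndexRow]:
    """Emit one unstemmed row per word, plus a flagged row when the stem differs."""
    rows: List[IndexRow] = []
    for raw in words:
        word = raw.lower()
        rows.append(IndexRow(word, doc_id, False))
        if stemming:
            stem = toy_stem(word)
            if stem != word:
                rows.append(IndexRow(stem, doc_id, True))
    return rows
```

With stemming disabled, the index contains exactly the rows it has today, so the extra rows are purely additive.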

>   4.  During result ranking we rank the results with an algorithm like
>       this:
>
>       If num documents is LARGE
>          unstemmed rows are 80%, stemmed rows are 20% of the 'score'
>
>       If num documents is MEDIUM
>          unstemmed rows are 60%, stemmed rows are 40% of the 'score'
>
>       If num documents is SMALL
>          unstemmed rows are 30%, stemmed rows are 70% of the 'score'

I like it, although I think it would be good to let users set those
values themselves, for instance by choosing between a more general and a
more specific index.
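
A rough Python sketch of the weighting with overridable values, just to illustrate the idea. The 80/20, 60/40 and 30/70 splits are Neal's; everything else is my assumption: the function names, the way the two partial scores are combined (a simple weighted sum), and above all the thresholds separating SMALL, MEDIUM and LARGE, which the proposal does not define.

```python
from typing import Dict, Tuple

# (unstemmed weight, stemmed weight) per collection size, from the proposal.
DEFAULT_WEIGHTS: Dict[str, Tuple[float, float]] = {
    "large": (0.8, 0.2),
    "medium": (0.6, 0.4),
    "small": (0.3, 0.7),
}


def size_class(num_docs: int, medium_min: int = 10_000, large_min: int = 1_000_000) -> str:
    """Classify the collection; both thresholds are placeholder assumptions."""
    if num_docs >= large_min:
        return "large"
    if num_docs >= medium_min:
        return "medium"
    return "small"


def combined_score(
    unstemmed_score: float,
    stemmed_score: float,
    num_docs: int,
    weights: Dict[str, Tuple[float, float]] = DEFAULT_WEIGHTS,
) -> float:
    """Weighted sum of the scores from unstemmed and stemmed rows."""
    w_unstemmed, w_stemmed = weights[size_class(num_docs)]
    return w_unstemmed * unstemmed_score + w_stemmed * stemmed_score
```

An administrator who wants a more "general" index could pass a different weights table, for example one giving stemmed rows 50% even for large collections, without touching the ranking code.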

> I also don't support doing anything about stemming until we fix the index
> (which I'm working on).  It will negatively impact the size too much for
> large indexes.

I agree ... baby steps. :-)

Thanks for your message. Could you please point us to some references or
resources to read? I'd love that!

Ciao and thanks,
-Gabriele

-- 
Gabriele Bartolini - Web Programmer
Comune di Prato - Prato - Tuscany - Italy
g.b...@co... | http://www.comune.prato.it
> find bin/laden -name osama -exec rm {} ;