Re: [htdig-dev] stemming


Ciao guys,

> Indexing the stems is a good suggestion.  It would
> certainly give faster searching.  If it replaced the
> unstemmed inverted file then it would also save on storage
> requirements, but it would mean we couldn't search on the
> unstemmed version (if that is of concern).  Alternatively,
> indexing both the stemmed and unstemmed versions may be a
> bit extravagant...

IMHO, the ultimate goal of a search process is to get a set of documents
satisfying a semantic criterion, or better, a context criterion. If we
can do more in this direction, that'd be great.

Usually, for this purpose, the stem part of words should be enough, as
it is more powerful in representing the meaning of a word.

Do you have any exact statistics showing the difference in storage
between the two indexes? As Lachlan says, storing both could be a bit
extravagant and, again in my opinion, could lead us off track. Also, the
problem is less important in English; I don't know any languages other
than Italian and the other Latin languages, but they are certainly more
complex, with different affix rules and many more tenses.

You know better than me that this would lead us far away from the user's
first goal: searching for documents about 'something'.

As Geoff wisely suggests, though, we could implement 'Porter stemming'
as a separate fuzzy algorithm which builds a new index (a stemmed
one).
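
To make the trade-off concrete, here is a rough Python sketch of building
a stemmed index next to the unstemmed one. Everything in it is an
assumption for illustration: toy_stem() is a hypothetical suffix stripper
and NOT the Porter algorithm, and build_index() has nothing to do with
the actual ht://Dig 'word' module.

```python
from collections import defaultdict

# Hypothetical toy suffix list; a real implementation would use Porter's rules.
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def build_index(docs, key=lambda w: w):
    """Map key(word) -> set of document ids containing that word."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[key(word)].add(doc_id)
    return index

docs = {1: "indexing the stems", 2: "indexed documents", 3: "the index"}
plain = build_index(docs)                 # unstemmed inverted index
stemmed = build_index(docs, key=toy_stem) # alternative, stemmed index
```

With this sample data the stemmed index has fewer keys than the plain one,
and a lookup of "index" in it reaches all three documents, while the plain
index only finds the exact form.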

If you have any references or suggestions, I'd be happy to offer to code
it; that'd be a great chance for me to get into the 'word' module of the
new ht://Dig system.

Geoff, Gilles, Neal and Lachlan, I expect some news from you about this!
:-)

> I have also been wondering if it is possible to turn off
> word-level indexing, to give (much) smaller inverted files
> if people don't need phrase searching.  Does anybody know?
> That would be a compelling reason to store word attributes
> in a pure bit-map format, rather than using the more
> compact formats we were discussing recently.

I guess customisation is our goal. In a retrieval phase, we'd want to
store almost *anything* we can, then maybe with different fuzzy
algorithms build alternative indexes (smaller or bigger, depending on
users' settings).
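
As a rough illustration of the word-level question quoted above, this
Python sketch compares an index that keeps word positions with one that
only keeps document ids. The function names and sample data are invented
for the example and do not reflect how ht://Dig stores its inverted
files; the point is only that phrase matching needs the positions.

```python
from collections import defaultdict

def positional_index(docs):
    """word -> {doc_id: [positions]}; larger, but supports phrases."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index

def document_index(docs):
    """word -> {doc_ids}; smaller, but word order is lost."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def phrase_match(index, first, second):
    """Documents in which `second` directly follows `first`."""
    hits = set()
    for doc_id, positions in index.get(first, {}).items():
        following = index.get(second, {}).get(doc_id, [])
        if any(p + 1 in following for p in positions):
            hits.add(doc_id)
    return hits

docs = {1: "fuzzy search engine", 2: "search fuzzy"}
word_level = positional_index(docs)
doc_level = document_index(docs)
```

Here phrase_match(word_level, "fuzzy", "search") can tell document 1 from
document 2, whereas doc_level only knows that both contain both words.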

So ... I vote for an additional algorithm as Geoff suggests! :-)

Ciao ciao
-Gabriele
-- 
Gabriele Bartolini - Web Programmer
Comune di Prato - Prato - Tuscany - Italy
g.b...@co... | http://www.comune.prato.it
> find bin/laden -name osama -exec rm {} ;