From: Lachlan A. <lh...@ee...> - 2002-12-09 01:15:27
Greetings Neal,

Your suggestion sounds good, especially steps 1 and 2... I have some beginner's questions:

- Given the flag to disable stemming, what is the disadvantage of simply making it a three-value flag: index unstemmed words, index stems, index both?

- The format you describe sounds like a "half-inverted" file -- listing locations *within* a document by word, but listing *document* locations by document. Is that correct?

- You said that the approach currently taken by fuzzy endings is uncharted waters. I assume you are talking about the approach of simply creating a disjunction of the derived words. What is hard to "get right" about that? In terms of the documents returned, it sounds the same as what you have proposed. In terms of implementation, it sounds like what 'fuzzy endings' does now, except for fixing the stemming.

- With stemming in general, what is done about negating affixes? If I searched for 'mercy', I wouldn't want results about 'merciless' (although I would want results about 'merciful').

Thanks!
Lachlan

On Sat, 7 Dec 2002 07:09, Neal Richter wrote:
> I agree with Geoff in that we don't want to go with
> stemming exclusively..
> Here's a proposal for 'intelligent stemming' in HtDig:
>
> 1. Fix index efficiency.
> 2. Add a configuration switch to disable stemming ;-)
> 3. Implement the stemming algorithm to ADD additional
>    rows to the index with stemmed versions of the words
>    (with a row flag to signify this).
>    This system does add duplicate rows in a sense to the
>    index.
>
>    traveling -> travel
>    travel    -> travel
>    travels   -> travel
>    traveler  -> travel
>    traveled  -> travel
>
>    Document  Word       StemFlag  Locations
>
>    20        traveling  0         24 36 110
>    20        travel     0         52 98 220
>    20        travels    0         10 75 340
>    20        traveler   0         13 180
>    20        traveled   0         200
>    20        travel     1         10 13 24 36 52 75
>                                   98 110 180 200 220 340
>
> FEEDBACK PLEASE!!
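
For concreteness, the row layout in the quoted proposal could be sketched roughly as follows. This is only an illustration in Python, not ht://Dig code: STEMS, build_rows and the stem_mode argument are hypothetical names, the stem table is just the five words from the example, and stem_mode models the three-value flag asked about in the first question above.

```python
from collections import defaultdict

# Toy stem table copied from the example in the quoted proposal.
# Assumption: a real stemming algorithm would replace this lookup.
STEMS = {
    "traveling": "travel",
    "travel": "travel",
    "travels": "travel",
    "traveler": "travel",
    "traveled": "travel",
}


def build_rows(doc_id, word_locations, stem_mode="both"):
    """Return index rows as (doc_id, word, stem_flag, locations).

    stem_mode is a hypothetical three-value switch:
    "unstemmed", "stemmed" or "both".
    """
    if stem_mode not in ("unstemmed", "stemmed", "both"):
        raise ValueError("unknown stem_mode: %r" % (stem_mode,))

    rows = []

    # One row per surface form, flagged 0 (not stemmed).
    if stem_mode in ("unstemmed", "both"):
        for word, locs in word_locations.items():
            rows.append((doc_id, word, 0, sorted(locs)))

    # One extra row per stem, flagged 1, holding the merged locations
    # of every word in the document that reduces to that stem.
    if stem_mode in ("stemmed", "both"):
        merged = defaultdict(list)
        for word, locs in word_locations.items():
            merged[STEMS.get(word, word)].extend(locs)
        for stem, locs in merged.items():
            rows.append((doc_id, stem, 1, sorted(locs)))

    return rows


# Word locations for document 20, taken from the table in the proposal.
DOC_20 = {
    "traveling": [24, 36, 110],
    "travel": [52, 98, 220],
    "travels": [10, 75, 340],
    "traveler": [13, 180],
    "traveled": [200],
}

for row in build_rows(20, DOC_20):
    print(row)
```

With stem_mode="both" this yields the five unflagged rows plus one flagged 'travel' row whose locations are the sorted union of the others, which is where the duplication mentioned in step 3 comes from.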
-- 
Lachlan Andrew  Phone: +613 8344-3816  Fax: +613 8344-6678
Dept of Electrical and Electronic Engg    CRICOS Provider Code
University of Melbourne, Victoria, 3010 AUSTRALIA       00116K