From: Gabriele B. <g.b...@co...> - 2002-12-04 08:12:16
Ciao guys,

> Indexing the stems is a good suggestion. It would
> certainly give faster searching. If it replaced the
> unstemmed inverted file then it would also save on storage
> requirements, but it would mean we couldn't search on the
> unstemmed version (if that is of concern). Alternatively,
> indexing both the stemmed and unstemmed versions may be a
> bit extravagant...

IMHO, the ultimate goal of a search process is to get a set of documents satisfying a semantic criterion or, better, a contextual one. If we can do more in this direction, that'd be great. Usually, for this purpose, the stem of a word should be enough, as it is more powerful in representing the meaning of the word.

Do you have any exact statistics showing the difference in storage between the two indexes? As Lachlan says, storing both could be a bit extravagant and, again in my opinion, could lead us off track.

Also, the problem is less important in English; I don't know any languages other than Italian and the other Latin languages, but they are certainly more complex, having different affix rules and many more tenses. You know better than I do that this would lead us far away from the user's first goal: searching for documents about 'something'.

As Geoff wisely suggests, though, we could implement a different fuzzy algorithm for 'Porter stemming' which builds a new index (a stemmed one). If you have some references and suggestions, I'd be happy to offer to code it; that'd be a great chance for me to get into the 'word' module of the new ht://Dig system. Geoff, Gilles, Neal and Lachlan, I expect some news from you about this! :-)

> I have also been wondering if it is possible to turn off
> word-level indexing, to give (much) smaller inverted files
> if people don't need phrase searching. Does anybody know?
> That would be a compelling reason to store word attributes
> in a pure bit-map format, rather than using the more
> compact formats we were discussing recently.

I guess customisation is our goal. In a retrieval phase, we'd want to store almost *anything* we can, then maybe with different fuzzy algorithms build alternative indexes (smaller or bigger, depending on users' settings).

So ... I vote for an additional algorithm as Geoff suggests! :-)

Ciao ciao
-Gabriele
--
Gabriele Bartolini - Web Programmer
Comune di Prato - Prato - Tuscany - Italy
g.b...@co... | http://www.comune.prato.it
> find bin/laden -name osama -exec rm {} ;
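
As a rough illustration of the trade-offs discussed in this message (stemmed versus unstemmed inverted files, and word-level versus document-level indexing), here is a minimal Python sketch. It is not ht://Dig code: `toy_stem` is a crude suffix stripper standing in for a real Porter stemmer, and `build_index`, `DOCS` and the sample sentences are hypothetical names and data invented for the example.

```python
from collections import defaultdict


def toy_stem(word):
    """Crude suffix stripper; an assumed stand-in for a real Porter stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters of the word remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs, normalize=lambda w: w, positions=True):
    """Build an inverted index from {doc_id: text}.

    With positions=True each term maps to {doc_id: [word positions]}
    (word-level indexing, which phrase searching needs).
    With positions=False each term maps to a set of doc_ids
    (document-level indexing, which is smaller).
    """
    index = defaultdict(dict) if positions else defaultdict(set)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            term = normalize(word)
            if positions:
                index[term].setdefault(doc_id, []).append(pos)
            else:
                index[term].add(doc_id)
    return dict(index)


# Hypothetical sample documents, made up for this sketch.
DOCS = {
    1: "indexing the stems gives faster searching",
    2: "the index stores the stemmed words",
    3: "searched documents are indexed",
}

unstemmed = build_index(DOCS)
stemmed = build_index(DOCS, normalize=toy_stem)
doc_level = build_index(DOCS, normalize=toy_stem, positions=False)

print("unstemmed terms:", len(unstemmed))
print("stemmed terms:  ", len(stemmed))
print("documents matching the stem 'search':", sorted(stemmed["search"]))
```

In this sketch the stemmed index holds fewer distinct terms because "searching" and "searched" collapse into one entry, but the exact word forms can no longer be looked up in it. The document-level variant drops word positions entirely, which is the saving the quoted question about turning off word-level indexing is after, at the cost of phrase searching.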