From: Lachlan A. <lh...@ee...> - 2002-12-03 05:39:13

Greetings Geoff,

On Tue, 3 Dec 2002 16:01, Geoff Hutchison wrote:
> > I have also been wondering if it is possible to turn
> > off word-level indexing, to give (much) smaller
> > inverted files if people don't need phrase searching.
> > Does anybody know?
>
> Not at the moment.
>
> But you lose a lot more than phrase searching. You lose
> field-restricted searching. You lose scoring by proximity
> (like Google). You lose the ability to score "on the
> fly"--not to be discounted since many users wonder why
> they change their scoring factors and the results don't
> change.

Thanks for raising those points. These are all enhancements that
came with 3.2.0's database restructure, but I think that only phrase
searching actually *needs* word-level inverted files.

As I said, document-level indexing is a strong motivation for the
word attributes to be pure bitmaps. The index could store the "OR"
of each field set for any occurrence, so you could still say "If
this word occurs in the title AND that word occurs in a heading".
I agree that on-the-fly scoring is the way to go, but again I can't
see why it couldn't be done based on the OR of the flags (although
I could be missing something). Even (very coarse) proximity
searching can be done fairly efficiently by, for example, dividing
each document into eight regions and specifying (in one byte) which
regions contain the word.

I'm trying to avoid the "progress = bloat" phenomenon. Although I
don't want to change htDig://'s course, my original interest in it
was my aim of all Linux boxes having all their documentation
searchable. That is one application which requires a
minimal-overhead option, albeit with reduced performance.

If I get enthusiastic, I'll look at writing a patch...

Cheers,
Lachlan

-- 
Lachlan Andrew                          Phone: +613 8344-3816  Fax: +613 8344-6678
Dept of Electrical and Electronic Engg  CRICOS Provider Code
University of Melbourne, Victoria, 3010 AUSTRALIA       00116K
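
The two document-level ideas in the message above (OR-ing the field flags of every occurrence, and a one-byte map of which eighth of the document contains the word) can be sketched roughly as follows. This is only an illustration in Python, not ht://Dig code: the flag values, the `summarize`, `occurs_in` and `coarsely_near` helpers, and the "same or adjacent region" test for proximity are all assumptions made for the example.

```python
# Hypothetical field flags (assumed values, not ht://Dig's real ones).
FLAG_TEXT = 0x01
FLAG_TITLE = 0x02
FLAG_HEADING = 0x04

NUM_REGIONS = 8  # one bit per region, so the mask fits in one byte


def summarize(occurrences, doc_length):
    """Collapse word-level occurrences into one document-level record.

    occurrences: iterable of (position, flags), 0 <= position < doc_length.
    Returns (or_of_flags, region_mask).
    """
    flags = 0
    regions = 0
    for position, occ_flags in occurrences:
        flags |= occ_flags                      # OR of the field flags
        region = position * NUM_REGIONS // doc_length
        regions |= 1 << region                  # mark the region as occupied
    return flags, regions


def occurs_in(record, flag):
    """Field-restricted test: did the word ever occur in this field?"""
    return bool(record[0] & flag)


def coarsely_near(record_a, record_b):
    """Assumed proximity test: the words share a region or sit in
    adjacent regions."""
    a, b = record_a[1], record_b[1]
    widened = (a | (a << 1) | (a >> 1)) & 0xFF  # grow each region by one
    return bool(widened & b)


# Usage: a 800-word document with three indexed words.
linux = summarize([(5, FLAG_TITLE), (410, FLAG_TEXT)], 800)
kernel = summarize([(120, FLAG_HEADING)], 800)
zebra = summarize([(799, FLAG_TEXT)], 800)

title_and_heading = occurs_in(linux, FLAG_TITLE) and occurs_in(kernel, FLAG_HEADING)
```

A record like this keeps only two bytes per word per document, but it discards occurrence counts and exact positions, so frequency-based scoring and true phrase matching are not recoverable from it.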