From: Lachlan A. <lh...@ee...> - 2002-12-11 01:53:21
On Wed, 11 Dec 2002 11:34, you wrote:

> > three-value flag:
> > index unstemmed words, index stems, index both?
>
> Sure. Geoff wants a default of 'unstemmed words'
> (the current method), which I agree with.

I agree that compatibility should be the default.

> > - The format you describe sounds like a "half-inverted"
> > file -- listing locations *within* a document by
> > word, but listing *document* locations by document. Is
> > that correct?
>
> In the proposed index only the word+document are the
> 'key', the remaining parts are in the 'value'. I'm not
> sure what you mean by 'document locations' here.. please
> clarify!

By "listing *document* locations by document", I essentially meant that
the document ID was (part of) the key. Essentially my question was:
Why is the document ID part of the key? Isn't searching more efficient
if you don't need to scan through all documents for a word which only
occurs in 1% of them (which a good query term would)? Is there some
operation which needs words to be looked up by document?

(This was probably all discussed when the new format was chosen. Feel
free to point me to an archive instead of answering every question.)

I don't quite understand what you said about morphological analysis,
but I'll do some reading before asking too many questions :)

Thanks for your explanations.
Lachlan

--
Lachlan Andrew  Phone: +613 8344-3816  Fax: +613 8344-6678
Dept of Electrical and Electronic Engg    CRICOS Provider Code
University of Melbourne, Victoria, 3010 AUSTRALIA       00116K
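
To make the key-layout question in the message above concrete, here is a minimal Python sketch of the two layouts being contrasted. It is not the actual index format under discussion: the toy postings, the names `compound`, `docs_for_word_sorted` and `word_keyed`, and the assumption that compound keys are stored in sorted order are all hypothetical and chosen only for illustration. Under that sorted-key assumption, a (word, document) key can still be range-scanned by word; whether the real index behaves this way is exactly what the message asks.

```python
from bisect import bisect_left

# Hypothetical toy postings: (word, doc_id) -> positions of the word in that document.
# This mirrors "word+document is the key, the rest is the value".
compound = {
    ("apple", 1): [3, 17],
    ("apple", 4): [8],
    ("pear", 2): [5],
    ("pear", 4): [1, 9],
}

def docs_for_word_sorted(sorted_keys, word):
    """Return doc IDs for `word` by range-scanning sorted (word, doc_id) keys.

    Assumes the keys are kept in sorted order, so only the entries
    for `word` are visited rather than every document.
    """
    # (word,) sorts immediately before any (word, doc_id) tuple.
    i = bisect_left(sorted_keys, (word,))
    docs = []
    while i < len(sorted_keys) and sorted_keys[i][0] == word:
        docs.append(sorted_keys[i][1])
        i += 1
    return docs

def word_keyed(postings):
    """Regroup into the alternative layout: word -> {doc_id: positions}."""
    out = {}
    for (word, doc_id), positions in postings.items():
        out.setdefault(word, {})[doc_id] = positions
    return out

sorted_keys = sorted(compound)
print(docs_for_word_sorted(sorted_keys, "pear"))
print(word_keyed(compound)["apple"])
```

With a word-only key, all documents for a word arrive in one lookup but the value grows with the number of matching documents; with a compound key, each entry stays small, at the cost of a scan over adjacent keys.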