From: Lachlan Andrew <lha@ee...> - 2002-12-11 01:53:21
On Wed, 11 Dec 2002 11:34, you wrote:
> > three-value flag:
> > index unstemmed words, index stems, index both?
> Sure. Geoff wants a default of 'unstemmed words'
> (the current method), which I agree with.
I agree that compatibility should be the default.
> > - The format you describe sounds like a "half-inverted"
> > file -- listing locations *within* a document by
> > word, but listing *document* locations by document. Is
> > that correct?
> In the proposed index only the word+document are the
> 'key', the remaining parts are in the 'value'. I'm not
> sure what you mean by 'document locations' here.. please
By "listing *document* locations by document", I meant
that the document ID was (part of) the key.
My question was: why is the document ID part
of the key? Isn't searching more efficient if you don't
need to scan through all documents for a word which
occurs in only 1% of them (as a good query term would)? Is
there some operation which needs words to be looked up by
document? (This was probably all discussed when the new
format was chosen. Feel free to point me to an archive
instead of answering every question.)
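
To make the question concrete, here is a minimal sketch of a
word+document key layout. Everything in it is hypothetical: the
sample postings, the name docs_for_word, and above all the
assumption that the store keeps its keys sorted with the word as
the leading component. Under that assumption a lookup by word is
a range scan that touches only the matching entries; whether the
proposed index actually behaves this way is not established here.

```python
from bisect import bisect_left

# Hypothetical postings: key = (word, doc_id), value = positions in that doc.
postings = {
    ("index", 1): [4, 17],
    ("index", 7): [2],
    ("stem", 3): [9],
    ("word", 1): [1],
    ("word", 3): [5, 6],
}

# Assumption: the store keeps keys in sorted order (as a B-tree would).
# A sorted list stands in for it here.
sorted_keys = sorted(postings)


def docs_for_word(word):
    """Return the doc IDs containing `word` using a range scan over sorted keys."""
    # (word,) sorts before every (word, doc_id), so this finds the first match.
    i = bisect_left(sorted_keys, (word,))
    docs = []
    while i < len(sorted_keys) and sorted_keys[i][0] == word:
        docs.append(sorted_keys[i][1])
        i += 1
    return docs
```

If the keys were unordered, or ordered by document first, the same
lookup would have to visit every document, which is the cost the
question above is worried about.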
I don't quite understand what you said about morphological
analysis, but I'll do some reading before asking too many
more questions.
Thanks for your explanations.
Lachlan Andrew  Phone: +613 8344-3816  Fax: +613 8344-6678
Dept of Electrical and Electronic Engg
University of Melbourne, Victoria, 3010 AUSTRALIA
CRICOS Provider Code 00116K