[htdig-dev] Re: stemming

On Wed, 11 Dec 2002 11:34, you wrote:

> > three-value flag:
> >   index unstemmed words, index stems, index both?
>
>     Sure.  Geoff wants a default of 'unstemmed words'
> (the current method), which I agree with.

I agree that compatibility should be the default.
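
For concreteness, the three-value flag could be sketched as
below.  This is only an illustration, not ht://Dig code: the
names `StemMode`, `terms_to_index` and `toy_stem` are all
hypothetical, and `toy_stem` is a crude suffix stripper that
stands in for whatever real stemmer would be used.

```python
from enum import Enum
from typing import List


class StemMode(Enum):
    """Hypothetical three-value indexing flag."""
    UNSTEMMED = "unstemmed"  # current behaviour, the proposed default
    STEMS = "stems"
    BOTH = "both"


def toy_stem(word: str) -> str:
    """Placeholder stemmer: strips a few common suffixes (illustration only)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def terms_to_index(word: str, mode: StemMode = StemMode.UNSTEMMED) -> List[str]:
    """Return the index terms for one word under the given mode."""
    if mode is StemMode.UNSTEMMED:
        return [word]
    stem = toy_stem(word)
    if mode is StemMode.STEMS:
        return [stem]
    # BOTH: keep the surface form and add the stem only if it differs.
    return [word] if stem == word else [word, stem]
```

With the default left at `UNSTEMMED`, an existing setup would
index exactly the words it indexes today.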

> > - The format you describe sounds like a "half-inverted"
> >   file -- listing locations *within* a document by
> > word, but listing *document* locations by document.  Is
> > that correct?
>
>     In the proposed index only the word+document are the
> 'key', the remaining parts are in the 'value'.  I'm not
> sure what you mean by 'document locations' here.. please
> clarify!

By "listing *document* locations by document", I meant that
the document ID was (part of) the key.

My underlying question was:  Why is the document ID part
of the key?  Isn't searching more efficient if you don't
need to scan through all documents for a word which occurs
in only 1% of them (as a good query term would)?  Is
there some operation which needs words to be looked up by
document?  (This was probably all discussed when the new
format was chosen.  Feel free to point me to an archive
instead of answering every question.)
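
The two key layouts being contrasted can be sketched as
follows.  This is a toy under stated assumptions, not the
actual ht://Dig format: the sample postings are invented and
`docs_for_word` is a hypothetical helper.  It also shows one
case in which a (word, document) key need not force a scan of
every document: if the keys are kept in sorted order, as a
B-tree would keep them, all entries for one word are adjacent
and can be reached by a range lookup on the word.  Whether the
proposed index is meant to work this way is an assumption here.

```python
from bisect import bisect_left

# Invented sample data: (word, doc_id) -> positions within that document.
postings = {
    ("apple", 1): [3, 17],
    ("apple", 4): [8],
    ("stem", 2): [5],
    ("stem", 4): [1, 9],
    ("zebra", 3): [2],
}

# Layout A: fully inverted, the key is the word alone.
by_word = {}
for (word, doc_id), positions in postings.items():
    by_word.setdefault(word, {})[doc_id] = positions

# Layout B: the key is (word, doc_id), held in sorted order.
sorted_keys = sorted(postings)


def docs_for_word(word):
    """Range lookup: visit only the keys that begin with `word`."""
    # The 1-tuple (word,) sorts just before every (word, doc_id) key.
    start = bisect_left(sorted_keys, (word,))
    docs = []
    for key_word, doc_id in sorted_keys[start:]:
        if key_word != word:
            break
        docs.append(doc_id)
    return docs
```

In this sketch `docs_for_word("stem")` touches only the two
"stem" keys, and it returns the same document IDs as
`sorted(by_word["stem"])` does under layout A.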

I don't quite understand what you said about morphological 
analysis, but I'll do some reading before asking too many 
questions :)

Thanks for your explanations.
Lachlan

-- 
Lachlan Andrew  Phone: +613 8344-3816 Fax: +613 8344-6678
Dept of Electrical and Electronic Engg		CRICOS Provider Code
University of Melbourne, Victoria, 3010  AUSTRALIA	00116K