[htdig-dev] Re: Several questions...


> That could have its own problems.  If they are labelled -1, -2, ...
> then phrase searching would have to match *backwards* for negative
> numbers.  Then if true positions overflowed into negative numbers,
> ...very negative number, then it is essentially starting from a very
> large (unsigned) location.  Thoughts?

It's pretty easy to come up with an n-bit integer that should be long 
enough for practical purposes. 2^16 = 65,536, which is probably still a 
bit too small for the maximum number of words in a document. But 2^24 
gives us 16,777,216 locations, a good 16 million words, which is good 
enough for War and Peace. (I'm checking at the moment.)
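
For anyone who wants to play with the numbers, here is a rough sketch of the arithmetic. The helper names are made up for illustration and are not part of ht://Dig, and the 600,000-word figure is only an assumed ballpark for a very long novel, not a measured count.

```python
# Hypothetical helpers (not ht://Dig code): how many word locations
# fit in an n-bit unsigned field, and is that enough for a document?

def max_locations(bits):
    """Number of distinct locations an n-bit unsigned integer can hold."""
    return 2 ** bits

def fits(word_count, bits):
    """True if every word in the document can get its own location."""
    return word_count <= max_locations(bits)

# Assumed ballpark for a very long novel; not a measured value.
LONG_NOVEL_WORDS = 600_000

for bits in (16, 24, 32):
    print(bits, max_locations(bits), fits(LONG_NOVEL_WORDS, bits))
```

Under that assumption, 16 bits falls short while 24 bits leaves plenty of headroom.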

> Regarding flexibility, we could make  htsearch  treat words separated
> by "invalid" punctuation (but no spaces) as a phrase, and make the
> default  valid_punctuation  empty.  That way people who want the
> current functionality can have it (except queries where words are not
> separated by spaces but *should* match those words separately?) but
> the default would be less buggy for phrase searches.

Sounds sensible to me--but I think we need more than one or two voices 
on this. But just to make sure I'm clear on what you want to do...

status-quo -> status (location 0) + quo (location 1)

And there's no entry for "statusquo"
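
To spell out what I mean, here is a toy model of the splitting. It is only a sketch under my reading of the proposal, not the actual htdig parser: the function name is hypothetical, and the valid_punctuation parameter merely mimics the config attribute of the same name by dropping those characters without ending the word.

```python
# Toy model (not the real htdig parser): split text into
# (word, location) pairs. Characters in valid_punctuation are
# dropped without ending the word; anything else that is not
# alphanumeric ends the current word.

def index_words(text, valid_punctuation=""):
    words = []
    current = []
    for ch in text:
        if ch.isalnum():
            current.append(ch)
        elif ch in valid_punctuation:
            continue  # dropped; the word keeps going
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    # Locations are simply the word's position in the document.
    return [(word.lower(), loc) for loc, word in enumerate(words)]

# Proposed default (empty valid_punctuation): two entries.
print(index_words("status-quo"))
# With "-" treated as valid punctuation: one run-together entry.
print(index_words("status-quo", valid_punctuation="-"))
```

With the empty default the hyphen acts as a separator, so a phrase search for "status quo" can match on consecutive locations.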

>> For some people, punctuation has meaning. Let's say we have part
>> numbers or dates. "3/24/03" isn't really the same as "32403" and
>> I'm not sure the phrase search works well either.
>
> Ah, yes.  All three would be too short to be indexed...  But isn't
> that what  extra_word_characters  is for?

Yes. But my point is that we should eventually work out a WordToken 
class or something that wraps up all these attributes and can be 
generalized for Unicode-type issues.
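
To make the date example concrete, here is a small sketch of how these attributes interact. Everything in it is an illustrative assumption: the function is hypothetical, the minimum length of 3 is simply assumed for the example, and the ASCII-only character class is exactly the sort of thing a WordToken-style class would need to generalize.

```python
import re

# Assumed for this example only; the real minimum is a config setting.
MINIMUM_WORD_LENGTH = 3

def indexable_words(text, extra_word_characters=""):
    """Return the words long enough to index (hypothetical helper)."""
    # ASCII letters and digits, plus any extra characters the
    # caller wants treated as part of a word.
    word_chars = "A-Za-z0-9" + re.escape(extra_word_characters)
    words = re.findall("[" + word_chars + "]+", text)
    return [w for w in words if len(w) >= MINIMUM_WORD_LENGTH]

# "/" is a separator: "3", "24" and "03" are all too short.
print(indexable_words("3/24/03"))
# "/" is a word character: the date survives as one token.
print(indexable_words("3/24/03", extra_word_characters="/"))
```

The first call indexes nothing, the second keeps the date whole, which is the trade-off between splitting on punctuation and keeping it.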

-Geoff