From: Geoff H. <ghu...@ws...> - 2003-03-04 03:37:49
> That could have its own problems. If they are labelled -1, -2, ...
> then phrase searching would have to match *backwards* for negative
> numbers. Then if true positions overflowed into negative numbers,
> ...very negative number, then it is essentially starting from a very
> large (unsigned) location. Thoughts?

It's pretty easy to come up with an n-bit integer that should be long
enough for practical purposes. 2^16 = 65,536, which is probably still a
bit too small for the maximum number of words in a document. But 2^24
gives us a good 16 million words, which is good enough for War and
Peace. (I'm checking at the moment.)

> Regarding flexibility, we could make htsearch treat words separated
> by "invalid" punctuation (but no spaces) as a phrase, and make the
> default valid_punctuation empty. That way people who want the
> current functionality can have it (except queries where words are not
> separated by spaces but *should* match those words separately?) but
> the default would be less buggy for phrase searches.

Sounds sensible to me--but I think we need more than one or two voices
on this. But just to make sure I'm clear on what you want to do...

status-quo -> status (location 0) + quo (location 1)

And there's no entry for "statusquo".

>> For some people, punctuation has meaning. Let's say we have part
>> numbers or dates. "3/24/03" isn't really the same as "32403" and
>> I'm not sure the phrase search works well either.
>
> Ah, yes. All three would be too short to be indexed... But isn't
> that what extra_word_characters is for?

Yes. But my point is that we should eventually work out a WordToken
class or something that wraps up all these attributes and can be
generalized for Unicode-type issues.

-Geoff
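
The splitting behaviour and the 24-bit location limit discussed above can be sketched roughly as follows. This is only an illustration in Python, not ht://Dig code: `index_words`, `MAX_LOCATION_BITS`, and `MAX_LOCATIONS` are hypothetical names, the `valid_punctuation` parameter merely imitates the config attribute of the same name (characters in it are dropped so the parts join into one word), and minimum word length is ignored entirely.

```python
import re

# Hypothetical width of the word-location field (an assumption for
# illustration, taken from the 2^24 figure in the message above).
MAX_LOCATION_BITS = 24
MAX_LOCATIONS = 2 ** MAX_LOCATION_BITS  # 16,777,216 distinct locations


def index_words(text, valid_punctuation=""):
    """Return (word, location) pairs for a piece of text.

    Characters listed in valid_punctuation are removed, so the parts
    around them join into a single word. Any other non-alphanumeric
    character separates words, and each word gets the next location.
    """
    for ch in valid_punctuation:
        text = text.replace(ch, "")
    words = re.findall(r"[A-Za-z0-9]+", text.lower())
    if len(words) > MAX_LOCATIONS:
        raise OverflowError("more words than the location field can hold")
    return [(word, location) for location, word in enumerate(words)]


# With an empty valid_punctuation, the hyphen separates two words
# at consecutive locations and no "statusquo" entry is produced.
print(index_words("status-quo"))
# With "-" treated as valid punctuation, the parts are joined instead.
print(index_words("status-quo", valid_punctuation="-"))
```

Under these assumptions, a phrase search for "status quo" would match the two consecutive locations, while a date such as "3/24/03" would split into three short tokens.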