From: Lachlan A. <lh...@us...> - 2003-02-28 22:55:33
Thanks for your explanations, Geoff :)  More questions follow.

On Saturday 01 March 2003 04:51, Geoff Hutchison wrote:
> > 1. location must be in the range 0-1000?
> That's a 3.1-ism.
>
> > 2. Could we add "meta" information
> > at successive locations starting from, say, location 10,000?
>
> Actually, now that I think about it, a better idea is to use
> negative word locations for META information.
> As for some other arbitrary
> number--we might actually have documents that long (esp. with PDF
> indexing).

That could have its own problems.  If they are labelled -1, -2, ...
then phrase searching would have to match *backwards* for negative
numbers.  Then if true positions overflowed into negative numbers,
the phrases wouldn't match.  (If such overflow is impossible with
n-bit numbers, we could use *unsigned* locations, and count forward
from 2^(n-1) for meta information.)  If we count *forward* from a
very negative number, then it is essentially starting from a very
large (unsigned) location.  Thoughts?

> > 3. With phrase searching, do we still need valid_punctuation?
> > For example, "post-doctoral"
>
> This is a strange example. What if I had a hyphenated word? I don't
> know that your "phrase conversion" is the best solution. What we do
> need is a flexible "word parser" that addresses some of these
> issues.

I suppose a key is how often people do phrase searches vs word
searches.  Optionally-hyphenated words are trouble-prone since the
status-quo gives oh-so-many false-negatives for non-hyphenated
phrase-queries applied to over-hyphenated text...  (The suggestion
was based on what Google does.)

Regarding flexibility, we could make htsearch treat words separated
by "invalid" punctuation (but no spaces) as a phrase, and make the
default valid_punctuation empty.  That way people who want the
current functionality can have it (except queries where words are not
separated by spaces but *should* match those words separately?) but
the default would be less buggy for phrase searches.

> For some people, punctuation has meaning. Let's say we have part
> numbers or dates. "3/24/03" isn't really the same as "32403" and
> I'm not sure the phrase search works well either.

Ah, yes.  All three would be too short to be indexed...  But isn't
that what extra_word_characters is for?

> > 4. Does anybody know what the existing external parsers do about
> > words less than the minimum length?
> I don't think most external parsers bother with the config file.
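
To make the unsigned-location idea from point 2 concrete, here is a minimal sketch, assuming 32-bit locations.  All names in it (LOCATION_BITS, META_BASE, body_location, meta_location, is_meta, is_phrase) are hypothetical and are not existing htsearch code.  It only illustrates that counting meta words *forward* from 2^(n-1) lets the same "consecutive locations" test serve body text and meta text alike.

```python
# Hypothetical sketch only: these names are illustrative assumptions,
# not part of htsearch or any existing word database code.

LOCATION_BITS = 32                        # assumed width of a word location
META_BASE = 1 << (LOCATION_BITS - 1)      # 2^(n-1): first location for meta words
MAX_LOCATION = (1 << LOCATION_BITS) - 1   # largest unsigned n-bit location


def body_location(position):
    """Location of the position-th body word (0-based)."""
    if not 0 <= position < META_BASE:
        raise OverflowError("body position would run into the meta range")
    return position


def meta_location(position):
    """Location of the position-th meta word, counting forward from META_BASE."""
    location = META_BASE + position
    if not META_BASE <= location <= MAX_LOCATION:
        raise OverflowError("meta position does not fit in an unsigned location")
    return location


def is_meta(location):
    """True if the location lies in the range reserved for meta words."""
    return location >= META_BASE


def is_phrase(locations):
    """True if the locations are consecutive and do not straddle the
    boundary between body text and meta text."""
    consecutive = all(b - a == 1 for a, b in zip(locations, locations[1:]))
    same_range = len({is_meta(loc) for loc in locations}) <= 1
    return consecutive and same_range
```

Because both ranges count upward, a phrase match never has to step backwards; the only extra check is that the last possible body location and the first meta location are not treated as adjacent words.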