From: Geoff H. <ghu...@ws...> - 2003-02-28 17:59:50
> 1. Why do the documentation for external_parser and the comments
> before Retriever::got_word both say that the word location must be
> in the range 0-1000?

That's a 3.1-ism. The documentation is wrong. Oops.

> first word of any *other* entry. Could we add "meta" information at
> successive locations starting from, say, location 10,000?

Actually, now that I think about it, a better idea is to use negative
word locations for META information. This would leave "0" empty and
make it impossible to match across the boundary, but fix phrase
searching for META words (see the sketch at the end of this message).
As for some other arbitrary number--we might actually have documents
that long (esp. with PDF indexing).

> 3. With phrase searching, do we still need valid_punctuation? For
> example, "post-doctoral" currently gets entered as three words at the
> *same* location: "post", "doctoral" and "postdoctoral". Would it be
> better to convert queries for post-doctoral into the phrase "post

This is a strange example. What if I had a hyphenated word? I don't
know that your "phrase conversion" is the best solution. What we do
need is a flexible "word parser" that addresses some of these issues.
After all, Unicode raises even more problems about "what is a word."

> "the non-smoker" will match "the smoker", since all the words are
> given the same position in the database, but a search for "the non
> smoker" won't match "the non-smoker". This also reduces the size of

For some people, punctuation has meaning. Let's say we have part
numbers or dates. "3/24/03" isn't really the same as "32403" and I'm
not sure the phrase search works well either.

Yes, reducing the database size would improve speed. Perhaps Gilles
can comment on the motivations for the compound-word additions. (I'm
having a hard time pulling them up in my mail archive or on the web.)

> 4. Does anybody know what the existing external parsers do about words
> less than the minimum length? Because they are passed the

I don't think most external parsers bother with the config file.
Remember that any word should go through HtWordList, and this should
throw out words that are too long, too short, in the bad_words list,
etc.

-Geoff
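P.S. To make the negative-location idea concrete, here's a rough
sketch. All names are made up for illustration--this isn't actual
ht://Dig code--and it assumes a signed location type:

    // Body words count up from 1; META words count down from -1.
    // Location 0 is deliberately left empty, so no phrase can ever
    // match across the body/META boundary.
    class LocationCounter
    {
    public:
        LocationCounter() : body_(0), meta_(0) {}
        int nextBody() { return ++body_; }     // 1, 2, 3, ...
        int nextMeta() { return --meta_; }     // -1, -2, -3, ...
        static bool isMeta(int loc) { return loc < 0; }
    private:
        int body_;
        int meta_;
    };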
From: Lachlan A. <lh...@us...> - 2003-02-28 22:55:33
Thanks for your explanations, Geoff :) More questions follow.

On Saturday 01 March 2003 04:51, Geoff Hutchison wrote:
> > 1. location must be in the range 0-1000?
> That's a 3.1-ism.
>
> > 2. Could we add "meta" information
> > at successive locations starting from, say, location 10,000?
>
> Actually, now that I think about it, a better idea is to use
> negative word locations for META information.
> As for some other arbitrary
> number--we might actually have documents that long (esp. with PDF
> indexing).

That could have its own problems. If they are labelled -1, -2, ...
then phrase searching would have to match *backwards* for negative
numbers. Then if true positions overflowed into negative numbers,
the phrases wouldn't match. (If such overflow is impossible with
n-bit numbers, we could use *unsigned* locations, and count forward
from 2^(n-1) for meta information--a rough sketch follows at the end
of this message.) If we count *forward* from a very negative number,
then it is essentially starting from a very large (unsigned)
location. Thoughts?

> > 3. With phrase searching, do we still need valid_punctuation?
> > For example, "post-doctoral"
>
> This is a strange example. What if I had a hyphenated word? I don't
> know that your "phrase conversion" is the best solution. What we do
> need is a flexible "word parser" that addresses some of these
> issues.

I suppose a key is how often people do phrase searches vs word
searches. Optionally-hyphenated words are trouble-prone since the
status-quo gives oh-so-many false-negatives for non-hyphenated
phrase-queries applied to over-hyphenated text... (The suggestion
was based on what Google does.)

Regarding flexibility, we could make htsearch treat words separated
by "invalid" punctuation (but no spaces) as a phrase, and make the
default valid_punctuation empty. That way people who want the
current functionality can have it (except queries where words are not
separated by spaces but *should* match those words separately?) but
the default would be less buggy for phrase searches.

> For some people, punctuation has meaning. Let's say we have part
> numbers or dates. "3/24/03" isn't really the same as "32403" and
> I'm not sure the phrase search works well either.

Ah, yes. All three would be too short to be indexed... But isn't
that what extra_word_characters is for?

> > 4. Does anybody know what the existing external parsers do about
> > words less than the minimum length?
> I don't think most external parsers bother with the config file.
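P.S. Here is the rough sketch of the unsigned-location idea I
mentioned above. All names are made up, and it assumes 32-bit
locations--a sketch, not a proposed patch:

    #include <cstdint>

    // Unsigned locations: body text counts up from 1, META counts up
    // from 2^31. Body text would have to reach 2^31 words before
    // colliding with META, and consecutive META words still get
    // *increasing* locations, so phrase matching works forwards on
    // both sides of the split.
    const uint32_t kMetaBase = UINT32_C(1) << 31;   // 2^31

    inline uint32_t bodyLocation(uint32_t wordIndex)   // 1, 2, 3, ...
    {
        return wordIndex;
    }

    inline uint32_t metaLocation(uint32_t wordIndex)   // 2^31, 2^31+1, ...
    {
        return kMetaBase + wordIndex;
    }

    inline bool isMeta(uint32_t loc) { return loc >= kMetaBase; }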
From: Geoff H. <ghu...@ws...> - 2003-03-04 03:37:49
> That could have its own problems. If they are labelled -1, -2, ...
> then phrase searching would have to match *backwards* for negative
> numbers. Then if true positions overflowed into negative numbers,
> ...very negative number, then it is essentially starting from a very
> large (unsigned) location. Thoughts?

It's pretty easy to come up with an n-bit integer that should be long
enough for practical purposes. 2^16 = 65,536, which is probably still
a bit too small for the maximum number of words in a document. But
2^24 gives us a good 16 million words, which is good enough for War
and Peace. (I'm checking at the moment.)

> Regarding flexibility, we could make htsearch treat words separated
> by "invalid" punctuation (but no spaces) as a phrase, and make the
> default valid_punctuation empty. That way people who want the
> current functionality can have it (except queries where words are not
> separated by spaces but *should* match those words separately?) but
> the default would be less buggy for phrase searches.

Sounds sensible to me--but I think we need more than one or two
voices on this. But just to make sure I'm clear on what you want to
do:

  status-quo -> status (location 0) + quo (location 1)

And there's no entry for "statusquo".

>> For some people, punctuation has meaning. Let's say we have part
>> numbers or dates. "3/24/03" isn't really the same as "32403" and
>> I'm not sure the phrase search works well either.
>
> Ah, yes. All three would be too short to be indexed... But isn't
> that what extra_word_characters is for?

Yes. But my point is that we should eventually work out a WordToken
class or something that wraps up all these attributes and can be
generalized for Unicode-type issues (a rough sketch follows below).

-Geoff
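P.S. Very roughly, the kind of wrapper I mean--all names made up,
nothing here is existing ht://Dig code:

    #include <cstddef>
    #include <cstdint>
    #include <string>

    // One place for "what is a word" decisions (valid_punctuation,
    // extra_word_characters, minimum/maximum length, and eventually
    // Unicode), instead of scattering them across the parsers.
    class WordToken
    {
    public:
        WordToken(const std::string& text, uint32_t location)
            : text_(text), location_(location) {}

        const std::string& text() const { return text_; }
        uint32_t location() const { return location_; }

        // Hooks where per-config policy would plug in:
        bool tooShort(std::size_t min) const { return text_.size() < min; }
        bool tooLong(std::size_t max) const { return text_.size() > max; }

    private:
        std::string text_;
        uint32_t location_;
    };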
From: Geoff H. <ghu...@ws...> - 2003-03-04 04:15:24
On Monday, March 3, 2003, at 09:37 PM, Geoff Hutchison wrote:
> for the maximum number of words in a document. But 2^24 gives us a
> good 16-million words, which is good enough for War and Peace. (I'm
> checking at the moment.)
Well, we might get by on less than that:
(These are the Project Gutenberg etext editions of _War and Peace_ and
the King James Bible.)
localhost: ghutchis% wc wrnpc10.txt
67337 566237 3282452 wrnpc10.txt
localhost: ghutchis% wc bible11.txt
114385 822894 4959549 bible11.txt
So I'd guess that 2^20 should be more than enough words. Or does
someone have a nice long document to prove me wrong?
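(For the record, 2^20 = 1,048,576, so even the Bible etext's 822,894
words fit with plenty of headroom.)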
-Geoff