From: Lachlan A. <lh...@us...> - 2003-02-28 12:23:28
Greetings all,

I'm checking through phrase searching, and have found several possible bugs. First, some questions...

1. Why do the documentation for external_parser and the comments before Retriever::got_word both say that the word location must be in the range 0-1000? The HTML parser doesn't stick to that. If locations are just scaled down (rather than reduced modulo 1001), that will break the phrase searches. Is there a maximum in practice?

2. Every "meta" data entry (<title>, <meta ...> etc.) gets added as if it starts at location 0. This gives *heaps* of false positives, because the second word of *any* entry is deemed adjacent to the first word of any *other* entry. Could we add "meta" information at successive locations starting from, say, location 10,000?

3. With phrase searching, do we still need valid_punctuation? For example, "post-doctoral" currently gets entered as three words at the *same* location: "post", "doctoral" and "postdoctoral". Would it be better to convert queries for post-doctoral into the phrase "post doctoral", and simply store the words post and doctoral at successive locations in the database? As it stands, a search for "the non-smoker" will match "the smoker", since all the words are given the same position in the database, but a search for "the non smoker" won't match "the non-smoker". The change would also reduce the size of the database (marginally in most cases, but significantly for pathological documents). Now that there is phrase searching, is there any benefit to the current approach?

4. Does anybody know what the existing external parsers do about words shorter than the minimum length? Because they are passed the configuration file, they *could* omit them. Currently the HTML parser omits them, but that introduces false positives into phrase queries, and I want to fix that.

Thanks!
Lachlan
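
P.S. For illustration, here is a toy model of points 2 and 3. It is a minimal sketch in Python, not the actual indexing code: the names build_index and phrase_match, the sample words, and the rule that a phrase matches when its words sit at consecutive locations are all assumptions made for the example.

```python
from collections import defaultdict


def build_index(postings):
    """Build {word: set of locations} from (word, location) pairs."""
    index = defaultdict(set)
    for word, loc in postings:
        index[word].add(loc)
    return index


def phrase_match(index, words):
    """True if the words occur at consecutive locations (assumed rule)."""
    return any(
        all(start + i in index.get(w, set()) for i, w in enumerate(words))
        for start in index.get(words[0], set())
    )


# Point 2: two hypothetical meta entries, each starting at location 0.
meta_at_zero = build_index([
    ("phrase", 0), ("searching", 1),   # e.g. a <title>
    ("broken", 0), ("links", 1),       # e.g. a <meta> description
])
# "broken searching" appears in neither entry, yet it matches.
false_positive = phrase_match(meta_at_zero, ["broken", "searching"])

# Same entries placed at successive locations starting from 10,000.
meta_offset = build_index([
    ("phrase", 10000), ("searching", 10001),
    ("broken", 10002), ("links", 10003),
])
after_offset = phrase_match(meta_offset, ["broken", "searching"])

# Point 3: "the non-smoker" stored with all three variants at one location.
compound = build_index([
    ("the", 0), ("non", 1), ("smoker", 1), ("nonsmoker", 1),
])
# The query "the non smoker" needs "smoker" at location 2, so it misses.
split_query_matches = phrase_match(compound, ["the", "non", "smoker"])

print(false_positive, after_offset, split_query_matches)
```

In this model the first check is a spurious hit, the second shows that spreading the meta entries over distinct locations removes it, and the third shows why the split query fails against the compound word as currently stored.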