Greetings all,
I'm checking through phrase searching, and have found several possible
bugs. First, some questions...
1. Why do the documentation for external_parser and the comments
   before Retriever::got_word both say that the word location must be
   in the range 0-1000? The HTML parser doesn't stick to that. If
   locations are just scaled down (rather than reduced modulo 1001),
   that will break the phrase searches. Is there a maximum in practice?
2. Every "meta" data entry (<title>, <meta ...> etc.) gets added as if
   it starts at location 0. This gives *heaps* of false-positives,
   because the second word of *any* entry is deemed adjacent to the
   first word of any *other* entry. Could we add "meta" information at
   successive locations starting from, say, location 10,000? (There is
   a rough sketch of what I mean after this list.)
3. With phrase searching, do we still need valid_punctuation? For
   example, "post-doctoral" currently gets entered as three words at
   the *same* location: "post", "doctoral" and "postdoctoral". Would it
   be better to convert "post-doctoral" into the phrase "post doctoral"
   in queries, and simply store the words post and doctoral at
   successive locations in the database? (See the other sketch after
   this list.) As it stands, a search for "the non-smoker" will match
   "the smoker", since all the words are given the same position in the
   database, but a search for "the non smoker" won't match "the
   non-smoker". The change would also reduce the size of the database
   (marginally in most cases, but significantly for pathological
   documents). Now that there is phrase searching, is there any benefit
   to the current approach?
4. Does anybody know what the existing external parsers do about words
   shorter than the minimum length? Because they are passed the
   configuration file, they *could* omit them. Currently the HTML
   parser omits them, but that introduces false-positives into phrase
   queries (the words on either side of an omitted word end up looking
   adjacent), and I want to fix that.
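
For (2), here is roughly what I have in mind: one location counter shared
by all the "meta" entries, so every meta word gets its own location instead
of everything piling up at 0. This is only an untested sketch --
add_word(), index_meta_entry() and the entry gap are mine, not anything in
the current code; add_word() just stands in for whatever eventually calls
Retriever::got_word, and 10,000 is the arbitrary base I mentioned above.

    #include <iostream>
    #include <sstream>
    #include <string>

    // Stand-in for the call that eventually reaches Retriever::got_word();
    // just prints for the purposes of this sketch.
    void add_word(const std::string &word, int location)
    {
        std::cout << word << " @ " << location << "\n";
    }

    const int META_BASE_LOCATION = 10000; // arbitrary base, past the body text
    const int META_ENTRY_GAP     = 10;    // keep separate entries apart

    // Index one "meta" entry (<title>, <meta ...>, ...), giving its words
    // successive locations and advancing a counter shared by all entries.
    void index_meta_entry(const std::string &text, int &next_location)
    {
        std::istringstream words(text);
        std::string word;
        while (words >> word)
            add_word(word, next_location++);
        next_location += META_ENTRY_GAP;  // so a phrase can't span two entries
    }

    int main()
    {
        int meta_location = META_BASE_LOCATION;
        index_meta_entry("A title about smoking", meta_location);
        index_meta_entry("a description of something else", meta_location);
        // The two entries now occupy disjoint location ranges, so no word
        // of one entry looks adjacent to a word of the other.
    }

The gap between entries is my own embellishment; plain successive locations
would still let a phrase straddle the end of one entry and the start of the
next.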
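
For (3), the indexing side could split on the valid_punctuation characters
and bump the location for each piece, rather than also entering the joined
form at the same spot. Again just a sketch: split_on_punctuation() and the
hard-coded '-' and '\'' are mine; in reality the characters would come from
the valid_punctuation attribute in the configuration.

    #include <iostream>
    #include <string>
    #include <vector>

    // "post-doctoral" -> {"post", "doctoral"}: one piece per word, each to
    // get its own location, instead of "post", "doctoral" and
    // "postdoctoral" all at the same location.
    std::vector<std::string> split_on_punctuation(const std::string &word)
    {
        std::vector<std::string> parts;
        std::string current;
        for (size_t i = 0; i < word.size(); i++)
        {
            char c = word[i];
            if (c == '-' || c == '\'')   // whatever valid_punctuation holds
            {
                if (!current.empty())
                    parts.push_back(current);
                current.clear();
            }
            else
                current += c;
        }
        if (!current.empty())
            parts.push_back(current);
        return parts;
    }

    int main()
    {
        const char *text[] = { "the", "non-smoker" };
        int location = 0;
        for (size_t w = 0; w < 2; w++)
        {
            std::vector<std::string> parts = split_on_punctuation(text[w]);
            for (size_t i = 0; i < parts.size(); i++)
                std::cout << parts[i] << " @ " << location++ << "\n";
        }
        // Prints: the @ 0, non @ 1, smoker @ 2 -- so the phrase
        // "the non smoker" matches and "the smoker" does not.
    }

The query parser would then only need to turn "post-doctoral" into the
phrase "post doctoral" for the two sides to line up.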
Thanks!
Lachlan