[htdig-dev] Several questions...

Greetings all,

I'm checking through phrase searching, and have found several possible
bugs.  First, some questions...

1. Why do the documentation for external_parser and the comments
before Retriever::got_word both say that the word location must be
in the range 0-1000?  The HTML parser doesn't stick to that.  If
locations are just scaled down (rather than reduced modulo 1001),
that will break the phrase searches.  Is there a maximum in practice?
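
To illustrate the scaling concern, here is a minimal Python sketch.  It is
not ht://Dig code: scale(), wrap() and adjacent() are hypothetical helpers,
and the integer scaling formula is only my assumption of what "scaled down"
would mean.

```python
TOP = 1000  # upper bound taken from the documented 0-1000 range

def scale(loc, doc_len):
    """Hypothetical scaling: map position loc of a doc_len-word document into 0..TOP."""
    return loc * TOP // max(doc_len - 1, 1)

def wrap(loc):
    """Reduction modulo 1001, the alternative mentioned above."""
    return loc % (TOP + 1)

def adjacent(a, b):
    """True when b is the location immediately after a."""
    return b - a == 1

raw = [2500, 2501, 2502]                 # three consecutive words
scaled = [scale(p, 5000) for p in raw]   # assumed 5000-word document
wrapped = [wrap(p) for p in raw]

print(scaled)                            # [500, 500, 500]
print(adjacent(scaled[0], scaled[1]))    # False
print(adjacent(wrapped[0], wrapped[1]))  # True
```

Under that assumption, three consecutive words collapse onto one location
when scaled, so a phrase test on adjacency fails.  Reduction modulo 1001
keeps neighbours adjacent in this example, although it would still lose
adjacency at the wrap from 1000 back to 0.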

2. Every "meta" data entry (<title>, <meta ...> etc.) gets added as if
it starts at location 0.  This gives *heaps* of false positives,
because the second word of *any* entry is deemed adjacent to the
first word of any *other* entry.  Could we add "meta" information at
successive locations starting from, say, location 10,000?
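
A rough Python sketch of the difference, not ht://Dig code: index_fields()
and phrase_match() are hypothetical helpers, and the one-location gap left
between entries is my own addition so that the last word of one entry is
never adjacent to the first word of the next.

```python
def index_fields(fields, base=None):
    """Return {word: set of locations}.

    base=None restarts every entry at location 0 (the behaviour described
    above); otherwise entries are laid out one after another from base.
    """
    index = {}
    next_loc = base
    for text in fields:
        loc = 0 if base is None else next_loc
        for word in text.lower().split():
            index.setdefault(word, set()).add(loc)
            loc += 1
        if base is not None:
            next_loc = loc + 1  # assumed gap of one location between entries
    return index

def phrase_match(index, first, second):
    """True when `second` occurs at the location right after `first`."""
    return any(p + 1 in index.get(second, ()) for p in index.get(first, ()))

fields = ["Annual Report", "Search Engine"]  # e.g. a title and a meta entry
restart = index_fields(fields)
successive = index_fields(fields, base=10000)

print(phrase_match(restart, "annual", "engine"))     # True  (false positive)
print(phrase_match(successive, "annual", "engine"))  # False
print(phrase_match(successive, "annual", "report"))  # True
```

With every entry restarting at 0, "annual engine" matches even though the
words come from different entries.  With successive locations only the real
phrases match.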

3. With phrase searching, do we still need valid_punctuation?  For
example, "post-doctoral" currently gets entered as three words at the
*same* location:  "post", "doctoral" and "postdoctoral".  Would it be
better to convert a query for post-doctoral into the phrase "post
doctoral", and simply store the words post and doctoral at
successive locations in the database?  As it stands, a search for
"the non-smoker" will match "the smoker", since all the words are
given the same position in the database, but a search for "the non
smoker" won't match "the non-smoker".  The change would also reduce the
size of the database (marginally in most cases, but significantly for
pathological documents).  Now that there is phrase searching, is
there any benefit to the current approach?
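
The two schemes can be compared with a small Python model.  This is only a
sketch, not ht://Dig code: all function names are hypothetical, and I am
assuming that the current query parser treats a hyphenated query word as
"any of its forms at this position", which is my reading of the behaviour
rather than something checked against the source.

```python
def forms(token):
    """Parts of a hyphenated token plus the joined form, e.g. post, doctoral, postdoctoral."""
    parts = token.split("-")
    joined = {"".join(parts)} if len(parts) > 1 else set()
    return set(parts) | joined

def index_current(text):
    """Described current scheme: every form of a token shares one location."""
    index = {}
    for loc, token in enumerate(text.lower().split()):
        for word in forms(token):
            index.setdefault(word, set()).add(loc)
    return index

def index_proposed(text):
    """Proposed scheme: hyphen parts are stored at successive locations."""
    index = {}
    for loc, word in enumerate(text.lower().replace("-", " ").split()):
        index.setdefault(word, set()).add(loc)
    return index

def query_current(query):
    """Assumed: one slot per token, any form of the token may fill it."""
    return [forms(token) for token in query.lower().split()]

def query_proposed(query):
    """Proposed: a hyphenated query word becomes a phrase of its parts."""
    return [{word} for word in query.lower().replace("-", " ").split()]

def match(index, slots):
    """True when some start location has a word from each slot at consecutive locations."""
    starts = set().union(*(index.get(word, set()) for word in slots[0]))
    return any(
        all(any(s + i in index.get(w, ()) for w in alts) for i, alts in enumerate(slots))
        for s in starts
    )

print(match(index_current("the smoker"), query_current("the non-smoker")))       # True
print(match(index_current("the non-smoker"), query_current("the non smoker")))   # False
print(match(index_proposed("the smoker"), query_proposed("the non-smoker")))     # False
print(match(index_proposed("the non-smoker"), query_proposed("the non smoker"))) # True
```

In this model the current scheme gives both the false positive and the
missed match from the example above, and the proposed scheme gives neither.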

4. Does anybody know what the existing external parsers do about words
shorter than the minimum length?  Because they are passed the
configuration file, they *could* omit them.  Currently the HTML
parser omits them, but that introduces false positives into phrase
queries, and I want to fix that.

Thanks!
Lachlan