[htdig-dev] Re: Residual database errors

On Thursday 20 February 2003 02:19, Gilles Detillieux wrote:
> According to Lachlan Andrew:
> > I hadn't realised it, but the
> > valid_punctuation  attribute seems to be treated as an *optional*
> > word break.  (The docs say it is *not* a word break)
>
> I guess the docs haven't kept up with what the code does.
> this functionality was extended to also index each word part,
> so that something like "post-doctoral" gets indexed as
> postdoctoral, post and doctoral. This greatly enhances searches for
> compound words, or parts thereof, but it tends to break down when
> you're indexing something that's not really words...

Thanks for that clarification Gilles.

Would it be better to convert  post-doctoral  into the phrase
"post doctoral" in queries, and to store simply the words  post  and
doctoral  in the database?  As it stands, a search for "the
non-smoker" will match "the smoker", since all the words are given
the same position in the database.  The change would also reduce the
size of the database (marginally in most cases, but significantly for
pathological documents).  Now that there is phrase searching, is
there any benefit to the current approach?  If not, we could do away
with  valid_punctuation  entirely (after 3.2.0b5).
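
To make the position problem concrete, here is a rough Python sketch
of the two schemes.  It is a toy model, not the htdig code: the
function names, the hyphen-only punctuation and the in-memory index
are all assumptions made for illustration, and it shows only one
reading of the mismatch (a document containing "the non-smoker"
answering the phrase "the smoker").

```python
import re
from collections import defaultdict

def tokens(text):
    # Hypothetical tokenizer: lowercase words, hyphen kept inside compounds.
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())

def index_shared(text):
    """Current behaviour as described above: the joined compound and
    each of its parts are all stored at the same word position."""
    idx = defaultdict(set)
    for pos, tok in enumerate(tokens(text)):
        idx[tok.replace("-", "")].add(pos)
        if "-" in tok:
            for part in tok.split("-"):
                idx[part].add(pos)
    return idx

def index_split(text):
    """Proposed alternative: store only the parts, each at its own
    consecutive position."""
    idx = defaultdict(set)
    pos = 0
    for tok in tokens(text):
        for part in tok.split("-"):
            idx[part].add(pos)
            pos += 1
    return idx

def query_words(query):
    # Proposed query rewrite: "non-smoker" becomes the phrase "non smoker".
    return [part for tok in tokens(query) for part in tok.split("-")]

def phrase_match(idx, words):
    # True if the words occur at consecutive positions in the index.
    starts = idx.get(words[0], set())
    return any(
        all(start + i in idx.get(word, set()) for i, word in enumerate(words))
        for start in starts
    )

def entry_count(idx):
    # Number of (word, position) entries, a crude stand-in for database size.
    return sum(len(positions) for positions in idx.values())

doc = "the non-smoker"
shared = index_shared(doc)
split = index_split(doc)

print(phrase_match(shared, ["the", "smoker"]))             # spurious match
print(phrase_match(split, ["the", "smoker"]))              # no match
print(phrase_match(split, query_words("the non-smoker")))  # still found
print(entry_count(shared), entry_count(split))
```

With shared positions, "smoker" sits directly after "the", so the
phrase matches even though the document says the opposite.  With
split positions the compound is still found through the rewritten
phrase query, and the index holds one entry fewer for this document.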

> if you're going to feed a bunch of C code into htdig, you
> should probably do so with a severely stripped down setting of
> valid_punctuation.... However,
> if the underlying word database is solid, then it shouldn't fall
> apart no matter how much junk you throw at it.  the root
> cause of the trouble seems to be a bug somewhere in the code.

My thoughts exactly.  I'm only using this page for debugging...

Cheers,
Lachlan