From: Lachlan A. <lh...@us...> - 2003-02-22 08:50:37
On Thursday 20 February 2003 02:19, Gilles Detillieux wrote:
> According to Lachlan Andrew:
> > I hadn't realised it, but the
> > valid_punctuation attribute seems to be treated as an *optional*
> > word break. (The docs say it is *not* a word break)
>
> I guess the docs haven't kept up with what the code does.
> this functionality was extended to also index each word part,
> so that something like "post-doctoral" gets indexed as
> postdoctoral, post and doctoral. This greatly enhances searches for
> compound words, or parts thereof, but it tends to break down when
> you're indexing something that's not really words...

Thanks for that clarification, Gilles.

Would it be better to convert a query for post-doctoral into the
phrase "post doctoral", and simply store the words post and
doctoral in the database? As it stands, a search for "the
non-smoker" will match "the smoker", since all the words are given
the same position in the database. The change would also reduce the
size of the database (marginally in most cases, but significantly for
pathological documents). Now that there is phrase searching, is
there any benefit to the current approach? If not, we could do away
with valid_punctuation entirely (after 3.2.0b5).

> if you're going to feed a bunch of C code into htdig, you
> should probably do so with a severely stripped down setting of
> valid_punctuation.... However,
> if the underlying word database is solid, then it shouldn't fall
> apart no matter how much junk you throw at it. the root
> cause of the trouble seems to be a bug somewhere in the code.

My thoughts exactly. I'm only using this page for debugging...

Cheers,
Lachlan
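
To illustrate the position issue discussed above, here is a minimal Python sketch. It is a toy model, not htdig's code: the function names (index_current, index_proposed, phrase_match), the whitespace tokenisation, and the use of "-" as the only punctuation character are all assumptions made for the example. In this model, giving the compound and its parts the same position lets the phrase "the smoker" match a document that only contains "the non-smoker"; giving each part its own position avoids that.

```python
def index_current(text):
    """Toy model of the behaviour described above (an assumption, not htdig code):
    the joined compound and each of its parts all share one word position."""
    index = {}
    for pos, token in enumerate(text.lower().split()):
        parts = [p for p in token.split("-") if p]
        words = {token.replace("-", "")}
        if len(parts) > 1:
            words |= set(parts)
        for word in words:
            index.setdefault(word, set()).add(pos)
    return index


def index_proposed(text):
    """Toy model of the proposal: store only the parts, each at its own
    consecutive position, and rely on phrase searching at query time."""
    index = {}
    pos = 0
    for token in text.lower().split():
        for part in token.split("-"):
            if part:
                index.setdefault(part, set()).add(pos)
                pos += 1
    return index


def phrase_match(index, words):
    """Return True if the words occur at consecutive positions in the index."""
    return any(
        all(start + i in index.get(word, ()) for i, word in enumerate(words))
        for start in index.get(words[0], ())
    )


doc = "the non-smoker"
current = index_current(doc)
proposed = index_proposed(doc)

# Shared positions: "smoker" sits at the same position as "nonsmoker",
# so the phrase "the smoker" matches even though the document never says it.
false_hit_current = phrase_match(current, ["the", "smoker"])

# Separate positions: the query "the non-smoker" becomes the phrase
# "the non smoker", which matches, while "the smoker" does not.
false_hit_proposed = phrase_match(proposed, ["the", "smoker"])
true_hit_proposed = phrase_match(proposed, ["the", "non", "smoker"])
```

Note that in this sketch the proposed scheme no longer indexes the joined form "nonsmoker", so a query for that exact word would need separate handling.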