On Thu, 13 May 2004, Andrew Moise wrote:
> Date: Thu, 13 May 2004 12:46:01 -0400
> From: Andrew Moise <chops@...>
> To: lha@...
> Cc: Jim <lists@...>, htdig-dev@...
> Subject: Re: [htdig-dev] 3.2.0 - is it worth it??
> On Thu, 2004-05-13 at 08:58, Lachlan Andrew wrote:
> > My impression is that the ht://Dig project is basically dead :( The
> > existing code is of course still functional (and thanks Jim and
> > Gilles for all the support you give to the users!), but I don't think
> > there is enough enthusiasm to release a new version, either
> > 3.2 or 3.3. If I get enthusiastic in the next couple of weeks, I
> > might still try to put 3.2.0b6 together, but that is about as far as
> > it will go...
> As a user, I'm very sorry to hear this -- I just deployed 3.2.0b5 on a
> site I administer, and I've been very pleased with it. I've been
> waiting until the 3.2.0 release cycle was over to start trying to
> contribute some of my tweaks (I also need to talk to my employer about
> the legalities first), but I guess if the project is stagnating I should
> speak up (there's also the possibility that the imminent death of htdig
> just makes this extra silly, of course... *shrug*). In any case:
> So htdig does a bad job when multiple documents match a search in
> similar ways; this shows up particularly when your search query matches
> part of the header or footer of a section of your site, or when your
> search results include threads from a mailing list archive (in which
> case messages within a thread often show up consecutively in the
> results, which adds a lot of noise). I wrote some code (shoehorned in
> as a ScoreMatch, more for easy control by the 'sort' parameter than for
> any logical reason) which sorts the results once, then reduces the score
> of any match which is similar to matches that are higher in the list,
> then re-sorts the results; thus the high-ranked results that are returned
> tend to be more distinct than otherwise. This is marginally helpful with
> the header/footer problem (though the excerpts are still usually
> identical in that case), and very helpful with the mailing-list-thread
> problem. AFAICT it doesn't do too much harm to the results in the normal
> case.
> We also found it beneficial to tweak results' scores by matching their
> URLs against a handmade list of URL pieces and score-hacking factors
> (mailing list archives are mediocre, IRC archives are usually unhelpful,
> a particular section of documentation is generally very useful) -- I
> know this is gross, but it did wonders for the effectiveness of our
> search results, and a coworker of mine convinced me that it's not
> totally against nature -- humans really do have special knowledge of
> which sections of a site are generally "good," and with an hour or so of
> tweaking we got things in a state where close results from a "bad"
> section are presented above loose results from a "good" section when
> appropriate (more or less).
> It seems to me that it would be useful to generalize these little
> hacks into a search parameter listing which hacks should be applied; for
> example, to select the two score hacks described in the above paragraphs
> you could specify 'result_hacks=unique,urlmatch' in the search query or
> htdig.conf. htdig already has a couple of result hacks that could fit
> into this scheme (backlink_factor and date_factor), and I can think of
> one more at least that I'd like to add in my copious free time. It
> certainly would seem right to me to be able (a) to add stuff like the
> above tweaks to the codebase without forcing everyone to care about it,
> and (b) to test, tweak, and reorder the scoring hacks from a query
> parameter while trying to get things configured to work well.
> As I said, I've got (wrongly-integrated) code for the two tweaks I
> mentioned, which I can try to get into a presentable state, and I might
> be able to find time semi-soon to do the work for the general
> result_hacks parameter, if there are people that think either of those
> would be worthwhile. Are there such people?
Count one;) I like both.
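
To make the proposal concrete, here is a rough sketch of how the two
tweaks and a 'result_hacks' list might fit together. It is written in
Python for brevity and is not ht://Dig code (ht://Dig itself is C++).
Only the parameter name 'result_hacks' and the values 'unique' and
'urlmatch' come from the message above. Everything else is an
assumption made for illustration: the Match record, the word-overlap
similarity measure, the penalty and threshold values, and the URL
pieces and factors in URL_FACTORS.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Match:
    """Hypothetical search result; not an ht://Dig type."""
    url: str
    excerpt: str
    score: float


def similarity(a: Match, b: Match) -> float:
    """Word-set (Jaccard) overlap of two excerpts. An assumed measure;
    the original message does not say how similarity was computed."""
    wa = set(a.excerpt.lower().split())
    wb = set(b.excerpt.lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


def unique_hack(matches, penalty=0.5, threshold=0.6):
    """Sort once, then scale down any match that resembles a match
    ranked above it. penalty and threshold are made-up values."""
    ranked = sorted(matches, key=lambda m: m.score, reverse=True)
    out = []
    for i, m in enumerate(ranked):
        dupes = sum(1 for higher in ranked[:i]
                    if similarity(m, higher) >= threshold)
        out.append(replace(m, score=m.score * penalty ** dupes))
    return out


# Handmade (URL piece, factor) pairs. These entries are hypothetical.
URL_FACTORS = [("/lists/", 0.7), ("/irc/", 0.4), ("/docs/", 1.5)]


def urlmatch_hack(matches, factors=URL_FACTORS):
    """Multiply each score by the factor of every URL piece it contains."""
    out = []
    for m in matches:
        factor = 1.0
        for piece, f in factors:
            if piece in m.url:
                factor *= f
        out.append(replace(m, score=m.score * factor))
    return out


HACKS = {"unique": unique_hack, "urlmatch": urlmatch_hack}


def apply_result_hacks(matches, spec):
    """Apply the hacks named in a comma-separated spec, in order,
    then do the final sort by score (the "re-sort" step)."""
    for name in (s.strip() for s in spec.split(",")):
        if not name:
            continue
        if name not in HACKS:
            raise ValueError("unknown result hack: " + name)
        matches = HACKS[name](matches)
    return sorted(matches, key=lambda m: m.score, reverse=True)
```

Calling apply_result_hacks(results, "unique,urlmatch") runs both tweaks
in the order listed, so reordering or dropping a hack while tuning is
just a change to the string, which is point (b) in the message above.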
_/ _/_/_/ _/ ____________ __o
_/ _/ _/ _/ ______________ _-\<,_
_/ _/ _/_/_/ _/ _/ ......(_)/ (_)
_/_/ oe _/ _/. _/_/ ah jjah@...