Re: [htdig-dev] 3.2.0 - is it worth it??

On Thu, 2004-05-13 at 08:58, Lachlan Andrew wrote:
> My impression is that the ht://Dig project is basically dead :(  The 
> existing code is of course still functional (and thanks Jim and 
> Gilles for all the support you give to the users!), but I don't think 
> there is enough enthusiasm to either release a new version, either 
> 3.2 or 3.3. If I get enthusiastic in the next couple of weeks, I 
> might still try to put 3.2.0b6 together, but that is about as far as 
> it will go...

  As a user, I'm very sorry to hear this -- I just deployed 3.2.0b5 on a
site I administer, and I've been very pleased with it.  I've been
waiting until the 3.2.0 release cycle was over to start trying to
contribute some of my tweaks (I also need to talk to my employer about
the legalities first), but I guess if the project is stagnating I should
speak up (there's also the possibility that the imminent death of htdig
just makes this extra silly, of course... *shrug*).  In any case:
  So htdig does a bad job when multiple documents match a search in
similar ways; this shows up particularly when your search query matches
part of the header or footer of a section of your site, or when your
search results include threads from a mailing list archive (in which
case messages within a thread often show up consecutively in the
results, which adds a lot of noise).  I wrote some code (shoehorned in
as a ScoreMatch, more for easy control by the 'sort' parameter than for
any logical reason) which sorts the results once, then reduces the score
of any match that is similar to matches higher in the list, then
re-sorts the results; thus the high-ranked results that are returned
tend to be more distinct from one another than they otherwise would
be.  This is marginally helpful with
the header/footer problem (though the excerpts are still usually
identical in that case), and very helpful with the mailing-list-thread
problem. AFAICT it doesn't do too much harm to the results in the normal
case.
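
  For illustration, the pass described above could look roughly like
this in Python (htdig itself is C++, and this is not the actual patch).
The Match type, the word-overlap similarity measure, and the threshold
and penalty values are all assumptions; the description above does not
say how similarity is measured or by how much scores are reduced.

```python
from dataclasses import dataclass


@dataclass
class Match:
    # Hypothetical stand-in for a search result; not htdig's own type.
    url: str
    excerpt: str
    score: float


def similarity(a, b):
    """Jaccard overlap of the excerpts' word sets (an assumed measure)."""
    wa = set(a.excerpt.lower().split())
    wb = set(b.excerpt.lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


def demote_similar(matches, threshold=0.6, penalty=0.5):
    """Sort once, penalize near-duplicates of higher matches, re-sort."""
    # First pass: rank by the original score, best first.
    ranked = sorted(matches, key=lambda m: m.score, reverse=True)
    adjusted = []
    for i, m in enumerate(ranked):
        score = m.score
        # Compare only against matches that ranked higher in the first pass.
        if any(similarity(m, higher) >= threshold for higher in ranked[:i]):
            score *= penalty
        adjusted.append(Match(m.url, m.excerpt, score))
    # Second pass: re-sort using the adjusted scores.
    return sorted(adjusted, key=lambda m: m.score, reverse=True)


if __name__ == "__main__":
    results = [
        Match("list/msg1", "install htdig on linux server", 10.0),
        Match("list/msg2", "install htdig on linux server", 9.0),
        Match("docs/fuzzy", "configure fuzzy matching endings", 8.0),
    ]
    for m in demote_similar(results):
        print(m.url, m.score)
```

Each match is compared with every higher-ranked one, so the cost grows
quadratically with the number of results; applying it only to the top
few pages of results would presumably keep that manageable.
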
  We also found it beneficial to tweak results' scores by matching their
URLs against a handmade list of URL pieces and score-hacking factors
(mailing list archives are mediocre, IRC archives are usually unhelpful,
a particular section of documentation is generally very useful) -- I
know this is gross, but it did wonders for the effectiveness of our
search results, and a coworker of mine convinced me that it's not
totally against nature -- humans really do have special knowledge of
which sections of a site are generally "good," and with an hour or so of
tweaking we got things in a state where close results from a "bad"
section are presented above loose results from a "good" section when
appropriate (more or less).
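
  A minimal sketch of the URL-based adjustment, in Python rather than
htdig's C++.  The URL fragments, the factor values, and the choice to
multiply factors when several fragments match are all made up for the
example; the real list was tuned by hand for one particular site.

```python
# Hypothetical hand-made table: (URL fragment, score multiplier).
URL_FACTORS = [
    ("/mailing-lists/", 0.6),  # archives: mediocre
    ("/irc-logs/", 0.3),       # IRC logs: usually unhelpful
    ("/docs/guide/", 1.5),     # a section known to be useful
]


def url_factor(url, factors=URL_FACTORS):
    """Combine the multipliers of every fragment found in the URL."""
    result = 1.0
    for fragment, factor in factors:
        if fragment in url:
            result *= factor
    return result


def apply_url_factors(results, factors=URL_FACTORS):
    """Rescale (url, score) pairs by URL and return them best first."""
    rescored = [(url, score * url_factor(url, factors))
                for url, score in results]
    return sorted(rescored, key=lambda r: r[1], reverse=True)


if __name__ == "__main__":
    hits = [
        ("http://example.org/irc-logs/2004-05-13.html", 10.0),
        ("http://example.org/docs/guide/install.html", 4.0),
    ]
    for url, score in apply_url_factors(hits):
        print(url, score)
```

Because the factors scale the score rather than replace it, a strong
enough match from a "bad" section can still outrank a weak match from a
"good" one, which is the behaviour described above.
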
  It seems to me that it would be useful to generalize these little
hacks into a search parameter listing which hacks should be applied; for
example, to select the two score hacks described in the above paragraphs
you could specify 'result_hacks=unique,urlmatch' in the search query or
htdig.conf.  htdig already has a couple of result hacks that could fit
into this scheme (backlink_factor and date_factor), and I can think of
one more at least that I'd like to add in my copious free time.  It
certainly would seem right to me to be able (a) to add stuff like the
above tweaks to the codebase without forcing everyone to care about it,
and (b) to test, tweak, and reorder the scoring hacks from a query
parameter while trying to get things configured to work well.
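
  To make the idea concrete, here is one way such a parameter could be
dispatched, sketched in Python.  Nothing like this exists in htdig
today: the registry, the function names, and the two toy hacks
("archive" and "docs") are placeholders invented for the example, and
results are plain (url, score) pairs.

```python
def demote_archive(results):
    # Toy hack: halve the score of anything under /archive/.
    return [(u, s * 0.5 if "/archive/" in u else s) for u, s in results]


def boost_docs(results):
    # Toy hack: double the score of anything under /docs/.
    return [(u, s * 2.0 if "/docs/" in u else s) for u, s in results]


# Hypothetical registry mapping hack names to functions.
REGISTRY = {"archive": demote_archive, "docs": boost_docs}


def parse_result_hacks(value):
    """Split a value such as 'unique,urlmatch' into a list of names."""
    return [name.strip() for name in value.split(",") if name.strip()]


def apply_result_hacks(results, value, registry=REGISTRY):
    """Run the named hacks in the order given, then sort best first."""
    for name in parse_result_hacks(value):
        if name not in registry:
            raise ValueError("unknown result hack: " + name)
        results = registry[name](results)
    return sorted(results, key=lambda r: r[1], reverse=True)


if __name__ == "__main__":
    hits = [
        ("http://example.org/archive/msg1.html", 10.0),
        ("http://example.org/docs/install.html", 4.0),
    ]
    print(apply_result_hacks(hits, "archive,docs"))
```

Keeping each hack as a function from a result list to a result list is
what would let the order be changed from the query string while tuning.
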
  As I said, I've got (wrongly-integrated) code for the two tweaks I
mentioned, which I can try to get into a presentable state, and I might
be able to find time semi-soon to do the work for the general
result_hacks parameter, if there are people who think either of those
would be worthwhile.  Are there such people?