From: Andrew M. <ch...@de...> - 2004-05-13 16:46:04
On Thu, 2004-05-13 at 08:58, Lachlan Andrew wrote:
> My impression is that the ht://Dig project is basically dead :( The
> existing code is of course still functional (and thanks Jim and
> Gilles for all the support you give to the users!), but I don't think
> there is enough enthusiasm to either release a new version, either
> 3.2 or 3.3. If I get enthusiastic in the next couple of weeks, I
> might still try to put 3.2.0b6 together, but that is about as far as
> it will go...

As a user, I'm very sorry to hear this -- I just deployed 3.2.0b5 on a site I administrate, and I've been very pleased with it.

I've been waiting until the 3.2.0 release cycle was over to start trying to contribute some of my tweaks (I also need to talk to my employer about the legalities first), but I guess if the project is stagnating I should speak up (there's also the possibility that the imminent death of htdig just makes this extra silly, of course... *shrug*). In any case:

So htdig does a bad job when multiple documents match a search in similar ways; this shows up particularly when your search query matches part of the header or footer of a section of your site, or when your search results include threads from a mailing list archive (in which case messages within a thread often show up consecutively in the results, which adds a lot of noise).

I wrote some code (shoehorned in as a ScoreMatch, more for easy control by the 'sort' parameter than for any logical reason) which sorts the results once, then reduces the score of any match which is similar to matches that are higher in the list, then re-sorts the results; thus the high-ranked results that are returned tend to be more unique than otherwise. This is marginally helpful with the header/footer problem (though the excerpts are still usually identical in that case), and very helpful with the mailing-list-thread problem. AFAICT it doesn't do too much harm to the results in the normal case.
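
To make the sort / demote / re-sort idea concrete, here is a rough Python sketch. It is not the ScoreMatch code described above: the `Match` record, the word-overlap (Jaccard) similarity on excerpts, and the `penalty` knob are all assumptions made for illustration, since the message does not say how similarity is measured.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Match:
    """Hypothetical search hit; not an htdig data structure."""
    url: str
    score: float
    excerpt: str


def jaccard(a: str, b: str) -> float:
    """Word-set overlap in [0, 1]; an assumed stand-in for 'similar'."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


def demote_similar(matches: List[Match], penalty: float = 0.5) -> List[Match]:
    """Sort once, demote hits that resemble higher-ranked hits, then re-sort."""
    ranked = sorted(matches, key=lambda m: m.score, reverse=True)
    adjusted = []
    for i, m in enumerate(ranked):
        # Strongest resemblance to anything ranked above this hit.
        closest = max((jaccard(m.excerpt, h.excerpt) for h in ranked[:i]),
                      default=0.0)
        # An identical excerpt loses `penalty` of its score; a distinct one loses nothing.
        adjusted.append(Match(m.url, m.score * (1.0 - penalty * closest), m.excerpt))
    return sorted(adjusted, key=lambda m: m.score, reverse=True)
```

With this shape, a second message from the same mailing-list thread (near-identical excerpt) drops below a lower-scored but distinct page, while results that share little text with the ones above them keep their original order.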

We also found it beneficial to tweak results' scores by matching their URLs against a handmade list of URL pieces and score-hacking factors (mailing list archives are mediocre, IRC archives are usually unhelpful, a particular section of documentation is generally very useful) -- I know this is gross, but it did wonders for the effectiveness of our search results, and a coworker of mine convinced me that it's not totally against nature -- humans really do have special knowledge of which sections of a site are generally "good," and with an hour or so of tweaking we got things in a state where close results from a "bad" section are presented above loose results from a "good" section when appropriate (more or less).

It seems to me that it would be useful to generalize these little hacks into a search parameter listing which hacks should be applied; for example, to select the two score hacks described in the above paragraphs you could specify 'result_hacks=unique,urlmatch' in the search query or htdig.conf. htdig already has a couple of result hacks that could fit into this scheme (backlink_factor and date_factor), and I can think of one more at least that I'd like to add in my copious free time. It certainly would seem right to me to be able (a) to add stuff like the above tweaks to the codebase without forcing everyone to care about it, and (b) to test, tweak, and reorder the scoring hacks from a query parameter while trying to get things configured to work well.

As I said, I've got (wrongly-integrated) code for the two tweaks I mentioned, which I can try to get into a presentable state, and I might be able to find time semi-soon to do the work for the general result_hacks parameter, if there are people that think either of those would be worthwhile. Are there such people?
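
For anyone trying to picture the proposed result_hacks parameter, a minimal Python sketch of the dispatch might look like the following. Nothing here exists in htdig: the `HACKS` registry, the `(url, score)` tuples, the URL pieces and factors, and the simplified `unique` stand-in (which merely halves the score of a hit whose parent path was already seen) are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Result = Tuple[str, float]  # (url, score); hypothetical shape

# Handmade URL pieces and multiplicative factors; the values are made up.
URL_FACTORS = [
    ("/lists/", 0.6),        # mailing list archives: mediocre
    ("/irc/", 0.3),          # IRC archives: usually unhelpful
    ("/docs/guide/", 1.5),   # a documentation section that tends to be useful
]


def urlmatch(results: List[Result]) -> List[Result]:
    """Scale each score by every factor whose URL piece occurs in the URL."""
    out = []
    for url, score in results:
        for piece, factor in URL_FACTORS:
            if piece in url:
                score *= factor
        out.append((url, score))
    return out


def unique(results: List[Result]) -> List[Result]:
    """Simplified stand-in: halve hits whose parent path already appeared higher up."""
    seen = set()
    out = []
    for url, score in sorted(results, key=lambda r: r[1], reverse=True):
        parent = url.rsplit("/", 1)[0]
        if parent in seen:
            score *= 0.5
        seen.add(parent)
        out.append((url, score))
    return out


HACKS: Dict[str, Callable[[List[Result]], List[Result]]] = {
    "unique": unique,
    "urlmatch": urlmatch,
}


def apply_result_hacks(results: List[Result], spec: str) -> List[Result]:
    """Apply the hacks named in a comma-separated spec, in the order given."""
    for name in filter(None, (n.strip() for n in spec.split(","))):
        if name not in HACKS:
            raise ValueError(f"unknown result hack: {name!r}")
        results = HACKS[name](results)
    return sorted(results, key=lambda r: r[1], reverse=True)
```

Because the hacks run in the order listed, 'unique,urlmatch' and 'urlmatch,unique' can rank results differently, which is the kind of reordering experiment point (b) has in mind. Because the URL factors are multiplicative, a strong match in a "bad" section can still outrank a weak match in a "good" one.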