[htdig-dev] Should htmerge -m combine link descriptions?

Hi, folks.  I've been giving some thought to how htmerge -m works when
it finds the same URL in both databases.  Right now, it just tosses out
the older docdb record and keeps the newer one.  However, it occurs to
me that this could cause a loss of information that's collected during
a full dig, if you then merge in a more recent partial dig.  A full
dig would likely harvest more link descriptions for a given URL, and
a higher backlink count, than would a partial dig.  So, if the record
from the partial dig is more recent, it will clobber the more complete
information from the corresponding record in the full dig.

It seems to me that htmerge should look at both DocumentRef records
and take the higher backlink count, as well as combining all the link
description text (weeding out duplicates, presumably).  I guess it would
then also need to generate new wordlist entries for any new description
words for the new DocID.  Does this make sense?
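
To make that concrete, here is a rough sketch of the merge rule I have in
mind.  It's Python rather than htdig's C++, purely for brevity, and
DocRecord, merge_records and the field names are made up for illustration;
they are not the real DocumentRef interface.

```python
from dataclasses import dataclass, field


@dataclass
class DocRecord:
    """Hypothetical stand-in for a DocumentRef; not htdig's real class."""
    url: str
    mod_time: int                      # when this record was dug
    backlinks: int = 0
    descriptions: list = field(default_factory=list)


def merge_records(a, b):
    """Keep the newer record, but preserve link information from both.

    Returns (merged, added), where `added` lists the descriptions that the
    newer record did not already have.  Those are the ones whose words
    would need fresh wordlist entries.
    """
    newer, older = (a, b) if a.mod_time >= b.mod_time else (b, a)

    # Combine descriptions, newer ones first, weeding out duplicates.
    seen = set()
    merged_desc = []
    for text in newer.descriptions + older.descriptions:
        if text not in seen:
            seen.add(text)
            merged_desc.append(text)

    already_known = set(newer.descriptions)
    added = [text for text in merged_desc if text not in already_known]

    merged = DocRecord(
        url=newer.url,
        mod_time=newer.mod_time,
        backlinks=max(a.backlinks, b.backlinks),  # take the higher count
        descriptions=merged_desc,
    )
    return merged, added
```

With a full-dig record carrying 5 backlinks and a newer partial-dig record
carrying 1, the merged record keeps the newer timestamp but reports 5
backlinks and the union of both description lists.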

This occurred to me as I was thinking about how htdig handles HTTP
redirects.  In that case, it transfers all the old pre-redirect
descriptions to the new redirected URL's DocumentRef.  It also takes
the smaller of the two hop counts, but it doesn't take the larger of
the two backlink counts, which strikes me as a bit of a bug there too.
Am I wrong?
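
For the redirect case, the behaviour I'd expect looks roughly like the
sketch below.  Again this is illustrative Python, not htdig code: RefInfo
and fold_redirect are hypothetical names, and skipping duplicate
descriptions is my own assumption rather than something I've checked
against what htdig does today.

```python
from dataclasses import dataclass, field


@dataclass
class RefInfo:
    """Hypothetical subset of the per-URL data discussed above."""
    hop_count: int
    backlinks: int
    descriptions: list = field(default_factory=list)


def fold_redirect(source, target):
    """Fold the pre-redirect URL's data into the redirected URL's record."""
    # Transfer the old descriptions (duplicate check is an assumption).
    for text in source.descriptions:
        if text not in target.descriptions:
            target.descriptions.append(text)

    # Smaller hop count, as described above for the current behaviour.
    target.hop_count = min(source.hop_count, target.hop_count)

    # Proposed addition: also keep the larger backlink count.
    target.backlinks = max(source.backlinks, target.backlinks)
    return target
```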

-- 
Gilles R. Detillieux              E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)