From: Gilles D. <gr...@sc...> - 2002-09-20 22:54:36
Hi, folks. I've been giving some thought to how htmerge -m works when it
finds the same URL in both databases. Right now, it just tosses out the
older docdb record and keeps the newer one. However, it occurs to me that
this could cause a loss of information that's collected during a full dig,
if you then merge in a more recent partial dig. A full dig would likely
harvest more link descriptions for a given URL, and a higher backlink
count, than would a partial dig. So, if the record from the partial dig
is more recent, it will clobber the more complete information from the
corresponding record in the full dig.

It seems to me that htmerge should look at both DocumentRef records and
take the higher backlink count, as well as combining all the link
description text (weeding out duplicates, presumably). I guess it would
then also need to generate new wordlist entries for any new description
words for the new DocID. Does this make sense?

This occurred to me as I was thinking about how htdig handles HTTP
redirects. In that case, it transfers all the old pre-redirect
descriptions to the new redirected URL's DocumentRef. It also takes the
smaller of the two hop counts, but it doesn't take the larger of the two
backlink counts, which strikes me as a bit of a bug there too. Am I wrong?

-- 
Gilles R. Detillieux              E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)
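
The merge rule proposed above for htmerge -m could be sketched roughly as follows. This is only an illustration in Python, not ht://Dig code: `DocRecord`, its field names, and `merge_records` are hypothetical stand-ins for DocumentRef and whatever htmerge would actually do. Taking the smaller hop count is an assumption borrowed from the redirect handling described in the message; the proposal itself only mentions the backlink count and the descriptions.

```python
from dataclasses import dataclass, field, replace


@dataclass
class DocRecord:
    """Hypothetical stand-in for a DocumentRef; not ht://Dig's real class."""
    url: str
    modified: int          # when the dig produced this record (assumed field)
    backlinks: int = 0
    hop_count: int = 0
    descriptions: list = field(default_factory=list)


def merge_records(a, b):
    """Merge two records for the same URL instead of discarding the older one.

    Returns the merged record plus the descriptions the newer record lacked,
    since those are the ones that would need new wordlist entries.
    """
    if a.url != b.url:
        raise ValueError("records describe different URLs")
    newer, older = (a, b) if a.modified >= b.modified else (b, a)

    # Combine link descriptions, newer first, dropping duplicates in order.
    merged = list(dict.fromkeys(newer.descriptions + older.descriptions))
    added = [d for d in merged if d not in newer.descriptions]

    result = replace(
        newer,                                      # other fields: newer wins
        backlinks=max(a.backlinks, b.backlinks),    # keep the higher count
        hop_count=min(a.hop_count, b.hop_count),    # assumption: as for redirects
        descriptions=merged,
    )
    return result, added


# A full dig followed by a more recent partial dig of the same URL.
full = DocRecord("http://example.com/", modified=100, backlinks=7,
                 hop_count=1, descriptions=["Home", "Main page"])
partial = DocRecord("http://example.com/", modified=200, backlinks=2,
                    hop_count=3, descriptions=["Home"])

merged, added = merge_records(full, partial)
```

Here the merged record keeps the partial dig's newer timestamp but retains the full dig's backlink count of 7 and both descriptions, and `added` holds "Main page", the one description whose words would still need indexing under the surviving DocID.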