From: Gilles D. <gr...@sc...> - 2003-10-28 22:04:02
Yesterday, I wrote:
> I've pretty much run out of time to help out with this release, but before
> I leave you, I thought I'd submit the following patch for your testing and
> approval. It should fix the duplicate URL problem in htsearch collections,
> in bug #504087. I'm not sure what sort of performance impact it will have
> on large databases with lots of potential matches, though. That may be
> something to consider/test for.
>
> Cheers,
> Gilles
>
> --- htsearch/Display.cc.orig 2003-10-25 07:40:23.000000000 -0500
> +++ htsearch/Display.cc 2003-10-27 17:55:52.000000000 -0600
> @@ -1895,6 +1895,27 @@ Display::sort(List *matches)
> qsort((char *) array, numberOfMatches, sizeof(ResultMatch *),
> array[0]->getSortFun());
>
> + // In case there are duplicate URLs across collections, keep "best" ones
> + // after sorting them.
> + Dictionary goturl;
> + String url;
> + int j = 0;
> + for (i = 0; i < numberOfMatches; i++)
> + {
> + Collection *collection = array[i]->getCollection();
> + DocumentRef *ref = collection->getDocumentRef(array[i]->getID());
> + url = ref->DocURL();
> + HtURLRewriter::instance()->replace(url);
> + if (goturl.Exists(url))
> + delete array[i];
> + else
> + {
> + array[j++] = array[i];
> + goturl.Add(url, 0);
> + }
> + }
> + numberOfMatches = j;
> +
> const String st = config->Find("sort");
> if (!st.empty() && mystrncasecmp("rev", st, 3) == 0)
> {
I've had just a few more quick thoughts about this patch which I thought
I should share with you folks.
1) I think there should probably be an explicit delete of "ref" and
"collection" at the bottom of the for loop, to prevent a memory leak.
I don't think C++ would do that automatically, would it?
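Right - C++ won't free those automatically; a raw pointer returned from a lookup call is leaked unless the caller deletes it. A minimal, self-contained sketch of the situation (the class and function names here are hypothetical stand-ins, not the real htdig API):

```cpp
#include <string>

// Hypothetical stand-in for htdig's DocumentRef: counts live
// instances so a leak would be directly observable.
struct Ref {
    std::string url;
    static int live;
    explicit Ref(std::string u) : url(std::move(u)) { ++live; }
    ~Ref() { --live; }
};
int Ref::live = 0;

// Mimics Collection::getDocumentRef(): hands back an owned heap
// pointer. C++ does not reclaim it when the pointer goes out of scope.
Ref *getDocumentRef(const std::string &url)
{
    return new Ref(url);
}

// One loop iteration, with the explicit delete that point 1) asks about.
void useRef(const std::string &url)
{
    Ref *ref = getDocumentRef(url);
    // ... read ref->url, build the result ...
    delete ref;  // omit this line and Ref::live grows by one per call
}
```

Whether the Collection pointer also needs deleting depends on who owns it; if collections are shared objects looked up from a registry rather than allocated per call, deleting them here would be wrong.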
2) This strikes me as a potentially inefficient approach to this problem,
as it would require a second fetching of every DocumentRef for all
matching documents. In the 3.1.6 code, the URL is available in the
ResultMatch class, so the collections.0 patch for 3.1.6 was able to do
this much less expensively. It may be that a better way would be to
catch this in Display::buildMatchList(), as the list of results is being
first built, and to somehow merge the results from the collections such
that the best score for a given URL is retained. That would be a bit
more complicated than my quick adaptation of the 3.1.6 patch above, and
I don't have the time now to do this properly. Any other takers?
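For whoever picks this up, the "keep the best score per URL" idea reduces to: sort best-first, then keep only the first occurrence of each URL. A self-contained sketch of that technique on plain values (Match is a hypothetical struct, not htdig's ResultMatch):

```cpp
#include <algorithm>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical simplified match record.
struct Match {
    std::string url;
    int score;
};

// Sort by score descending, then keep only the first (i.e. best)
// match seen for each URL -- the same net effect as the patch above,
// but without re-fetching a DocumentRef per match.
std::vector<Match> dedupeBest(std::vector<Match> matches)
{
    std::stable_sort(matches.begin(), matches.end(),
                     [](const Match &a, const Match &b) {
                         return a.score > b.score;
                     });
    std::unordered_set<std::string> seen;
    std::vector<Match> out;
    for (const Match &m : matches)
        if (seen.insert(m.url).second)  // true only on first occurrence
            out.push_back(m);
    return out;
}
```

Doing this merge while the list is first built in Display::buildMatchList(), as suggested above, would avoid the extra pass entirely, but the set-based filter is the same either way.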
3) You may want to consider how serious bug #504087 is, and whether it
needs to be fixed before this release. I don't doubt that this is a bug,
and not just a feature request, but it may not be serious enough to risk
introducing a potentially performance-impacting fix for it this close to
release. On the other hand, if you can come up with a quick, reliable
and "inexpensive" fix for this, then please do.
Cheers,
Gilles
--
Gilles R. Detillieux E-mail: <gr...@sc...>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba Winnipeg, MB R3E 3J7 (Canada)