From: Dennis W. <den...@mi...> - 2005-06-28 21:58:20
Hello All,

I am using HTDig 3.1.6 on a large web site that has many aliases for pages, so different URLs point to the same content. This causes duplicate search results, since HTDig uses the URL as the unique id. People are also inconsistent about how they write URLs, so http://www.military.com/spouse and http://www.military.com/spouse/ (note the trailing slash) come up as separate results as well.

I have tried a few things, such as search_rewrite_rules ( search_rewrite_rules: http://(.*)/$ http://\\1 ), but the regex was too greedy and htsearch still displayed duplicate results. My next guess is url_rewrite_rules, but I am unsure how to write the regexes, and whether htsearch will dedupe results whose URLs end up identical after rewriting.

How can I get htsearch to rewrite these URLs and dedupe the ones that end up being the same? Some of the URLs are very ugly and would require complex regexes. If I cannot do it within the HTDig framework, I may have to htdump the indexes created by htdig, post-process the dump files with a perl script that munges the URLs as needed, and then load and merge the new indexes. If that is not possible, I may have to munge the search results on the fly and suppress the dupes (ugh!)

Dennis Watson
dw...@mi...
UNIX System Administrator
Military.com
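[Follow-up sketch] If search_rewrite_rules only rewrites the URL that htsearch displays, after the result list has already been assembled, that would explain why the duplicates survived. In that case the rewrite has to happen at dig time, so aliases collapse to a single document before indexing. A minimal, untested config fragment, assuming the build supports the url_rewrite_rules attribute mentioned above (the regex is mine, not from the docs):

```
# htdig.conf fragment -- a sketch, not verified against 3.1.6
# Strip one trailing slash before the URL becomes the document id,
# so /spouse and /spouse/ index as the same page. The greedy (.*) is
# harmless here because the $ anchor pins the match to the end.
url_rewrite_rules: (.*)/$ \1
```

Any further alias patterns would be appended as additional regex/replacement pairs on the same attribute.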
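[Follow-up sketch] The fallback idea of post-processing the dump files (or the result list on the fly) can be sketched roughly as below. The poster mentions Perl; this is the same logic in Python, with hypothetical rule names, a trailing-slash rule taken from the post, and no real htdump parsing:

```python
import re

# Hypothetical rewrite table mirroring htdig's regex/replacement pairs.
# Only the trailing-slash rule comes from the post; site-specific rules
# for the "ugly" URLs would be added here.
REWRITE_RULES = [
    (re.compile(r"^(https?://.+?)/$"), r"\1"),  # strip one trailing slash
]

def normalize(url):
    """Apply each rewrite rule in order and return the canonical URL."""
    for pattern, repl in REWRITE_RULES:
        url = pattern.sub(repl, url)
    return url

def dedupe(urls):
    """Keep the first hit for each canonical URL, preserving rank order."""
    seen = set()
    kept = []
    for url in urls:
        canon = normalize(url)
        if canon not in seen:
            seen.add(canon)
            kept.append(canon)
    return kept

if __name__ == "__main__":
    hits = [
        "http://www.military.com/spouse",
        "http://www.military.com/spouse/",  # alias: trailing slash
    ]
    # Both forms collapse to one canonical URL.
    print(dedupe(hits))
```

The same normalize/dedupe pass would work whether it runs over htdump output before re-loading the index or over htsearch results just before display.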