From: <mic...@bt...> - 2005-07-03 11:13:14
I would have thought that the example that you give below should have
been handled by the http://www.htdig.org/attrs.html#remove_default_doc
setting. Have you looked into that?

As for the other part, if you know what the aliases are on the server
(can you copy them from a config file?) then you can probably use the
http://www.htdig.org/attrs.html#server_aliases setting.

Mike

> -----Original Message-----
> From: htd...@li...
> [mailto:htd...@li...] On Behalf Of Dennis Watson
> Sent: 28 June 2005 22:58
> To: 'htd...@li...'
> Subject: [htdig] Eliminating Duplicate Search Results
>
> Hello All,
>
> I am using HTDig 3.1.6 on a large web site that has many aliases for
> pages, so different URLs point to the same content. This is causing
> duplicate search results since HTDig is using the URL as the unique id.
> People are also not consistent with how they write URLs, so
> http://www.military.com/spouse and http://www.military.com/spouse/
> (note trailing slash) are coming up as different results as well.
>
> I have tried a few different things like search_rewrite_rules
> ( search_rewrite_rules: http://(.*)/$ http://\\1 ), but the regex was
> too greedy and htsearch displayed duplicate results anyway. My next
> guess is url_rewrite_rules, but I am unsure how to write the regexes
> and if htsearch will dedupe results with the same URL after rewriting.
>
> How can I get htsearch to rewrite these URLs and dedupe the ones that
> end up being the same? Some of the URLs are very ugly and would require
> complex regexes. If I cannot do it within the HTDIG framework, I may
> have to htdump indexes created by htdig, post-process the dump files
> with a perl script that munges the URLs as needed, and then load and
> merge the new indexes. If that is not possible I may have to munge the
> search results on the fly and not display the dupes (ugh!)
>
> Dennis Watson dw...@mi...
> UNIX System Administrator Military.com
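
For the fallback mentioned in the quoted message (munging URLs outside of ht://Dig and dropping the duplicates), the following is only a rough sketch of the idea, in Python rather than perl. It does not use any ht://Dig API or dump format. The alias table, the list of default document names, and the function names are all hypothetical and would have to be replaced with whatever the real server configuration uses.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical alias table (alias host -> canonical host); not taken from any real config.
HOST_ALIASES = {"military.com": "www.military.com"}

# Hypothetical default document names to strip from the end of a path.
DEFAULT_DOCS = ("index.html", "index.htm")

# Ports that add no information for their scheme.
DEFAULT_PORTS = {"http": 80, "https": 443}


def canonical_url(url):
    """Return one canonical spelling for URLs that differ only in
    host alias, default document name, or trailing slash."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    host = HOST_ALIASES.get(host, host)
    # Keep an explicit port only when it is not the scheme's default.
    if parts.port and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"
    path = parts.path
    for doc in DEFAULT_DOCS:
        if path.endswith("/" + doc):
            path = path[: -len(doc)]
            break
    # Treat /spouse and /spouse/ as the same resource; keep "/" for the root.
    path = path.rstrip("/") or "/"
    return urlunsplit((scheme, host, path, parts.query, ""))


def dedupe(urls):
    """Keep the first URL seen for each canonical form, preserving order."""
    seen = set()
    unique = []
    for url in urls:
        key = canonical_url(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique
```

With these assumptions, dedupe(["http://www.military.com/spouse", "http://www.military.com/spouse/"]) keeps only the first URL. Stripping every trailing slash is a simplification that suits the example in the quoted message; a site where the two forms really serve different content would need a narrower rule.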