From: Jim <li...@yg...> - 2004-08-28 03:43:38
On Fri, 27 Aug 2004 ian...@di... wrote:

> Just did my first run of rundig and noticed that the search results were
> bringing up duplicates like:
>
> http://www.digitalhit.com/cr/reneezellweger
> http://www.digitalhit.com/cr/reneezellweger/
>
> and
>
> http://www.digitalhit.com/academy/73/index.shtml
> http://www.digitalhit.com/academy/73/
>
> Any way to eliminate or weed those out?

For the second case, take a look at the following:

http://www.htdig.org/attrs.html#remove_default_doc

This attribute allows you to specify that index.shtml is to be treated as a
default document. Once you do that (and reindex), the index.shtml should be
stripped from the URL before the request is made. That should eliminate the
duplication.

For the first case, I am not certain what is happening. I suspect there is an
issue with the way the web server is configured. Typically, when a URL names a
directory but lacks the trailing slash, a web server responds with some sort
of "moved" status code (e.g. 301) and a pointer to the new location. For
example, a request for

http://www.digitalhit.com/cr/reneezellweger

should result in a moved status code and a new location of

http://www.digitalhit.com/cr/reneezellweger/

htdig will drop the first URL because of the returned status code and then
request the second. If both are being indexed in your case, the most likely
cause is that the web server is configured in a non-standard way (e.g. special
rewrite rules) and is returning the same document, with no redirect, for both
URLs.

Jim
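To illustrate the default-document case above: in the htdig config file the setting would presumably be a line along the lines of "remove_default_doc: index.html index.shtml" (check the attribute page linked above for the exact form your version accepts). The Python sketch below only mimics the effect such stripping should have on the two academy URLs. The helper name strip_default_doc and its default list are made up for this example, and this is not htdig's actual code.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical helper for illustration only; not how htdig is implemented.
def strip_default_doc(url, default_docs=("index.html", "index.shtml")):
    """Return url with a trailing default-document name removed from its path."""
    parts = urlsplit(url)
    head, sep, last = parts.path.rpartition("/")
    if sep and last in default_docs:
        # Keep the trailing slash so the result names the directory itself.
        parts = parts._replace(path=head + "/")
    return urlunsplit(parts)

urls = [
    "http://www.digitalhit.com/academy/73/index.shtml",
    "http://www.digitalhit.com/academy/73/",
]

# After stripping, both URLs collapse to the same directory URL.
unique = sorted({strip_default_doc(u) for u in urls})
print(unique)
```

Note that this does nothing for the first case (with and without the trailing slash): there the two URLs differ only by the slash, and it is the server's redirect, not a default-document rule, that normally folds them into one.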