The debug info indicates that you are successfully connecting to the
server hosting the site, but that server is choosing to tell you that
the pages do not exist. Is this a site that you control or have some
agreement with? One explanation that fits the facts is that the
server has been configured to deny you access and redirect your
requests to a 404 error page (perhaps based on IP address or
similar). One way to at least partially test this possibility would
be to ssh back to your server and try requesting the page with a text
browser. If there is a block based solely on IP address or
server/domain name, you should see the same 404 response.
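
If no text browser is installed on the server, a few lines of Python can stand in for one. This is only a rough sketch of the test described above, not what htdig does internally: it sends a request shaped like the one in the quoted debug output (HTTP/1.0, an htdig User-Agent, a Host header) and reports the status line that comes back. The function names are made up for illustration, and the User-Agent leaves out the parenthesized maintainer address because it is elided in the log. The Host value is passed separately so that the exact value shown in the quoted log can also be tried.

```python
import socket


def build_request(path, host_header, user_agent):
    """Build an HTTP/1.0 GET request shaped like the one in the htdig log."""
    lines = [
        f"GET {path} HTTP/1.0",
        f"User-Agent: {user_agent}",
        f"Host: {host_header}",
        "",  # blank line terminates the header block
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def parse_status_line(response):
    """Return (status_code, reason) from the first line of a raw response."""
    first_line = response.split(b"\r\n", 1)[0].decode("iso-8859-1")
    parts = first_line.split(" ", 2)
    if len(parts) < 2 or not parts[1].isdigit():
        raise ValueError(f"not an HTTP status line: {first_line!r}")
    reason = parts[2] if len(parts) > 2 else ""
    return int(parts[1]), reason


def fetch_status(hostname, path="/", host_header=None,
                 user_agent="htdig/3.1.6", port=80, timeout=10.0):
    """Send the request over a plain TCP socket and return the status."""
    request = build_request(path, host_header or hostname, user_agent)
    received = b""
    with socket.create_connection((hostname, port), timeout=timeout) as conn:
        conn.sendall(request)
        # Read only until the status line is complete.
        while b"\r\n" not in received:
            data = conn.recv(4096)
            if not data:
                break
            received += data
    return parse_status_line(received)


# Example (performs a real network request, so run it on the htdig host):
#   print(fetch_status("www.realtree.com"))
```

Running it both from the server and from a desktop machine, with the same arguments, shows whether the two get different answers for an identical request.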

If you are sure that you aren't being blocked, you might try copying
your config file, changing the start_url, and indexing some other
site just to make sure all the settings are sane.
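
The copy-and-edit step can be scripted if you would rather not edit by hand. The sketch below is an illustration only: the function name, the file paths, and the example.com test site are placeholders, and it assumes the usual "attribute: value" config layout with a trailing backslash for continued lines. Overriding database_dir as well is an extra precaution I am assuming here, so that the test run does not write over the real databases.

```python
import re


def override_attributes(config_text, overrides):
    """Return config_text with the given attributes replaced or appended."""
    out = []
    seen = set()
    skipping = False  # True while dropping continuation lines of an old value
    for line in config_text.splitlines():
        if skipping:
            skipping = line.rstrip().endswith("\\")
            continue
        match = re.match(r"\s*([A-Za-z_][\w-]*)\s*:", line)
        if match and match.group(1) in overrides:
            name = match.group(1)
            out.append(f"{name}:\t{overrides[name]}")
            seen.add(name)
            # The old value may continue on following lines; drop those too.
            skipping = line.rstrip().endswith("\\")
            continue
        out.append(line)
    # Attributes that were not present in the original are appended.
    for name, value in overrides.items():
        if name not in seen:
            out.append(f"{name}:\t{value}")
    return "\n".join(out) + "\n"


# Example with placeholder paths and a placeholder site:
#   with open("/path/to/htdig.conf") as src:
#       text = src.read()
#   with open("/tmp/htdig-test.conf", "w") as dst:
#       dst.write(override_attributes(text, {
#           "start_url": "http://www.example.com/",
#           "database_dir": "/tmp/htdig-test-db",
#       }))
```

The copy can then be tried with something like `htdig -i -vvv -c /tmp/htdig-test.conf`. If that other site indexes cleanly, the remaining settings are probably fine and the problem is specific to the original site.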

I tried using htdig (3.1.6) to start indexing this site and had no
problem retrieving pages with a nearly stock configuration.

On Mar 8, 2007, at 2:00 PM, Clint Davis wrote:
> I ran rundig from an ssh session to the server. I can pull up the
> first page from my desktop with no problem. I can also retrieve the
> robots.txt with no problem via my desktop browser.
>
> Any other ideas?
>
> On 3/8/07 2:51 PM, "Jim Cole" <lists@...> wrote:
>> For some reason htdig was unable to retrieve the first page from the
>> site in question. The server is claiming that the file does not exist
>> (404 response). If this happened only once, or always happens at
>> the same time of day, it might be due to a server problem, server
>> maintenance, etc. If it is happening all the time, a first
>> step would be to fire up a browser on the machine that runs htdig and
>> make sure you can load the page from there.
>>
>> The "DB2 problem..." message is just due to the fact there was
>> nothing in the database when htmerge ran.
>>
>> On Mar 8, 2007, at 9:46 AM, Clint Davis wrote:
>>> After using Htdig for years, I just noticed that one of my sites
>>> hasn't been indexed properly in a while.
>>>
>>> pick: http://www.realtree.com, # servers = 1
>>> 0:0:0:http://www.realtree.com/: Retrieval command for
>>> http://www.realtree.com/: GET / HTTP/1.0
>>> User-Agent: htdig/3.1.6 (webmaster@...)
>>> Host: http://www.realtree.com
>>> Header line: HTTP/1.1 404 Not Found
>>> htmerge: Sorting...
>>> htmerge: Removing doc #0
>>> DB2 problem...: missing or empty key value specified
>>> Deleted, no excerpt: 0/http://www.realtree.com/