Heritrix: Internet Archive Web Crawler

Tracker: Bugs

7 Refetching of robots and/or DNS broken - ID: 848661
Last Update: Comment added ( karl-ia )

When the (currently code-configured) robots.txt or DNS
IP validity period expires, the existing info should be
discarded and new fetches performed. This requires
special handling, as the current assumption is that no
URI should be fetched more than once.

While this special handling previously worked, it no
longer seems to. (Several of the pre-ISS eval
crawls experienced problems exactly 24 hours after they
began, and after upping the expiration period to 3
days, our ISS eval crawl #2 experienced problems at
exactly 3 days.)
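
The intended behavior can be sketched as below. This is only an illustration of the expiry rule described in this report, not Heritrix code: the class and method names (TimedInfo, RefetchingFrontier, shouldFetch, recordPrerequisiteFetch) are hypothetical, and the real frontier is considerably more involved.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical cache entry: when the info was fetched and how long it stays valid. */
final class TimedInfo {
    final long fetchedAtMs;
    final long validityMs;

    TimedInfo(long fetchedAtMs, long validityMs) {
        this.fetchedAtMs = fetchedAtMs;
        this.validityMs = validityMs;
    }

    boolean isExpired(long nowMs) {
        return nowMs - fetchedAtMs >= validityMs;
    }
}

/** Hypothetical frontier fragment: fetch-once by default, with an expiry exception. */
final class RefetchingFrontier {
    private final Set<String> alreadySeen = new HashSet<>();
    private final Map<String, TimedInfo> prerequisiteInfo = new HashMap<>();

    /** Records a completed fetch of an ordinary URI, which is never refetched. */
    void recordOrdinaryFetch(String uri) {
        alreadySeen.add(uri);
    }

    /** Records a completed fetch of a prerequisite URI (robots.txt or DNS lookup). */
    void recordPrerequisiteFetch(String uri, long nowMs, long validityMs) {
        alreadySeen.add(uri);
        prerequisiteInfo.put(uri, new TimedInfo(nowMs, validityMs));
    }

    /** Returns true if the URI should be fetched (or refetched) now. */
    boolean shouldFetch(String uri, long nowMs) {
        TimedInfo info = prerequisiteInfo.get(uri);
        if (info != null && info.isExpired(nowMs)) {
            // Discard the stale info and lift the fetch-once restriction for this URI.
            prerequisiteInfo.remove(uri);
            alreadySeen.remove(uri);
        }
        return !alreadySeen.contains(uri);
    }
}
```

With a 24-hour validity period, shouldFetch returns false for an already-fetched robots.txt until the period has elapsed and true afterwards, while ordinary URIs stay fetch-once.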


Submitted: Gordon Mohr ( gojomo ) - 2003-11-25 00:18
Priority: 7
Status: Closed
Resolution: Fixed
Assigned To: Gordon Mohr
Category: Manners
Group: 0.8.0
Visibility: Public


Comments ( 3 )

Date: 2007-03-14 00:06
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-42 -- please add further
comments at that location.


Date: 2004-04-14 22:27
Sender: gojomo (Project Admin)

Testing with artificially small values results in expected
behavior: DNS and robots info refetched at the right times.
Closing.


Date: 2004-02-20 03:00
Sender: johnerik (Project Admin)

Most of this is fixed, but I am not sure that the DNS expiry
is ever checked and I'm not sure where this should be done.
Probably the right place is in the PreConditionEnforcer. The
code to be added should be something like:
    if (<DNS.expired>) {
        curi.setForcedPrerequisiteUri(<DNS-uri>);
    }
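
A slightly fuller sketch of that check follows. Everything here is a stand-in written for illustration: SketchCrawlUri, SketchHostInfo and SketchPreconditionCheck are hypothetical classes, only the method name setForcedPrerequisiteUri comes from the pseudocode above, and the "dns:" + host form of the prerequisite URI is an assumption.

```java
/** Stand-in for a crawl URI; only the method named in the pseudocode is modeled. */
final class SketchCrawlUri {
    final String host;
    private String forcedPrerequisiteUri; // null when no prerequisite is forced

    SketchCrawlUri(String host) {
        this.host = host;
    }

    void setForcedPrerequisiteUri(String uri) {
        this.forcedPrerequisiteUri = uri;
    }

    String getForcedPrerequisiteUri() {
        return forcedPrerequisiteUri;
    }
}

/** Stand-in for cached host info (hypothetical fields). */
final class SketchHostInfo {
    final long ipFetchedAtMs;
    final long ipValidityMs;

    SketchHostInfo(long ipFetchedAtMs, long ipValidityMs) {
        this.ipFetchedAtMs = ipFetchedAtMs;
        this.ipValidityMs = ipValidityMs;
    }

    boolean isIpExpired(long nowMs) {
        return nowMs - ipFetchedAtMs >= ipValidityMs;
    }
}

final class SketchPreconditionCheck {
    /**
     * Returns true when the URI may proceed, or false when it must wait
     * for a forced DNS refetch because the cached IP has expired.
     */
    static boolean dnsPreconditionMet(SketchCrawlUri curi, SketchHostInfo host, long nowMs) {
        if (host.isIpExpired(nowMs)) {
            // Assumed URI form for the DNS prerequisite.
            curi.setForcedPrerequisiteUri("dns:" + curi.host);
            return false;
        }
        return true;
    }
}
```

Passing the current time in as a parameter keeps the check easy to exercise with artificially small validity periods.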



Changes ( 5 )

Field              Old Value  Date              By
status_id          Open       2004-04-14 22:28  gojomo
close_date         -          2004-04-14 22:28  gojomo
resolution_id      None       2004-04-14 22:27  gojomo
artifact_group_id  None       2004-03-31 01:12  gojomo
assigned_to        johnerik   2004-02-20 03:00  johnerik