Heritrix: Internet Archive Web Crawler

Tracker: Bugs

9 better/longer dns retries on lookup failure - ID: 1043251
Last Update: Comment added ( karl-ia )

From a Gordon note:

"Concern is that on the NARA test crawls, some of the domains that
didn't resolve on first try did resolve a few days later on retry.
Right now, a negative DNS lookup is considered final, so the other
retries don't occur. We'd like to treat dns-not-founds more like
connection-failures than like http-404s."

High priority because it came out of the NARA meeting.


Submitted: Michael Stack ( stack-sf ) - 2004-10-08 18:42
Priority: 9
Status: Closed
Resolution: Fixed
Assigned: Gordon Mohr
Category: None
Group: 1.0.6
Visibility: Public


Comments ( 9 )

Date: 2007-03-14 00:16
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-255 -- please add further
comments at that location.


Date: 2004-12-29 04:59
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

This is fixed, along with some related subtle bugs around the retry
process in BdbFrontier (such as fetchAttempts increments being
lost, so max-retries could never be met). Commit comment:

Fix for [ 1043251 ] better/longer dns retries on lookup failure
* AbstractFrontier.java
    Add S_DOMAIN_UNRESOLVABLE to statuses requiring standard
    retry delay
* BdbFrontier.java
    When an item is retried, flush its state to underlying
    queue, but only after clearing transient processing state.
* BdbWorkQueue.java
    Refactor enqueue() to enable flush-without-count-increment.
    (Important to keep running fetchAttempts counter up-to-date.)
    Also, synchronize count-changing operations, as accurate
    count is crucial to operation, and increment/decrement is
    not atomic.
* PreconditionEnforcer.java
    Ensure DNS URIs are never considered to need a DNS prereq
    (the cause of -6 dns failures observed in FR crawl)
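
For readers following the BdbWorkQueue note, here is a minimal sketch of the idea, not the committed code. The class and method names (WorkQueueSketch, flush, finish) are hypothetical; only the two properties named in the commit comment are modeled: a flush that updates an item's state without touching the count, and count changes guarded by synchronization because ++ and -- on a plain field are not atomic.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch; not the actual BdbWorkQueue implementation. */
final class WorkQueueSketch {
    // uri -> fetchAttempts recorded so far (stands in for the persisted item state)
    private final Map<String, Integer> attemptsByUri = new LinkedHashMap<>();
    private long count = 0;

    /** Adds a new item and counts it exactly once. */
    synchronized void enqueue(String uri) {
        if (attemptsByUri.putIfAbsent(uri, 0) == null) {
            count++;
        }
    }

    /** Writes updated state for an item that is already counted; count is unchanged. */
    synchronized void flush(String uri, int fetchAttempts) {
        attemptsByUri.replace(uri, fetchAttempts);
    }

    /** Removes a finished item and decrements the count. */
    synchronized void finish(String uri) {
        if (attemptsByUri.remove(uri) != null) {
            count--;
        }
    }

    synchronized long getCount() {
        return count;
    }

    /** Returns the recorded attempts, or -1 if the item is not queued. */
    synchronized int getFetchAttempts(String uri) {
        return attemptsByUri.getOrDefault(uri, -1);
    }
}
```

In this sketch a retried item keeps its place in the count while its fetchAttempts value survives the round trip, which is the property the commit comment describes as necessary for max-retries to ever be reached.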


Date: 2004-12-17 02:47
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

Reopening:

Igor reports on FR test crawls, DNS errors are coming
through as -6 rather than -1, not being retried at all, and
thus the seed is also failing as -6 with no retries.
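
The 2004-12-29 commit comment attributes these -6 results to DNS URIs themselves being treated as needing a DNS prerequisite. A minimal sketch of the kind of guard that avoids this is below. It is not the PreconditionEnforcer code: the class name, method name, and the hostResolved parameter are hypothetical, and the "dns:" scheme prefix is an assumption about how DNS URIs are written.

```java
/** Hypothetical sketch; not the actual PreconditionEnforcer logic. */
final class DnsPrereqSketch {
    /**
     * Decides whether a URI must wait for a DNS lookup before it can be fetched.
     * A DNS URI is the lookup itself, so it can never depend on another lookup.
     */
    static boolean needsDnsPrerequisite(String uri, boolean hostResolved) {
        if (uri.startsWith("dns:")) { // assumed scheme prefix for DNS URIs
            return false;
        }
        return !hostResolved;
    }
}
```

Without the first check, a failed lookup could be reported as a failed prerequisite of itself, which would match the symptom reported here: -6 instead of -1, and so no retries.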


Date: 2004-10-27 00:49
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

In HEAD, went further and split S_PREREQUISITE_FAILURE code
into 4 more specific codes:

/** DNS prerequisite failed, precluding attempt */
public static final int S_DOMAIN_PREREQUISITE_FAILURE = -6;

/** Robots prerequisite failed, precluding attempt */
public static final int S_ROBOTS_PREREQUISITE_FAILURE = -61;

/** Other prerequisite failed, precluding attempt */
public static final int S_OTHER_PREREQUISITE_FAILURE = -62;

/** Prerequisite could not be scheduled, precluding attempt */
public static final int S_PREREQUISITE_UNSCHEDULABLE_FAILURE = -63;

If/when the seeds report can be upgraded to better report on
items in progress, or when fetching a URI ultimately fails,
these will help give a better indication of exactly what
went wrong.
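
As a rough illustration of how a report could use the split codes, a lookup along these lines might work. The class and method are hypothetical and not part of the crawler. The numeric values are the ones listed above; the wording for -62 and -63 is inferred from the constant names.

```java
/** Hypothetical sketch; a report helper, not part of the crawler. */
final class PrereqFailureSketch {
    /** Maps a prerequisite-failure status to a short description for a report line. */
    static String describe(int status) {
        switch (status) {
            case -6:
                return "DNS prerequisite failed";
            case -61:
                return "robots prerequisite failed";
            case -62:
                return "other prerequisite failed";          // wording inferred from the name
            case -63:
                return "prerequisite could not be scheduled"; // wording inferred from the name
            default:
                return "not a prerequisite failure";
        }
    }
}
```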

Commit comment:

Fix & more for [ 1043251 ] better/longer dns retries on
lookup failure
* AbstractFrontier.java, HostQueuesFrontier.java
    Makes S_DOMAIN_UNRESOLVABLE a retryable failure
* FetchStatusCodes.java, CrawlURI.java
    Split S_PREREQUISITE_FAILURE into 4 distinct codes, for
    domain/robots/other/unschedulable failures
* PreconditionEnforcer.java, FetchHTTP.java, Postselector.java
    Use new, more-specific codes
* user_manual.xml
    Document updated codes



Date: 2004-10-22 02:20
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

Fix applied. Commit comment:

Fix for [ 1043251 ] better/longer dns retries on lookup failure
* Frontier.java
    Add S_DOMAIN_UNRESOLVABLE to those statuses triggering a
    delayed retry.



Date: 2004-10-21 22:05
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

Proposed fix works, but makes seeds-report somewhat less
useful: seeds whose DNS prereq is still being retried are
indistinguishable from those that DNS-resolved but where no
HTTP server answered for the URL or its robots.

Treating this as an acceptable limitation for now.


Date: 2004-10-21 19:16
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

Believe least-risk fix is to add S_DOMAIN_UNRESOLVABLE to
the retryable codes in Frontier.needRetrying(), so that DNS
failures get as many retries, with as much delay between
them, as HTTP connection failures.
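
A minimal sketch of what that check could look like follows. It is not the actual Frontier.needRetrying() code. The class name, method signature, and the connection-failure constants and their values are assumptions for illustration. The value -1 for S_DOMAIN_UNRESOLVABLE is taken from the 2004-12-17 comment ("-6 rather than -1").

```java
/** Hypothetical sketch; not the actual Frontier.needRetrying() implementation. */
final class RetryPolicySketch {
    static final int S_DOMAIN_UNRESOLVABLE = -1; // per the thread: DNS errors normally report -1
    static final int S_CONNECT_FAILED = -2;      // assumed value, for illustration
    static final int S_CONNECT_LOST = -3;        // assumed value, for illustration

    /** Returns true when a failed fetch should be rescheduled after the standard delay. */
    static boolean needsRetrying(int fetchStatus, int fetchAttempts, int maxRetries) {
        if (fetchAttempts >= maxRetries) {
            return false; // retry budget exhausted
        }
        switch (fetchStatus) {
            case S_DOMAIN_UNRESOLVABLE: // now handled like a connection failure
            case S_CONNECT_FAILED:
            case S_CONNECT_LOST:
                return true;
            default:
                return false; // e.g. an HTTP 404 stays final
        }
    }
}
```

Under these assumptions, a failed DNS lookup gets the same retry count and delay as a dropped connection, while a definitive answer such as a 404 is still not retried.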




Date: 2004-10-21 01:04
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

Changing to a top-priority bug, because current crawler
behavior is: once DNS fails, that host is marked as both
looked-up, and failed -- and from that point on, all URIs on
that host will fail. This happens even if DNS succeeded
earlier. So you can have a situation where, when DNS info
expires, and a relookup is attempted, a lookup failure
spoils all the still-pending deep URIs on the site.

This bit some long-running crawls today during a temporary
net outage.




Date: 2004-10-12 00:48
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Knocking down to normal priority. We have a workaround for
the immediate need: looking through the logs for failed lookups
and retrying those hosts in a later crawl.


Attached File

No Files Currently Attached

Changes ( 11 )

Field              Old Value         Date              By
close_date         2004-10-27 00:49  2004-12-29 04:59  gojomo
status_id          Open              2004-12-29 04:59  gojomo
status_id          Closed            2004-12-17 02:47  gojomo
status_id          Open              2004-10-27 00:49  gojomo
resolution_id      None              2004-10-27 00:49  gojomo
close_date         -                 2004-10-27 00:49  gojomo
artifact_group_id  None              2004-10-21 19:26  stack-sf
assigned_to        nobody            2004-10-21 19:16  gojomo
priority           5                 2004-10-21 01:04  gojomo
data_type          539099            2004-10-21 01:04  gojomo
priority           8                 2004-10-12 00:48  stack-sf