Heritrix: Internet Archive Web Crawler

Tracker: Bugs

5 prerequisite hysteresis/robots ahead of dns - ID: 1038135
Last Update: Comment added ( karl-ia )

First observed in experimental BdbFrontier, but looks
like it could affect HostQueuesFrontier, given new
TieredQueue approach, too.

If a robots.txt prerequisite -- already scheduled with
highest ('HIGH'/'immediate') priority -- happens to
trigger a DNS prerequisite, that prerequisite will (in
the current BdbFrontier and probably
HostQueuesFrontier, too) be enqueued 'behind' the
robots.txt.

In the BdbFrontier, this is because they both have
priority 0, meaning they are then ordered by
order-of-creation, where the robots.txt came first. In
the HostQueuesFrontier, currently using a 3-level
TieredQueue to allow priority-ordering of CrawlURIs,
this would happen because they both wind up in the
level-0 queue, ordered by first-enqueued. (Previously,
with the stack-based 'front-of-queue', prereqs were
pushed to the top, so most-recently-pushed was always
on top.)
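
To make the ordering problem concrete, here is a minimal sketch of a queue ordered by priority and then by order of creation, as described above. It is not Heritrix code: the `ScheduledUri` class, its fields, and the example host are hypothetical stand-ins, and a plain `java.util.PriorityQueue` takes the place of the frontier's real queues.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

final class SamePriorityDemo {
    // Order by priority first, then by order of creation (assumed model of
    // the behaviour described in this report, not the real frontier code).
    static final Comparator<ScheduledUri> ORDER =
            Comparator.comparingInt((ScheduledUri u) -> u.priority)
                      .thenComparingLong(u -> u.ordinal);

    // Returns the URI that would be dequeued first.
    static String firstOut() {
        PriorityQueue<ScheduledUri> queue = new PriorityQueue<>(ORDER);
        long ordinal = 0;
        // robots.txt is created first, at priority 0.
        queue.add(new ScheduledUri("http://example.com/robots.txt", 0, ordinal++));
        // The DNS prerequisite it triggers is created later, also at priority 0.
        queue.add(new ScheduledUri("dns:example.com", 0, ordinal++));
        return queue.peek().uri;
    }

    public static void main(String[] args) {
        // robots.txt comes out ahead of the DNS lookup it depends on.
        System.out.println(firstOut());
    }
}

// Hypothetical stand-in for a scheduled URI; not Heritrix's CrawlURI.
final class ScheduledUri {
    final String uri;
    final int priority;  // lower number = scheduled sooner
    final long ordinal;  // order of creation

    ScheduledUri(String uri, int priority, long ordinal) {
        this.uri = uri;
        this.priority = priority;
        this.ordinal = ordinal;
    }
}
```

With equal priorities the tie is broken by creation order, so the robots.txt fetch stays ahead of the DNS lookup it needs.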

Usually, this won't be a problem: a 'normal' URI
triggers the DNS lookup first, so DNS is always in
place before a robots.txt fetch. Specifying a
robots.txt seed also won't cause any problem -- in that
case, it's a normally-slotted URI, rather than an
'immediate' one. But if/when DNS expires and has to be
refetched, if it's a robots.txt that triggers the
refetch, there will be problems.

I believe the simplest fix will be to expand the
number of priority levels, and always schedule prereqs
as '1 higher' than their source CrawlURI.



Gordon Mohr ( gojomo ) - 2004-10-01 00:17

Priority: 5

Status: Closed

Resolution: Fixed

Assigned To: Gordon Mohr

None

None

Public


Comments ( 2 )

Date: 2007-03-14 00:16
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-253 -- please add further
comments at that location.


Date: 2004-10-20 23:14
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

Scheduling directives have been changed to 4-level int, and
when a URI is a prerequisite of another, it is given a
priority one level higher (lower number). Thus, there is
still typically room for a DNS to be a level before a
robots, etc.

Problem has not recurred in BdbFrontier or
HostQueuesFrontier since making this change. Closing.
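
The change described in this comment can be sketched roughly as follows. This is an illustration under stated assumptions, not the committed code: the level names, the `prerequisitePriority` helper, the clamping at the top level, and the `Entry` class are all hypothetical. The comment itself only establishes a four-level int scale where a prerequisite gets a lower number than its source.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

final class PrereqPriorityDemo {
    // Hypothetical names for a four-level int scale; lower number = sooner.
    static final int HIGHEST = 0;
    static final int HIGH = 1;
    static final int MEDIUM = 2;
    static final int NORMAL = 3;

    // A prerequisite is scheduled one level above the URI that triggered it.
    // Clamping at HIGHEST is an assumption for chains deeper than the scale.
    static int prerequisitePriority(int sourcePriority) {
        return Math.max(HIGHEST, sourcePriority - 1);
    }

    // Hypothetical queue entry; not Heritrix's CrawlURI.
    static final class Entry {
        final String uri;
        final int priority;
        final long ordinal;

        Entry(String uri, int priority, long ordinal) {
            this.uri = uri;
            this.priority = priority;
            this.ordinal = ordinal;
        }
    }

    // Returns the URI that would be dequeued first.
    static String firstOut() {
        PriorityQueue<Entry> queue = new PriorityQueue<>(
                Comparator.comparingInt((Entry e) -> e.priority)
                          .thenComparingLong(e -> e.ordinal));
        long ordinal = 0;
        int page = NORMAL;
        int robots = prerequisitePriority(page);   // one level above the page
        int dns = prerequisitePriority(robots);    // one level above robots.txt
        queue.add(new Entry("http://example.com/page.html", page, ordinal++));
        queue.add(new Entry("http://example.com/robots.txt", robots, ordinal++));
        queue.add(new Entry("dns:example.com", dns, ordinal++));
        return queue.peek().uri;
    }

    public static void main(String[] args) {
        // The DNS lookup now sorts ahead of robots.txt despite being created later.
        System.out.println(firstOut());
    }
}
```

Because each prerequisite takes a strictly lower number than its source, creation order no longer decides between robots.txt and its DNS lookup, as long as the chain does not run past the top level.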


Attached File

No Files Currently Attached

Changes ( 4 )

Field          Old Value  Date              By
status_id      Open       2004-10-20 23:14  gojomo
resolution_id  None       2004-10-20 23:14  gojomo
close_date     -          2004-10-20 23:14  gojomo
assigned_to    nobody     2004-10-12 02:09  gojomo