First observed in experimental BdbFrontier, but looks
like it could affect HostQueuesFrontier, given new
TieredQueue approach, too.
If a robots.txt prerequisite -- already scheduled with
highest ('HIGH'/'immediate') priority -- happens to
trigger a DNS prerequisite, that prerequisite will (in
the current BdbFrontier and probably
HostQueuesFrontier, too) be enqueued 'behind' the
robots.txt.
In the BdbFrontier, this is because they both have
priority 0, meaning they are then ordered by
order-of-creation, where the robots.txt came first. In
the HostQueuesFrontier, currently using a 3-level
TieredQueue to allow priority-ordering of CrawlURIs,
this would happen because they both wind up in the
level-0 queue, ordered by first-enqueued. (Previously,
with the stack-based 'front-of-queue', prereqs were
pushed to the top, so most-recently-pushed was always
on top.)
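
The ordering described above can be reproduced with a toy model. This is only a sketch, not Heritrix code: the class `ToyPriorityFrontier`, the example URIs, and the assumption that the robots.txt is still pending when the DNS prerequisite is scheduled are all made up for illustration. It compares a queue keyed by (priority, order-of-creation) with the older stack-based front-of-queue.

```python
import heapq
import itertools

ROBOTS = "http://example.com/robots.txt"  # hypothetical example URI
DNS = "dns:example.com"                   # hypothetical example URI


class ToyPriorityFrontier:
    """Toy queue ordered by (priority, order-of-creation); lowest sorts first."""

    def __init__(self):
        self._heap = []
        self._creation = itertools.count()

    def schedule(self, uri, priority):
        # Ties on priority are broken by creation order, oldest first.
        heapq.heappush(self._heap, (priority, next(self._creation), uri))

    def next_uri(self):
        return heapq.heappop(self._heap)[2]


# Priority-ordered queue: both prerequisites land at priority 0.
frontier = ToyPriorityFrontier()
frontier.schedule(ROBOTS, 0)  # robots.txt prerequisite, created first
frontier.schedule(DNS, 0)     # DNS prerequisite triggered by the robots.txt
queue_order = [frontier.next_uri(), frontier.next_uri()]

# Stack-based front-of-queue: the most recently pushed prerequisite is on top.
stack = []
stack.append(ROBOTS)
stack.append(DNS)
stack_top = stack.pop()

print(queue_order)  # robots.txt comes out before the DNS lookup it needs
print(stack_top)    # the DNS lookup comes out first
```

In the priority-ordered model the robots.txt is handed out before the DNS lookup it depends on, while the stack hands out the DNS lookup first.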
Usually, this won't be a problem: a 'normal' URI
triggers the DNS lookup first, so DNS is always in
place before a robots.txt fetch. Specifying a
robots.txt seed also won't cause any problem -- in that
case, it's a normally-slotted URI, rather than an
'immediate' one. But if/when DNS expires and has to be
refetched, if it's a robots.txt that triggers the
refetch, there will be problems.
I believe the simplest fix will be to expand the
number of priority levels, and always schedule prereqs
as '1 higher' than their source CrawlURI.
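
A minimal sketch of that proposal, again as a toy model rather than Heritrix code. The names `ToyFrontier` and `schedule_prerequisite`, the value of `NORMAL`, and the convention that a smaller number means a higher priority are assumptions made for illustration.

```python
import heapq
import itertools

# Hypothetical level for ordinary URIs; smaller numbers are dequeued first.
# Starting well above zero leaves room for chains of prerequisites.
NORMAL = 10


class ToyFrontier:
    """Toy queue ordered by (priority, order-of-creation)."""

    def __init__(self):
        self._heap = []
        self._creation = itertools.count()

    def schedule(self, uri, priority):
        heapq.heappush(self._heap, (priority, next(self._creation), uri))
        return priority

    def schedule_prerequisite(self, uri, source_priority):
        # A prerequisite always gets one level higher priority than the
        # URI that triggered it, so it can never queue up behind its source.
        return self.schedule(uri, source_priority - 1)

    def drain(self):
        order = []
        while self._heap:
            order.append(heapq.heappop(self._heap)[2])
        return order


frontier = ToyFrontier()
page = frontier.schedule("http://example.com/page.html", NORMAL)
robots = frontier.schedule_prerequisite("http://example.com/robots.txt", page)
dns = frontier.schedule_prerequisite("dns:example.com", robots)
order = frontier.drain()
print(order)
```

With this rule the DNS lookup triggered by the robots.txt is dequeued first, then the robots.txt, then the page, regardless of the order in which they were created.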
Gordon Mohr
Date: 2007-03-14 00:16
Date: 2004-10-20 23:14 Logged In: YES

| Field | Old Value | Date | By |
|---|---|---|---|
| status_id | Open | 2004-10-20 23:14 | gojomo |
| resolution_id | None | 2004-10-20 23:14 | gojomo |
| close_date | - | 2004-10-20 23:14 | gojomo |
| assigned_to | nobody | 2004-10-12 02:09 | gojomo |