Heritrix: Internet Archive Web Crawler

Tracker: Bugs

9 http and https prerequisites contention - ID: 1064887
Last Update: Comment added ( karl-ia )

In 1.0.5, http and https use the same CrawlServer instance.

Usually all goes fine if the first http or https URL that
comes out of the frontier gets all of its prereqs (dns and
robots). But if the first http or https URL out of the
frontier fails on its prereqs, the second URL and everything
associated with it fails too, even though the prereqs exist
for that second URL.

To reproduce, use one thread and two seeds,
https://www.army.mil/ and http://www.army.mil/. The
https seed needs to be first.
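
The contention described above can be illustrated with a minimal sketch. It assumes per-server prerequisite state is looked up by a string key; the class and method names here (ServerStateSketch, keyFor, markRobotsFailed, isBlocked) are hypothetical and are not Heritrix classes. With an authority-only key, a robots failure recorded for the https seed also blocks the http seed; with a key that includes the scheme, it does not.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; none of these names come from Heritrix.
class ServerStateSketch {
    /** Per-server prerequisite state (robots.txt only, for brevity). */
    static final class ServerState {
        boolean robotsFailed;
    }

    private final Map<String, ServerState> servers = new HashMap<>();
    private final boolean keyIncludesScheme;

    ServerStateSketch(boolean keyIncludesScheme) {
        this.keyIncludesScheme = keyIncludesScheme;
    }

    /** Builds the lookup key: authority alone, or scheme plus authority. */
    String keyFor(URI uri) {
        String authority = uri.getAuthority();
        return keyIncludesScheme ? uri.getScheme() + "://" + authority : authority;
    }

    private ServerState stateFor(URI uri) {
        return servers.computeIfAbsent(keyFor(uri), k -> new ServerState());
    }

    /** Records that robots.txt could not be fetched for the URI's server. */
    void markRobotsFailed(URI uri) {
        stateFor(uri).robotsFailed = true;
    }

    /** A URI is blocked when its server state records a failed robots prerequisite. */
    boolean isBlocked(URI uri) {
        return stateFor(uri).robotsFailed;
    }

    public static void main(String[] args) {
        URI https = URI.create("https://www.army.mil/");
        URI http = URI.create("http://www.army.mil/");

        ServerStateSketch shared = new ServerStateSketch(false);
        shared.markRobotsFailed(https);
        System.out.println("shared key, http blocked: " + shared.isBlocked(http));

        ServerStateSketch perScheme = new ServerStateSketch(true);
        perScheme.markRobotsFailed(https);
        System.out.println("per-scheme key, http blocked: " + perScheme.isBlocked(http));
    }
}
```

The sketch shows only the keying choice; how the real CrawlServer is looked up and what state it carries is not shown in this report.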


Michael Stack ( stack-sf ) - 2004-11-12 00:34

9

Closed

None

Michael Stack

None

None

Public


Comments ( 7 )

Date: 2007-03-14 00:18
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-290 -- please add further
comments at that location.


Date: 2004-11-15 18:12
Sender: stack-sf (Project Admin)

The patch below, by Gordon, stops multiple DNS queueings. (I saw
a few repeated DNS lookups and thought it was just one per queue,
but more likely there were multiple lookups prompted by https;
before this patch the DNS lookups were getting queued on the
http queue.)

I think the following (together with your change) will do the
trick. In addition to consulting the 'via' for DNS URIs, it also
sets the 'scheme' based on the 'via', so that the HTTPS-special
patching will occur for DNS URIs triggered by HTTPS URIs. There's
still the possibility that one DNS URI will be enqueued for each
queue on a particular host, which is slightly wasteful, but there
shouldn't be any chance of spinning multiple adds to a queue
other than the one triggering the add.

- Gordon


Index: src/java/org/archive/crawler/datamodel/CrawlURI.java
===================================================================
RCS file: /cvsroot/archive-crawler/ArchiveOpenCrawler/src/java/org/archive/crawler/datamodel/CrawlURI.java,v
retrieving revision 1.50.2.1
diff -u -r1.50.2.1 CrawlURI.java
--- src/java/org/archive/crawler/datamodel/CrawlURI.java    12 Nov 2004 00:41:46 -0000    1.50.2.1
+++ src/java/org/archive/crawler/datamodel/CrawlURI.java    13 Nov 2004 12:39:34 -0000
@@ -393,8 +393,10 @@
                 // the DNS lookup goes atop the host:port
                 // queue that triggered it, rather than
                 // some other host queue
-                candidate =
-                    UURIFactory.getInstance(flattenVia()).getAuthority();
+                UURI viaUuri = UURIFactory.getInstance(flattenVia());
+                candidate = viaUuri.getAuthority();
+                // adopt scheme of triggering URI
+                scheme = viaUuri.getScheme();
             } else {
                 candidate= FetchDNS.parseTargetDomain(this);
             }
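
The idea in the patch above, that a DNS URI takes both its authority and its scheme from the URI that triggered it, can be sketched in isolation. This is a minimal illustration using java.net.URI rather than Heritrix's UURI; the class DnsQueueKeySketch, its methods, and the "#https" key suffix are assumptions made for the example, not the real queue-key format.

```java
import java.net.URI;

// Hypothetical illustration only; the key format below is invented for the example.
class DnsQueueKeySketch {
    /** Queue key for a scheme and authority; https gets its own queue. */
    static String classKey(String scheme, String authority) {
        return "https".equals(scheme) ? authority + "#https" : authority;
    }

    /**
     * Queue key for a URI. A dns: URI with a known via adopts the via's
     * authority and scheme, so the lookup lands on the queue that triggered it.
     */
    static String classKeyFor(String uri, String via) {
        URI u = URI.create(uri);
        if ("dns".equals(u.getScheme())) {
            if (via != null) {
                URI viaUri = URI.create(via);
                return classKey(viaUri.getScheme(), viaUri.getAuthority());
            }
            // No via: fall back to the domain named in the dns: URI itself.
            return classKey("dns", u.getSchemeSpecificPart());
        }
        return classKey(u.getScheme(), u.getAuthority());
    }

    public static void main(String[] args) {
        System.out.println(classKeyFor("dns:www.army.mil", "https://www.army.mil/"));
        System.out.println(classKeyFor("dns:www.army.mil", "http://www.army.mil/"));
    }
}
```

Under these assumptions, a DNS lookup triggered by an https URI is keyed to the https queue rather than the http one, which is the behavior the patch comment describes.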


Date: 2004-11-12 01:43
Sender: stack-sf (Project Admin)

Didn't mean to reopen this.


Date: 2004-11-12 01:42
Sender: stack-sf (Project Admin)

Having one CrawlServer each for http and https means we do
two DNS lookups, one as a prereq for each. I'm guessing this
is fine.


Date: 2004-11-12 01:29
Sender: stack-sf (Project Admin)

Closing. 1.2.0 doesn't have this problem. Here's what I get
when I crawl using the above described scenario (Using one
thread, etc.):

20041112012841007 1 60 dns:www.macromedia.com XP http://www.macromedia.com/shockwave/download/ text/dns #001 726 - -
20041112012841469 1 65 dns:download.macromedia.com XP http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab text/dns #001 446 - -
20041112012841750 1 56 dns:www.arng.army.mil LP http://www.arng.army.mil/news/ text/dns #001 276 - -
20041112012842223 1 55 dns:www.us.army.mil LP https://www.us.army.mil/portal/portal_home.jhtml text/dns #001 467 - -
20041112012842531 1 61 dns:www2.arims.army.mil LP https://www2.arims.army.mil/rmdaxml/rmda/FPHomePage.asp text/dns #001 302 - -
20041112012842950 1 57 dns:cpol.army.mil LP http://cpol.army.mil/index.html text/dns #001 412 - -
20041112012843376 1 59 dns:www.armyg1.army.mil LP http://www.armyg1.army.mil/retire text/dns #001 417 - -
20041112012843637 200 515 http://www.army.mil/elements/images/homepage/headings/featurePhoto_active.gif E http://www.army.mil/ image/gif #001 248 VIU7LUD3OAPPFN7F2PGEHSOOUDCDWQIJ -
20041112012843644 -50 - http://www.macromedia.com/shockwave/download/ X http://www.army.mil/ no-type #001 - - 2t
20041112012843646 -50 - http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab X http://www.army.mil/ no-type #001 - - 2t
20041112012843649 -50 - http://www.arng.army.mil/news/ L http://www.army.mil/ no-type #001 - - 2t
20041112012843651 -50 - https://www.us.army.mil/portal/portal_home.jhtml L http://www.army.mil/ no-type #001 - - 2t
20041112012843653 -50 - https://www2.arims.army.mil/rmdaxml/rmda/FPHomePage.asp L http://www.army.mil/ no-type #001 - - 2t
20041112012843654 -50 - http://cpol.army.mil/index.html L http://www.army.mil/ no-type #001 - - 2t
20041112012843928 404 182 http://www4.army.mil/robots.txt EP http://www4.army.mil/OCPA/uploads/featureStory/Stryker2004-11-10.jpg text/html #001 273 HSGWPY3PGZD4ZSHBINIDRVLISDRSOBCK -
20041112012844032 200 569 http://www.macromedia.com/robots.txt XP http://www.macromedia.com/shockwave/download/ text/plain #001 101 WYSPMDKOX5TOQVSD57M5SBT6MQJEPZCF -
....


Date: 2004-11-12 00:49
Sender: stack-sf (Project Admin)

Here is Igor's mail on this topic:

With the www.army.mil example we found out that if
https://www.army.mil/robots.txt failed, all of
http://www.army.mil will fail with -6 errors.
The test was simple:
- 1 toe thread
- two seeds: https://www.army.mil and http://www.army.mil
- 3 short retries
- recovery log: F+ https://www.army.mil and F+ http://www.army.mil

But this does not explain why other hosts are failing with -6s
during recovery.

Also, we are seeing -5000 (out of scope) errors for a handful of
URLs at the beginning of recovery.
When is the scope being generated?
Does the recovery process know anything about which entries are
seeds?

i.

> This must be working in many/most cases, or nothing would be crawled
> after a recovery. I'll take a closer look at what combination of
> events might cause -6s in a recovery.
>
> The way it's supposed to work:
> - if no current robots.txt info is available, a robots URL is
>   scheduled, without regard to whether it was ever scheduled before
> - the robots.txt URL is before any other URLs on the same site,
>   so it's impossible for any other URLs to be tried, triggering
>   extra robots URLs, until the first robots.txt URL either succeeds
>   or fails-after-all-retries
>
> - Gordon
>
> Igor Ranitovic wrote:
>
>> It seems that there is a bug where, after a crawl is recovered, for
>> some hosts we don't fetch a new copy of robots.txt. So we are seeing
>> a lot of -6s (prerequisites failed).
>>
>> I tried adding robots.txt files to the seeds list hoping that they
>> would be crawled again, but no luck. They are being ignored, probably
>> because they were already crawled (added from the recovery log).
>> I tried force fetching them but those URIs seem to go to the back of
>> the queues. I am not sure how to "refresh" server state with new
>> robots.txt info :(
>>
>> So, I will let the crawls run until we find out what the problem is.
>> All URIs resulting in errors will be fetched again within a new
>> recovery.
>> i.
>>
>> P.S. I am not sure how we did not see this problem before.
>>
>>> Any idea why A2/crawling004 is so slow -- it almost looks like it
>>> might have been affected by a net outage.
>>>
>>> Was A3/crawling006 ever recovered onto the 2nd fast machine?
>>>
>>> - Gordon
>>
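
Gordon's quoted description of how the robots.txt prerequisite is supposed to work can be sketched as a single per-site queue. This is a simplified model under stated assumptions, not the Heritrix frontier: the class RobotsPrereqSketch and its methods are hypothetical, retries are left out, and one URL is assumed to be in flight at a time (as with one toe thread).

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical illustration only; not the Heritrix frontier implementation.
class RobotsPrereqSketch {
    enum Robots { UNKNOWN, AVAILABLE, FAILED }

    private final Deque<String> queue = new ArrayDeque<>();
    private final String robotsUrl;
    private Robots robots = Robots.UNKNOWN;

    RobotsPrereqSketch(String robotsUrl) {
        this.robotsUrl = robotsUrl;
    }

    /** Adds an ordinary URL to the back of the site's queue. */
    void schedule(String url) {
        queue.addLast(url);
    }

    /**
     * Returns the next URL to fetch, or null if the queue is empty.
     * While no robots info is available, the robots URL is put ahead of
     * everything else, whether or not it was ever scheduled before.
     */
    String next() {
        if (queue.isEmpty()) {
            return null;
        }
        if (robots == Robots.UNKNOWN && !robotsUrl.equals(queue.peekFirst())) {
            queue.addFirst(robotsUrl);
        }
        return queue.pollFirst();
    }

    /** Records the final outcome of the robots fetch (after all retries). */
    void robotsFinished(boolean fetched) {
        robots = fetched ? Robots.AVAILABLE : Robots.FAILED;
    }

    /** True when queued URLs would be reported as prerequisite failures (-6). */
    boolean prerequisitesFailed() {
        return robots == Robots.FAILED;
    }
}
```

In this model, scheduling http://www.army.mil/ and calling next() yields the robots URL first; the seed itself is handed out only after robotsFinished has been called.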



Date: 2004-11-12 00:42
Sender: stack-sf (Project Admin)

Committed on the 1.0.5 branch. Will test later to make sure
we don't have the same problem in HEAD.


Attached File

No Files Currently Attached

Changes ( 11 )

Field              Old Value         Date              By
assigned_to        nobody            2004-11-15 18:12  stack-sf
close_date         2004-11-12 01:29  2004-11-12 01:43  stack-sf
status_id          Open              2004-11-12 01:43  stack-sf
artifact_group_id  1.0.6             2004-11-12 01:42  stack-sf
status_id          Closed            2004-11-12 01:42  stack-sf
resolution_id      Fixed             2004-11-12 01:42  stack-sf
artifact_group_id  None              2004-11-12 01:29  stack-sf
close_date         -                 2004-11-12 01:29  stack-sf
status_id          Open              2004-11-12 01:29  stack-sf
resolution_id      None              2004-11-12 01:29  stack-sf
priority           5                 2004-11-12 00:35  stack-sf