
Heritrix: Internet Archive Web Crawler

Tracker: Bugs

ensure IPs match from DNS, used in HTTP, logged in ARC - ID: 1154673
Last Update: Comment added ( karl-ia )

Currently it is highly likely, but not guaranteed, that
the IP that was found from the logged DNS operation is
used for the following HTTP operations. Similarly, it
is highly likely, but not guaranteed, that the IP
written in an HTTP ARC record is the IP that was
contacted for the content.

Consider the scenario:
(1) DNS lookup is triggered, logged to ARC, noted in
CrawlHost instance.
(2) Later, HTTP fetch is attempted. Change for bug [
902970 ] (HTTPClient should use supplied IP / avoid DNS
lookup) ensures DNSJava cache is checked -- but this is
not necessarily the same IP as in CrawlHost instance.
(Caching TTLs may vary -- we use a minimum regardless
of what the DNS recommended. Or, even if they match,
there could be a small window between when
PreconditionEnforcer decides the existing IP is OK, and
when FetchHTTP checks the DNSJava cache.) So, the
actual IP contacted may be different than the DNS info
that was previously logged.
(3) When logging that HTTP response to ARC,
ARCWriterProcessor.getHostAddress() looks directly back
to CrawlHost, and so may log an IP that was not used.
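
The scenario above can be sketched as a small simulation. Every name in it (IpMismatchSketch, HostRecord, ResolverCache, fetchAndLog) is a hypothetical stand-in, not a Heritrix or DNSJava API. It only illustrates that two caches refreshed on different schedules can disagree about a host's IP.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical simulation of the scenario above; not actual Heritrix code.
class IpMismatchSketch {

    // Stand-in for the per-host record that remembers the logged DNS result.
    static class HostRecord {
        final String hostname;
        final String ip; // IP noted when the DNS lookup was logged, step (1)

        HostRecord(String hostname, String ip) {
            this.hostname = hostname;
            this.ip = ip;
        }
    }

    // Stand-in for a resolver-level cache that is refreshed independently.
    static class ResolverCache {
        private final Map<String, String> entries = new HashMap<>();

        void put(String hostname, String ip) {
            entries.put(hostname, ip);
        }

        String lookup(String hostname) {
            return entries.get(hostname);
        }
    }

    // Returns {ipContacted, ipLogged} for one simulated fetch.
    static String[] fetchAndLog(HostRecord host, ResolverCache cache) {
        String contacted = cache.lookup(host.hostname); // step (2): fetch asks the resolver cache
        String logged = host.ip;                        // step (3): writer asks the host record
        return new String[] { contacted, logged };
    }

    public static void main(String[] args) {
        HostRecord host = new HostRecord("example.org", "192.0.2.1");
        ResolverCache cache = new ResolverCache();
        cache.put("example.org", "192.0.2.1");
        // The resolver cache entry is replaced before the fetch happens.
        cache.put("example.org", "192.0.2.2");
        String[] result = fetchAndLog(host, cache);
        System.out.println("contacted=" + result[0] + " logged=" + result[1]);
    }
}
```

In this sketch the two values differ whenever the resolver cache is updated after the host record was filled in, which is the window the report describes.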

FetchHTTP MUST use an IP that was previously discovered
via a logged DNS response -- even if this requires us
to add new methods to HTTPClient to use a specified IP
address. (If that works properly, then the ARC issue
will resolve itself, but a way to be sure that the ARC
always shows the right IP would be for the HTTP
transaction to remember the IP it actually uses and
have the ARCWriterProcessor consult that value rather
than the CrawlHost cache).
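
A minimal sketch of the "remember the IP actually used" idea from the parenthetical above. The names (UsedIpSketch, FetchRecord, noteContactedIp, archiveHeaderLine) are hypothetical, and the output line is a simplified placeholder rather than the real ARC record header.

```java
import java.util.Objects;

// Hypothetical sketch of the remedy described above; not actual Heritrix code.
class UsedIpSketch {

    // Stand-in for per-fetch state that travels from the fetcher to the writer.
    static class FetchRecord {
        final String uri;
        private String contactedIp; // set when the connection is opened

        FetchRecord(String uri) {
            this.uri = uri;
        }

        void noteContactedIp(String ip) {
            this.contactedIp = Objects.requireNonNull(ip, "ip");
        }

        String getContactedIp() {
            return contactedIp;
        }
    }

    // Fetch step: connect to the IP from the logged DNS response and remember it.
    static void fetch(FetchRecord record, String loggedDnsIp) {
        // A real fetcher would open a socket to loggedDnsIp here.
        record.noteContactedIp(loggedDnsIp);
    }

    // Writer step: consult the fetch record, never a separate host cache.
    static String archiveHeaderLine(FetchRecord record) {
        if (record.getContactedIp() == null) {
            throw new IllegalStateException("no IP recorded for " + record.uri);
        }
        return record.uri + " " + record.getContactedIp();
    }

    public static void main(String[] args) {
        FetchRecord record = new FetchRecord("http://example.org/");
        fetch(record, "192.0.2.1");
        System.out.println(archiveHeaderLine(record));
    }
}
```

Because the writer reads the value stored by the fetch itself, a later change to any host-level cache cannot alter what gets logged for that response.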


Submitted: Gordon Mohr ( gojomo ) - 2005-03-01 23:21
Priority: 7
Status: Closed
Resolution: Fixed
Assigned To: Michael Stack
Category: Protocols
Group: None
Visibility: Public


Comments ( 2 )

Date: 2007-03-14 00:21
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-370 -- please add further
comments at that location.


Date: 2005-03-04 02:14
Sender: stack-sf (Project Admin)

We're using the IP that's cached in CrawlHost, though if it
ever came back null, we'd then go to DNS. I removed this
fallback; we now throw an IOException instead (it should
never happen).

Below is commit. Closing.

Address '[ 1154673 ] ensure IPs match from DNS, used in
HTTP, logged in ARC' concern.
* src/java/org/archive/crawler/fetcher/HeritrixProtocolSocketFactory.java
(getHostAddress): Remove going to DNS if failed to get
cached IP from CrawlHost. Throw an IOException instead.
* src/java/org/archive/crawler/fetcher/HeritrixSSLProtocolSocketFactory.java
(getHostAddress): Pass on the
HeritrixProtocolSocketFactory#getHostAddress IOException.
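
For illustration only, the behavior described in the commit message might look roughly like the sketch below. This is not the committed Heritrix source: CachedIpOnlySketch and HostRecord are hypothetical names, and the real getHostAddress signature may differ. The point is that a missing cached IP becomes an IOException and never triggers a fresh DNS lookup.

```java
import java.io.IOException;
import java.net.InetAddress;

// Hypothetical sketch of "cached IP or fail"; not the committed Heritrix code.
class CachedIpOnlySketch {

    // Stand-in for the per-host record holding the IP from the logged DNS lookup.
    static class HostRecord {
        final String hostname;
        final InetAddress ip; // null if no DNS result was ever recorded

        HostRecord(String hostname, InetAddress ip) {
            this.hostname = hostname;
            this.ip = ip;
        }
    }

    // Returns the cached IP, or throws instead of falling back to DNS.
    static InetAddress getHostAddress(HostRecord host) throws IOException {
        if (host == null || host.ip == null) {
            String name = (host == null) ? "<unknown host>" : host.hostname;
            throw new IOException("Failed to get cached IP for " + name);
        }
        return host.ip;
    }

    public static void main(String[] args) throws IOException {
        // getByAddress builds the address from raw bytes without a DNS lookup.
        InetAddress ip = InetAddress.getByAddress(new byte[] { (byte) 192, 0, 2, 1 });
        System.out.println(getHostAddress(new HostRecord("example.org", ip)).getHostAddress());
        try {
            getHostAddress(new HostRecord("example.net", null));
        } catch (IOException e) {
            System.out.println("refused: " + e.getMessage());
        }
    }
}
```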





Changes ( 4 )

Field          Old Value  Date              By
status_id      Open       2005-03-04 02:14  stack-sf
resolution_id  None       2005-03-04 02:14  stack-sf
close_date     -          2005-03-04 02:14  stack-sf
assigned_to    nobody     2005-03-02 19:21  gojomo