Heritrix: Internet Archive Web Crawler

Tracker: Feature Requests

7 Add referer header - ID: 1114133
Last Update: Comment added ( karl-ia )

Below is a message from the list by Dave Skinner:

I've noticed I'm getting different (and I hope better) crawl results
since I put the following code into FetchHTTP.java:

method.setRequestHeader("User-Agent", userAgent);
method.setRequestHeader("From", order.getFrom(curi));
/////////////////dave skinner
// rfc 2616 says no referer header if referer is https and the url
// is not
String via = curi.flattenVia();
if (!via.equals("") && via.startsWith("http:"))
    method.setRequestHeader("Referer", via);
/////////////////end dave skinner
// Set retry handler.
method.getParams().setParameter(HttpMethodParams.RETRY_HANDLER,
    new HeritrixHttpMethodRetryHandler());

This is working fine for me, but I'm sure someone can supply a case
where it should not be done *sigh*. So I suppose it should possibly be
wrapped with a parameter check.

However, maybe instead of a parameter check what should be done is to
check that there is no referer or referrer header in
ATTR_ACCEPT_HEADERS. If there is, suppress the automatic one;
otherwise always output it.

I'd be happy to code either (or both) of the above modifications and
test them.
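
The second idea (suppress the automatic header when the operator has already configured one) might look roughly like the sketch below. It is not Heritrix code: the configured ATTR_ACCEPT_HEADERS value is modeled here as a plain list of "Name: value" strings, which is an assumption, and the class and method names are hypothetical.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Heritrix.
class RefererHeaderCheck {

    /**
     * Returns true if any configured header line is named "Referer"
     * (or the "Referrer" spelling), compared case-insensitively.
     * Assumes each entry has the form "Name: value".
     */
    static boolean hasRefererHeader(List<String> configuredHeaders) {
        for (String header : configuredHeaders) {
            int colon = header.indexOf(':');
            if (colon < 0) {
                continue; // not a "Name: value" line, ignore it
            }
            String name = header.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            if (name.equals("referer") || name.equals("referrer")) {
                return true;
            }
        }
        return false;
    }

    /**
     * The automatic Referer is added only when there is a via URL
     * and the operator has not configured a Referer header already.
     */
    static boolean shouldAddAutomaticReferer(List<String> configuredHeaders, String via) {
        return via != null && !via.isEmpty() && !hasRefererHeader(configuredHeaders);
    }
}
```

In FetchHTTP the result of shouldAddAutomaticReferer would gate the method.setRequestHeader("Referer", via) call.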


Michael Stack ( stack-sf ) - 2005-02-01 17:16

7

Closed

None

Michael Stack

None

None

Public


Comments ( 10 )

Date: 2007-03-14 01:38
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-891 -- please add further
comments at that location.


Date: 2005-03-07 21:48
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Committed Dave's suggestion. It's on by default. Below is the
commit message (it is missing a note that Dave was responsible
for the Referer suggestion).

Implemented '[ 1114133 ] Add referer header' and
'[ 1143892 ] [contribution] SingleConnectionManager, range and close hdrs'.
Bulk of this commit is based on patch contributed by
Christian Kohlschuetter -- ck-heritrix at newsclub dot de
* src/java/org/archive/crawler/fetcher/FetchHTTP.java
  Added new defines for Range, Referer, Connection.
  Added new options to enable/disable the Connection:Close, Referer,
  and Range headers. Connection:close and Referer are on by default;
  Range is off by default.
  Using Christian's new ThreadedLocalHttpConnectionManager in place of
  MultiThreadedHttpConnectionManager (this change also has an effect
  on memory retention issues -- it's improving things).
* src/java/org/archive/httpclient/ThreadLocalHttpConnectionManager.java
  Added contribution by Christian. The difference between this SingleCM
  and the old one is that it adds timeouts and that, rather than
  waiting on a socket close before opening a new connection, it puts
  the socket close onto a queue for a background thread to close when
  it gets cycles.
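
The "queue the close for a background thread" idea in the commit message can be pictured roughly as below. This is only a sketch of the general technique, not Christian's ThreadLocalHttpConnectionManager code: the DeferredCloser class, its method names, and the use of java.util.concurrent are all assumptions made for illustration.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical illustration of deferred closing; not the Heritrix class.
class DeferredCloser {

    // Resources (e.g. sockets) waiting to be closed.
    private final BlockingQueue<Closeable> pending = new LinkedBlockingQueue<>();
    private final Thread worker;

    DeferredCloser() {
        worker = new Thread(() -> {
            try {
                while (true) {
                    // Blocks until something has been queued for closing.
                    Closeable resource = pending.take();
                    try {
                        resource.close();
                    } catch (IOException e) {
                        // A failed close is ignored in this sketch.
                    }
                }
            } catch (InterruptedException e) {
                // Interrupted: restore the flag and let the thread end.
                Thread.currentThread().interrupt();
            }
        }, "deferred-closer");
        worker.setDaemon(true); // do not keep the JVM alive for pending closes
        worker.start();
    }

    /** Hands the resource to the background thread and returns at once. */
    void closeLater(Closeable resource) {
        pending.add(resource);
    }
}
```

The caller returns from closeLater immediately, so a slow socket close no longer delays opening the next connection.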



Date: 2005-03-07 21:46
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Committed Dave's suggestion. It's on by default. Below is the
commit message (it is missing a note that Dave was responsible
for the Referer suggestion).

Implemented '[ 1114133 ] Add referer header' and
'[ 1143892 ] [contribution] SingleConnectionManager, range and close hdrs'.
Bulk of this commit is based on patch contributed by
Christian Kohlschuetter -- ck-heritrix at newsclub dot de
* src/java/org/archive/crawler/fetcher/FetchHTTP.java
  Added new defines for Range, Referer, Connection.
  Added new options to enable/disable the Connection:Close, Referer,
  and Range headers. Connection:close and Referer are on by default;
  Range is off by default.
  Using Christian's new ThreadedLocalHttpConnectionManager in place of
  MultiThreadedHttpConnectionManager (this change also has an effect
  on memory retention issues -- it's improving things).
* src/java/org/archive/httpclient/ThreadLocalHttpConnectionManager.java
  Added contribution by Christian. The difference between this SingleCM
  and the old one is that it adds timeouts and that, rather than
  waiting on a socket close before opening a new connection, it puts
  the socket close onto a queue for a background thread to close when
  it gets cycles.



Date: 2005-03-02 19:07
Sender: aboyko

Logged In: YES
user_id=911462

I suppose I ought to point out, for the record, that I tried
Dave's patch and got the results I expected.


Date: 2005-03-02 16:50
Sender: aboyko

Logged In: YES
user_id=911462

Here's my non-voting +1 vote for adding the Referer header.
We've found a site of interest (anntelnaes.com) that blocks
requests for images without a Referer header from the site.
(She's a political cartoonist, presumably trying to avoid
people linking directly to the content.) Defaulting the
use of the header to "on" makes sense; it's really hard to
conceive of a case where you'd want to turn it off, unless
WWGBD turns out to be "don't send it". I guess.


Date: 2005-02-10 00:38
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Chatting in the 1.4 release review meeting, we decided this
should be on by default.


Date: 2005-02-10 00:31
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Chatting in the 1.4 release review meeting, we decided this
should be on by default.


Date: 2005-02-09 21:53
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

Should definitely be an option. Perhaps should be on by
default. (WWGooglebotD?)


Date: 2005-02-09 19:48
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Upping priority (Submitted by outside party and easy enough
to do and offers possibility of improving crawler's standing
with crawled sites).


Date: 2005-02-01 21:01
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Below, Dave supplies more on a crawler with a Referer
getting more stuff:

At 03:40 PM 2/1/2005, you wrote:

>Dave Skinner wrote:
>
> > I've noticed I'm getting different (and I hope better) crawl
> > results since I put the following code into FetchHTTP.java
>
>Would be great if you could confirm that you are indeed getting
>better results.

see private reply to your archive.org address


> >
> > method.setRequestHeader("User-Agent", userAgent);
> > method.setRequestHeader("From", order.getFrom(curi));
> > /////////////////dave skinner
> > // rfc 2616 says no referer header if referer is https and
> > // the url is not
>
>Can you cite the section in rfc2616 where it says this please Dave
>(I was unable to find where it said this).

see sections 15.1.2 and 15.1.3

I'm basing my comment on the following from 15.1.3:

   Clients SHOULD NOT include a Referer header field in a (non-secure)
   HTTP request if the referring page was transferred with a secure
   protocol.

btw, I did not worry about my code suppressing the header when both
were https (my original code was a quick and dirty "try it and see
what happens" hack). If you are going to put my code in I should
expand it and do it right.
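
A fuller version of the check, one that still sends the header when both the referring page and the target are https, might look like the sketch below. It is an illustration under stated assumptions, not the code that was committed: the RefererPolicy class and shouldSendReferer method are hypothetical, URLs are treated as plain strings, and sending Referer for https-to-https is a policy choice that 15.1.3 does not forbid.

```java
import java.util.Locale;

// Hypothetical helper illustrating the RFC 2616 section 15.1.3 rule.
class RefererPolicy {

    /**
     * Decides whether a Referer header should accompany a request.
     *
     * @param via    URL of the referring page (may be null or empty)
     * @param target URL about to be fetched
     */
    static boolean shouldSendReferer(String via, String target) {
        if (via == null || via.isEmpty() || target == null) {
            return false; // nothing to report (e.g. a seed URL)
        }
        String v = via.toLowerCase(Locale.ROOT);
        String t = target.toLowerCase(Locale.ROOT);

        boolean viaSecure = v.startsWith("https:");
        boolean viaPlain = v.startsWith("http:");
        if (!viaSecure && !viaPlain) {
            return false; // only report http(s) referrers
        }
        boolean targetSecure = t.startsWith("https:");

        // Suppress only the secure -> non-secure case named in 15.1.3.
        return !(viaSecure && !targetSecure);
    }
}
```

With this rule an http referrer is always reported, an https referrer is reported only to an https target, and the original patch's behavior for http referrers is unchanged.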

>Being able to supply the 'referer' [sic] header makes sense to me. A
>quick googling has it that spiders don't often supply this header --
>or if they do, it's often with malicious intent to fill logs with
>spammed links -- but a referer is also useful to remote
>servers/administrators where the crawler is tripping over bad links.

my apache logs show that most (including google) do not include it. I
figure the next bunch (or maybe some current) bad php scripts will use
referrer to decide what content to respond with. At least that's the
way I'd do some of the things they are attempting to do with
500-character URLs

>A patch that disabled this feature by default but that allowed you to
>turn it on using an expert setting would be much appreciated.

I'll clean it up and send it to sourceforge

> I don't think you need worry about clashing with
>ATTR_ACCEPT_HEADER content. If operator wants to add a Referer header
>to the ATTR_ACCEPT_HEADER list, they can disable the automatic adding
>of referer feature.

OK, that makes it simpler

>(Here's an issue to cover the work:
>https://sourceforge.net/tracker/?group_id=73833&atid=539102&func=detail&aid=1114133).
>St.Ack



Dave Skinner dave at solid dot net
High Performance Programming---assembly (lots of them),
C, java, perl
Database and Non-trivial web site implementations
Real-time and embedded systems are my specialty


Attached File

No Files Currently Attached

Changes ( 4 )

Field        Old Value  Date              By
status_id    Open       2005-03-07 21:46  stack-sf
close_date   -          2005-03-07 21:46  stack-sf
assigned_to  nobody     2005-03-02 18:42  stack-sf
priority     5          2005-02-09 19:48  stack-sf