Heritrix: Internet Archive Web Crawler

Tracker: Feature Requests

3 Allow adding (subtracting?) http headers - ID: 945922
Last Update: Comment added ( karl-ia )

Date: Fri, 30 Apr 2004 21:59:41 -0400
To: Michael Stack <stack@archive.org>
Subject: Re: [archive-crawler] Cdx from arc files
Reply-To: tree@basistech.com
Return-Path: tree@basistech.com

Hi,

Is there a way I can add HTTP headers to the crawl? In particular I
would like to add an Accept-Language: header --- several of the sites
I want to crawl push English content by default unless told that the
client wants Arabic.

Thanks.

-tree

--
Tom Emerson
Basis Technology Corp.
Software Architect
http://www.basistech.com
"Beware the lollipop of mediocrity: lick it once and you suck forever"


To do the above, we would need a way of listing the headers to set,
and they would be included per request somewhere around here:


Index: src/java/org/archive/crawler/fetcher/FetchHTTP.java
===================================================================
RCS file: /cvsroot/archive-crawler/ArchiveOpenCrawler/src/java/org/archive/crawler/fetcher/FetchHTTP.java,v
retrieving revision 1.45
diff -u -r1.45 FetchHTTP.java
--- src/java/org/archive/crawler/fetcher/FetchHTTP.java 28 Apr 2004 01:42:04 -0000 1.45
+++ src/java/org/archive/crawler/fetcher/FetchHTTP.java 1 May 2004 07:20:41 -0000
@@ -200,6 +200,7 @@
                 curi.getUURI().getURIString(), rec);
         configureMethod(curi, method);
         boolean addedCredentials = populateCredentials(curi, method);
+        method.addRequestHeader(new Header());
         int immediateRetries = 0;
         while (true) {
             // Retry until success (break) or unrecoverable exception
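
To illustrate the note above ("a way of listing headers to set", included per request), here is a minimal, self-contained Java sketch. It is not the code that was committed for this issue. The `Request` class and the `applyExtraHeaders` method are hypothetical stand-ins so the example runs without Heritrix or HttpClient on the classpath; only the method name `addRequestHeader` is borrowed from the diff.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ExtraHeadersSketch {

    /** Hypothetical stand-in for an outgoing HTTP request (not a Heritrix or HttpClient class). */
    static final class Request {
        final Map<String, String> headers = new LinkedHashMap<>();

        void addRequestHeader(String name, String value) {
            headers.put(name, value);
        }
    }

    /** Copies every configured header onto the request before it is sent. */
    static void applyExtraHeaders(Request request, Map<String, String> extraHeaders) {
        for (Map.Entry<String, String> header : extraHeaders.entrySet()) {
            request.addRequestHeader(header.getKey(), header.getValue());
        }
    }

    public static void main(String[] args) {
        // Assumed configuration: one operator-supplied header, as in the request above.
        Map<String, String> configured = new LinkedHashMap<>();
        configured.put("Accept-Language", "ar");

        Request request = new Request();
        applyExtraHeaders(request, configured);
        System.out.println(request.headers);
    }
}
```

In the real fetcher the call would presumably sit where the diff adds its placeholder line, after the method has been configured and before the retry loop starts.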


Submitted: Michael Stack ( stack-sf ) - 2004-05-01 14:51
Priority: 3
Status: Closed
Resolution: None
Assigned: Michael Stack
Category: Configuration
Group: None
Visibility: Public


Comments ( 5 )

Date: 2007-03-14 01:30
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-770 -- please add further
comments at that location.


Date: 2004-07-07 15:51
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Committed patch submitted by Tom. I reviewed and tested it. Below is
the commit message. Closing this issue (let's open a new issue for the
case of 'subtracting' headers when someone actually asks for it).

Fix for [ 945922 ] Allow adding (subtracting?) http headers
This patch was contributed by Tom Emerson (tree at basistech dot com).
Here is what Tom has to say about the patch:

Attached is a patch to FetchHTTP.java that allows you to add Accept
headers to the crawl (actually, as it turns out, it will let you add
*any* header you want, even though the settings verbiage is Accept
specific). I use this when crawling sites that I know do language
negotiation, and I need text in a specific language.

The setting is made available in the HTTP section of the
'fetch-processors' settings when Expert Settings are shown.

* src/java/org/archive/crawler/fetcher/FetchHTTP.java
(accept-headers): New stringlist option added.
(maybeSetAcceptHeaders): Added.
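
For readers wondering what a string-list header option involves, the sketch below shows one way to turn entries such as "Accept-Language: ar" into name/value pairs. It is an assumption-laden illustration, not the committed `maybeSetAcceptHeaders` code: the entry format ("Name: value"), the class name, and the `parseHeaders` method are all hypothetical. It also shows why such a setting accepts any header name: nothing in the parsing is specific to Accept.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class AcceptHeaderParseSketch {

    /**
     * Splits each "Name: value" entry at the first colon.
     * Entries without a name or without a value are skipped rather than sent.
     */
    static List<String[]> parseHeaders(List<String> entries) {
        List<String[]> parsed = new ArrayList<>();
        for (String entry : entries) {
            int colon = entry.indexOf(':');
            if (colon <= 0) {
                continue; // no colon, or nothing before it
            }
            String name = entry.substring(0, colon).trim();
            String value = entry.substring(colon + 1).trim();
            if (name.isEmpty() || value.isEmpty()) {
                continue;
            }
            parsed.add(new String[] {name, value});
        }
        return parsed;
    }

    public static void main(String[] args) {
        // Assumed settings content; "bogus" stands for a malformed entry.
        List<String> setting = Arrays.asList("Accept-Language: ar", "bogus");
        for (String[] header : parseHeaders(setting)) {
            System.out.println(header[0] + " -> " + header[1]);
        }
    }
}
```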


Date: 2004-07-05 19:10
Sender: tree

Logged In: YES
user_id=37068

I have extended FetchHTTP.java to allow me to add arbitrary
accept headers to the fetcher configuration from the WUI.
Unfortunately I don't appear to be able to upload the patch
to this tracker item, though.



Date: 2004-07-02 23:09
Sender: tree

Logged In: YES
user_id=37068

IMHO opening up arbitrary header additions could be problematic, since
the crawler is already adding some of its own.

Minimally adding an "accept-language" header would be fine, I think. One
could open it up to multiple accept headers: this would be easily done,
actually.

I'm working on a patch to FetchHTTP.java for this since I have an itch...
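
The concern in this comment can be made concrete with a small Java sketch: accept only Accept-style headers from the operator, and never replace a header the crawler has already set. This is one possible policy under stated assumptions, not what the eventual patch does. The class name, `mergeAcceptHeaders`, and the example header values are hypothetical.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class AcceptOnlySketch {

    /**
     * Returns the crawler's headers plus any requested header whose name is
     * "Accept" or starts with "Accept-", unless the crawler already set a
     * header with the same name (compared case-insensitively).
     */
    static Map<String, String> mergeAcceptHeaders(Map<String, String> crawlerHeaders,
                                                  Map<String, String> requested) {
        Set<String> reserved = new HashSet<>();
        for (String name : crawlerHeaders.keySet()) {
            reserved.add(name.toLowerCase(Locale.ROOT));
        }
        Map<String, String> merged = new LinkedHashMap<>(crawlerHeaders);
        for (Map.Entry<String, String> header : requested.entrySet()) {
            String lower = header.getKey().toLowerCase(Locale.ROOT);
            boolean isAccept = lower.equals("accept") || lower.startsWith("accept-");
            if (isAccept && !reserved.contains(lower)) {
                merged.put(header.getKey(), header.getValue());
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Illustrative values only; not the headers Heritrix actually sends.
        Map<String, String> crawler = new LinkedHashMap<>();
        crawler.put("User-Agent", "example-crawler");

        Map<String, String> requested = new LinkedHashMap<>();
        requested.put("Accept-Language", "ar");
        requested.put("user-agent", "something-else"); // rejected: not an Accept header

        System.out.println(mergeAcceptHeaders(crawler, requested));
    }
}
```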


Date: 2004-05-05 21:20
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Lowered priority.


Attached File

No Files Currently Attached

Changes ( 3 )

Field       Old Value   Date               By
close_date  -           2004-07-07 15:51   stack-sf
status_id   Open        2004-07-07 15:51   stack-sf
priority    5           2004-05-05 21:20   stack-sf