Heritrix: Internet Archive Web Crawler

Tracker: Feature Requests

9 Allow operator-configured mid-HTTP-fetch filters - ID: 941072
Last Update: Comment added ( karl-ia )

User (Miles Crawford @ UW) wants to use Heritrix to
crawl only certain (text-oriented) documents. Of
course, for many URIs, you have to begin fetching the
document before you know its type. However, if it's not
of the desired type, it would be better to abort
the fetch early, rather than complete the fetch (or
wait for it to hit other length/time limits) and then
discard the result.

This could probably be best achieved with some
operator-specified filters hooked into FetchHTTP
between the header-fetch and the readFully...
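
A minimal sketch of the idea, not Heritrix's actual code: the class name MidFetchSketch, its Result type, and the use of java.util.function.Predicate as the filter type are all assumptions made for illustration. It shows a fetch split into a header phase and a body phase, with operator-supplied filters consulted in between, so that a rejected document's body is never read.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.function.Supplier;

/** Hypothetical illustration of filters run between header fetch and body read. */
public class MidFetchSketch {

    /** Outcome of a simulated fetch; body is null when a filter aborted it. */
    public static final class Result {
        public final boolean aborted;
        public final String body;

        Result(boolean aborted, String body) {
            this.aborted = aborted;
            this.body = body;
        }
    }

    /**
     * @param headers         response headers, already received
     * @param bodyReader      stands in for the expensive full read of the body
     * @param midFetchFilters operator-configured filters; each answers yes or no
     */
    public static Result fetch(Map<String, String> headers,
                               Supplier<String> bodyReader,
                               List<Predicate<Map<String, String>>> midFetchFilters) {
        for (Predicate<Map<String, String>> filter : midFetchFilters) {
            if (!filter.test(headers)) {
                // A filter said no: abort before the body is touched.
                return new Result(true, null);
            }
        }
        // Every filter accepted: only now pay for reading the body.
        return new Result(false, bodyReader.get());
    }
}
```

In this sketch the first rejecting filter ends the fetch. A real crawler would also have to close or abort the underlying connection at that point.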


Submitted: Gordon Mohr ( gojomo ) - 2004-04-24 00:13
Priority: 9
Status: Closed
Resolution: None
Assigned: Michael Stack
Category: None
Group: None
Visibility: Public


Comments ( 6 )

Date: 2007-03-14 01:29
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-764 -- please add further
comments at that location.


Date: 2004-10-05 18:04
Sender: stack-sf (Project Admin)

Actually close.


Date: 2004-10-05 18:04
Sender: stack-sf (Project Admin)

Implemented. Added docs to the user manual and FAQ.

Feature [ 941072 ] Allow operator-configured mid-HTTP-fetch
filters
* src/articles/user_manual.xml
Added documentation for ContentTypeRegExpFilter and for
midfetch filters in FetchHTTP.
* src/java/org/archive/crawler/fetcher/FetchHTTP.java
Added midfetch filters.
(addResponseContent): Added.
* src/java/org/archive/crawler/framework/Processor.java
(filtersAccept): Added override used when calling midfetch
filters.
* xdocs/faq.fml
Added FAQ entry on how to download specific MIME types only.
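
As a rough illustration of what a content-type regular-expression filter decides, here is a sketch. It is not the ContentTypeRegExpFilter source: the class ContentTypeRegexSketch and its accepts method are hypothetical, and rejecting responses that carry no Content-Type header is an assumed policy that an operator might well choose differently.

```java
import java.util.regex.Pattern;

/** Hypothetical stand-in for a filter that matches Content-Type against a regex. */
public class ContentTypeRegexSketch {
    private final Pattern accepted;

    public ContentTypeRegexSketch(String regex) {
        // Media type names are case-insensitive, so match that way.
        this.accepted = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
    }

    /** Returns true if the fetch should proceed to the response body. */
    public boolean accepts(String contentTypeHeader) {
        if (contentTypeHeader == null) {
            return false; // assumed policy: no declared type means reject
        }
        // Drop parameters such as "; charset=UTF-8" before matching.
        int semicolon = contentTypeHeader.indexOf(';');
        String mediaType = (semicolon >= 0
                ? contentTypeHeader.substring(0, semicolon)
                : contentTypeHeader).trim();
        return accepted.matcher(mediaType).matches();
    }
}
```

With the pattern text/.* this accepts "text/html; charset=UTF-8" and rejects "image/png", which is the text-only crawl described in the original request.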



Date: 2004-10-05 18:03
Sender: stack-sf (Project Admin)

Implemented. Added docs to the user manual and FAQ.

Feature [ 941072 ] Allow operator-configured mid-HTTP-fetch
filters
* src/articles/user_manual.xml
Added documentation for ContentTypeRegExpFilter and for
midfetch filters in FetchHTTP.
* src/java/org/archive/crawler/fetcher/FetchHTTP.java
Added midfetch filters.
(addResponseContent): Added.
* src/java/org/archive/crawler/framework/Processor.java
(filtersAccept): Added override used when calling midfetch
filters.
* xdocs/faq.fml
Added FAQ entry on how to download specific MIME types only.



Date: 2004-09-18 01:07
Sender: stack-sf (Project Admin)

The httpclient 3.0alpha has support for mid-fetch shutdown.
It's purportedly stable. I'll try it as part of this RFE.


Date: 2004-09-02 14:41
Sender: stack-sf (Project Admin)

Take on if-modified-since and a mime-type check as tests
this feature needs to pass before it is finished. Here's a
note from the list asking for an if-modified-since facility
(the mime-type check has been mentioned on the list in the past).


Tom Emerson wrote:

>Phil White writes:
>[...]
>
>>As a result, it's necessarily a longer term project and I'd prefer to
>>not have my DSL pegged out for the next 3 years or so. 8)
>
>[...]
>
>There has been a lot of research done on how to select URLs for
>subsequent crawling: the major search engines certainly don't recrawl
>their entire catalog on a regular basis. Searching on Google (you'll
>find papers by Sergey Brin and Larry Page, who both worked on this
>problem) or on CiteSeer will show a bunch. However, for your task this
>is probably overkill.
>
>One hack comes to mind, which may or may not work:
>
>In the Expert Settings for the crawl you can add "Accept" headers to
>the request. It turns out that the way I implemented this allows you
>to add *any* header to the request. The upshot is that you could try
>adding an 'If-Modified-Since:' header to the subsequent crawls, giving
>the date of your initial crawl. It isn't perfect, but it may help.
>
>You could also write a script that extracts all the URLs and then
>sends a HEAD request to determine which ones have changed... I was
>thinking of writing something like this, but have not gotten around to
>it.
>
If adding the 'If-Modified-Since' header doesn't work: we did a
little planning yesterday, and the feature '[ 941072 ] Allow
operator-configured mid-HTTP-fetch filters' is to be done
for an October 1st-ish release (1.2). This feature would
introduce filters that run after the headers have been downloaded
but before we start in on the body. The filters will say yes or no
on whether to proceed. Let me take on the above as a test
this feature needs to pass (another will be a mime-type filter).

St.Ack
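
The two if-modified-since approaches discussed above can be sketched as follows. Everything here is an assumption for illustration: the class ModifiedSinceSketch and both method names are hypothetical, and comparing the Last-Modified response header against the previous crawl date is only one plausible way a mid-fetch filter could make this decision. The first method formats a previous crawl time as an HTTP date suitable for an If-Modified-Since request header. The second decides, once the response headers are in, whether the body is worth reading.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Locale;

/** Hypothetical helper for recrawls that skip unchanged documents. */
public class ModifiedSinceSketch {
    // Fixed-width HTTP date, e.g. "Thu, 02 Sep 2004 14:41:00 GMT".
    private static final DateTimeFormatter HTTP_DATE =
            DateTimeFormatter.ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.US)
                    .withZone(ZoneOffset.UTC);

    /** Value to send in an If-Modified-Since request header. */
    public static String ifModifiedSinceValue(Instant previousCrawl) {
        return HTTP_DATE.format(previousCrawl);
    }

    /** Mid-fetch decision: true means go on and read the body. */
    public static boolean shouldFetchBody(int status, String lastModified,
                                          Instant previousCrawl) {
        if (status == 304) {
            return false; // server confirmed: not modified
        }
        if (lastModified == null) {
            return true; // no information, so fetch to be safe
        }
        try {
            Instant modified = ZonedDateTime.parse(lastModified, HTTP_DATE).toInstant();
            return modified.isAfter(previousCrawl);
        } catch (DateTimeParseException e) {
            return true; // unparseable date, so fetch to be safe
        }
    }
}
```

The second check matters for servers that ignore If-Modified-Since and answer 200 anyway: the filter can still avoid downloading the body when Last-Modified is not newer than the previous crawl.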


Changes ( 5 )

Field        Old Value  Date              By
status_id    Open       2004-10-05 18:04  stack-sf
close_date   -          2004-10-05 18:04  stack-sf
assigned_to  nobody     2004-09-01 23:19  gojomo
priority     6          2004-09-01 21:49  gojomo
priority     5          2004-07-29 00:55  gojomo