Heritrix: Internet Archive Web Crawler

Tracker: Feature Requests

5 [contrib] Preselector ATTR_ALLOW_BY_REGEXP - ID: 1388275
Last Update: Comment added ( karl-ia )

Hi again, Michael,

This contribution allows one to set a curi inclusion
regexp in the preselector. This turns out to be very
handy if your crawl should grab only an identified set
of mime types. I set it up to pretty carefully match
the exclusion regexp already in place.
Once again this is based on the 1.5 branch head.

Thanks,
Karl Wright


(Attached patch was made by St.Ack against current HEAD)


Submitted: Michael Stack ( stack-sf ) - 2005-12-22 18:45
Priority: 5
Status: Closed
Resolution: None
Assigned: Michael Stack
Category: API
Group: 1.8.0
Visibility: Public


Comments ( 3 )

Date: 2007-03-14 01:45
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-991 -- please add further
comments at that location.


Date: 2005-12-22 19:22
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Applied with below message. Closing.

Implementing '[ 1388275 ] [contrib] Preselector ATTR_ALLOW_BY_REGEXP'.
Contribution by Karl Wright of Metacarta (kwright at metacarta dot com).
From email with Karl:
>> >> This contribution allows one to set a curi inclusion regexp in the
>> >> preselector. This turns out to be very handy if your crawl should
>> >> grab only an identified set of mime types. I set it up to pretty
>> >> carefully match the exclusion regexp already in place.
>> >> Once again this is based on the 1.5 branch head.
> >
> > Patch looks good. Thanks for the contrib.
> >
> > Here's a question on motivation. So your regex looks at file endings to
> > figure mimetypes? And you couldn't make this work using the
> > ATTR_BLOCK_BY_REGEXP? (Negative regexes -- regexes for things that do not
> > match an expression -- are awkward to write. This is probably sufficient
> > reason for including your patch.)
> >
Right- we had a messy regexp, and making it negative was
well-nigh impossible.

* src/java/org/archive/crawler/prefetch/Preselector.java
Add attribute ATTR_ALLOW_BY_REGEXP.
(innerProcess): Test CrawlURI.toString against ATTR_ALLOW_BY_REGEXP if
present.
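
The check described in the change note above can be sketched roughly as follows. This is a minimal illustration using only java.util.regex, not the Heritrix Preselector source: the class name AllowBlockSketch, the accepts method, the full-string matching, and the order of the two tests are all assumptions made for the example.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of an allow-then-block regexp filter.
// It is not the Heritrix Preselector code; names and matching
// semantics are assumptions made for illustration.
final class AllowBlockSketch {
    private final Pattern allow; // null means no allow regexp is configured
    private final Pattern block; // null means no block regexp is configured

    AllowBlockSketch(String allowRegexp, String blockRegexp) {
        this.allow = compileOrNull(allowRegexp);
        this.block = compileOrNull(blockRegexp);
    }

    private static Pattern compileOrNull(String regexp) {
        return (regexp == null || regexp.isEmpty()) ? null : Pattern.compile(regexp);
    }

    /** Returns true if the URI string passes both filters. */
    boolean accepts(String uri) {
        // If an allow regexp is present, the whole URI string must match it.
        if (allow != null && !allow.matcher(uri).matches()) {
            return false;
        }
        // If a block regexp is present, the URI string must not match it.
        if (block != null && block.matcher(uri).matches()) {
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // Example patterns only: allow by file ending, block one path.
        AllowBlockSketch s =
            new AllowBlockSketch("(?i).*\\.(html?|pdf)$", ".*/private/.*");
        System.out.println(s.accepts("http://example.com/report.pdf"));         // true
        System.out.println(s.accepts("http://example.com/logo.png"));           // false
        System.out.println(s.accepts("http://example.com/private/report.pdf")); // false
    }
}
```

Leaving the allow regexp unset makes the filter behave as a block-only filter, which is the situation the patch was written to improve on.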


Date: 2005-12-22 19:12
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

More on this patch. The quote below is from my email to Karl, followed by his reply:

> Patch looks good. Thanks for the contrib.
>
> Here's a question on motivation. So your regex looks at
> file endings to figure mimetypes? And you couldn't make
> this work using the ATTR_BLOCK_BY_REGEXP? (Negative regexes
> -- regexes for things that do not match an expression -- are
> awkward to write. This is probably sufficient reason for
> including your patch.)
>

Right- we had a messy regexp, and making it negative was
well-nigh impossible.
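
To illustrate why the negative form is awkward, here is a small sketch comparing an allow-style pattern with a block-style pattern intended to keep the same URIs. The extension list and the class name NegatedRegexSketch are made up for the example and are not taken from the patch; the block form relies on a negative lookahead, which already reads worse for this simple case and gets harder as the positive pattern grows.

```java
import java.util.regex.Pattern;

// Hypothetical comparison of an allow regexp with an equivalent
// block regexp. The extension list is an assumption for illustration.
final class NegatedRegexSketch {
    // Positive form: keep URIs that end in one of the wanted extensions.
    static final Pattern ALLOW =
        Pattern.compile("(?i).*\\.(html?|pdf|txt)$");

    // Negative form: block every URI that does NOT end in a wanted
    // extension, expressed with a negative lookahead.
    static final Pattern BLOCK_EQUIVALENT =
        Pattern.compile("(?i)^(?!.*\\.(html?|pdf|txt)$).*$");

    static boolean keptByAllow(String uri) {
        return ALLOW.matcher(uri).matches();
    }

    static boolean keptByBlock(String uri) {
        return !BLOCK_EQUIVALENT.matcher(uri).matches();
    }

    public static void main(String[] args) {
        String[] uris = {
            "http://example.com/a.pdf",
            "http://example.com/index.html",
            "http://example.com/logo.png",
        };
        for (String uri : uris) {
            // Both columns should agree for every URI.
            System.out.println(uri + " allow=" + keptByAllow(uri)
                + " block=" + keptByBlock(uri));
        }
    }
}
```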


Attached File ( 1 )

Filename: allow_by_regexp.diff

Changes ( 4 )

Field        Old Value                     Date              By
status_id    Open                          2005-12-22 19:22  stack-sf
assigned_to  nobody                        2005-12-22 19:22  stack-sf
close_date   -                             2005-12-22 19:22  stack-sf
File Added   160852: allow_by_regexp.diff  2005-12-22 18:45  stack-sf