Heritrix: Internet Archive Web Crawler

Tracker: Bugs

regex for midfetch filter not being stored in crawl order - ID: 1379040
Last Update: Comment added ( karl-ia )

Using the 1.6.0 release code.

Created a new crawl, added a mid-fetch filter of type
ContentTypeRegExpFilter. In Settings I entered the
regex for it. After running the crawl and noticing it
didn't seem to be working correctly I looked at the
crawl order file and the midfetch filters section
looked like this:

<map name="midfetch-filters">
  <newObject name="html-only"
      class="org.archive.crawler.filter.ContentTypeRegExpFilter">
    <boolean name="enabled">true</boolean>
    <boolean name="if-match-return">true</boolean>
    <string name="regexp"/>
  </newObject>
</map>

Seems like it's not saving the regex to the file.
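
One way to confirm the symptom without reading the order file by eye is to scan it for `<string>` settings that have no value. The following is a minimal sketch using only the JDK's XML parser; the class name `EmptyStringSettingCheck` and the method `findEmptyStrings` are hypothetical helpers, not part of Heritrix, and the embedded XML is the fragment quoted above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper, not a Heritrix class.
class EmptyStringSettingCheck {

    /** Returns "parentName/settingName" for every <string> element with no text. */
    static List<String> findEmptyStrings(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse settings XML", e);
        }
        List<String> empty = new ArrayList<>();
        NodeList strings = doc.getElementsByTagName("string");
        for (int i = 0; i < strings.getLength(); i++) {
            Element setting = (Element) strings.item(i);
            if (setting.getTextContent().trim().isEmpty()) {
                // The enclosing element (here the newObject) identifies the filter.
                Node parent = setting.getParentNode();
                String parentName = parent instanceof Element
                        ? ((Element) parent).getAttribute("name") : "";
                empty.add(parentName + "/" + setting.getAttribute("name"));
            }
        }
        return empty;
    }

    public static void main(String[] args) {
        // The midfetch-filters fragment from the report above.
        String xml =
              "<map name=\"midfetch-filters\">"
            + "<newObject name=\"html-only\""
            + " class=\"org.archive.crawler.filter.ContentTypeRegExpFilter\">"
            + "<boolean name=\"enabled\">true</boolean>"
            + "<boolean name=\"if-match-return\">true</boolean>"
            + "<string name=\"regexp\"/>"
            + "</newObject>"
            + "</map>";
        System.out.println(findEmptyStrings(xml)); // prints [html-only/regexp]
    }
}
```

An empty result would suggest the regex was written to the file; a non-empty one lists the filters whose string settings were saved blank.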

Rob Eger
Local Matters, Inc.
Denver, CO


Submitted By: Nobody/Anonymous ( nobody ) - 2005-12-12 22:02
Priority: 6
Status: Closed
Resolution: Fixed
Assigned To: Gordon Mohr
Category: configuration
Group: 1.10.0
Visibility: Public


Comments ( 6 )

Date: 2007-03-14 01:03
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-528 -- please add further
comments at that location.


Date: 2006-08-21 22:26
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

Considering fixed. Ideally will become moot when a
comprehensive settings redo is in place.


Date: 2006-04-18 22:34
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

Applying pandae's suggested fix as a simple, low-risk fix for
1.8. (Thanks!) Commit comment:

Initial fix for [ 1379040 ] regex for midfetch filter not
being stored in crawl order
* FetchHTTP.java
make midfetchfilters a non-expert setting, working
around problem with non-expert settings nested inside expert
settings

--
Should re-investigate post-1.8.


Date: 2006-04-16 13:15
Sender: pandae

Logged In: YES
user_id=962291

proposed fix:

file to edit:
src/java/org/archive/crawler/fetcher/FetchHTTP.java

line to remove:
this.midfetchfilters.setExpertSetting(true);


Date: 2006-04-16 13:12
Sender: pandae

Logged In: YES
user_id=962291

I further analyzed the issue and it seems like I found the
cause.
The settings for the midfetch-filters will only get saved in
"expert settings" mode. I think this behaviour is
unintentional because those settings are also displayed in
non-expert mode.

So I think there are two possible solutions. Either make
sure that those settings are only visible in expert mode
(which would require marking not only the midfetch-filter as
"expert mode" but also all of its child settings), or, the
solution I would prefer, simply do not make this filter an
expert setting.
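
The mismatch described above can be illustrated with a toy model. This is only a sketch under stated assumptions: `Setting`, `displayed`, and `saved` are hypothetical names and do not reflect Heritrix's actual settings classes. It assumes the UI judges each setting by its own expert flag, while saving skips the whole subtree of an expert parent when not in expert mode.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only; none of these types exist in Heritrix.
class ExpertSettingSketch {

    /** Hypothetical stand-in for one node in a settings tree. */
    static class Setting {
        final String name;
        final boolean expert;
        final List<Setting> children = new ArrayList<>();

        Setting(String name, boolean expert) {
            this.name = name;
            this.expert = expert;
        }

        Setting add(Setting child) {
            children.add(child);
            return this;
        }
    }

    /** Assumed UI rule: each node is shown or hidden by its own flag alone. */
    static List<String> displayed(Setting root, boolean expertMode) {
        List<String> out = new ArrayList<>();
        collectDisplayed(root, expertMode, out);
        return out;
    }

    private static void collectDisplayed(Setting s, boolean expertMode, List<String> out) {
        if (expertMode || !s.expert) {
            out.add(s.name);
        }
        for (Setting c : s.children) {
            collectDisplayed(c, expertMode, out);
        }
    }

    /** Assumed save rule: an expert node outside expert mode drops its whole subtree. */
    static List<String> saved(Setting root, boolean expertMode) {
        List<String> out = new ArrayList<>();
        collectSaved(root, expertMode, out);
        return out;
    }

    private static void collectSaved(Setting s, boolean expertMode, List<String> out) {
        if (s.expert && !expertMode) {
            return; // children are never reached, even non-expert ones
        }
        out.add(s.name);
        for (Setting c : s.children) {
            collectSaved(c, expertMode, out);
        }
    }

    public static void main(String[] args) {
        // Before the fix: expert container with a non-expert child.
        Setting before = new Setting("midfetch-filters", true)
                .add(new Setting("regexp", false));
        System.out.println(displayed(before, false)); // prints [regexp]
        System.out.println(saved(before, false));     // prints []

        // After the preferred fix: the container is no longer an expert setting.
        Setting after = new Setting("midfetch-filters", false)
                .add(new Setting("regexp", false));
        System.out.println(saved(after, false));      // prints [midfetch-filters, regexp]
    }
}
```

In this model the non-expert child is editable in the UI but silently lost on save, which matches the reported symptom; clearing the expert flag on the container makes the two rules agree.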




Date: 2006-04-13 10:11
Sender: nobody

Logged In: NO

I am experiencing the same issue using 1.7.3 CVS code
(checked out yesterday).


Changes ( 7 )

Field              Old Value  Date              By
status_id          Open       2006-08-21 22:26  gojomo
resolution_id      None       2006-08-21 22:26  gojomo
artifact_group_id  None       2006-08-21 22:26  gojomo
close_date         -          2006-08-21 22:26  gojomo
priority           7          2006-04-18 22:34  gojomo
assigned_to        nobody     2006-04-18 01:54  gojomo
priority           5          2006-04-18 01:54  gojomo