
Heritrix: Internet Archive Web Crawler

Tracker: Bugs

7 If filter in main scope disabled heritrix aborts imme - ID: 1103015
Last Update: Comment added ( karl-ia )

this happens with a no-mods 1.2 (java 1.4.2) and with
(who knows exactly what I have) version 1.3 (jdk1.5.0_01)

if a filter in the main scope is disabled heritrix will
immediately abort

simplest way to reproduce the problem is

create new job from default (or simple) profile
add a seed url
go to settings
disable pathologicalpath
fix user-agent & from fields
start job

perhaps different but probably the same problem....
disable a filter in the main scope while it is running
and it hangs


dave skinner ( frodobay ) - 2005-01-15 17:42

7

Closed

Fixed

Nobody/Anonymous

None

None

Public


Comments ( 7 )

Date: 2007-03-14 00:20
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-338 -- please add further
comments at that location.


Date: 2005-03-23 22:51
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Added documentation to the filter that the pathological
filter returns false if the path is not pathological.

Added ability for filters to override what's returned when
they are disabled.

Made it so the pathological filter returns false if it's
disabled.

Made it so pathdepth, the other filter usually included as
part of the exclusion OrFilter, returns whatever its
path-less-or-equal-return setting is.

Addresses this issue only. New scoping model will make it
easier to figure out why filters work the way they do.

Closing.

Below is commit.

Fix for '[ 1103015 ] If filter in main scope disabled
heritrix aborts imme'.
* src/java/org/archive/crawler/filter/OrFilter.java
  Fixed help message. Formatting.
* src/java/org/archive/crawler/filter/PathDepthFilter.java
  Formatting. Override of getFilterOffPosition that returns result of
  returnTrueIfMatches.
* src/java/org/archive/crawler/filter/PathologicalPathFilter.java
  Make it clear that this filter normally returns FALSE so if disabled,
  we do the right thing when part of exclude filters, the usual usage for
  this filter.
* src/java/org/archive/crawler/filter/URIRegExpFilter.java
* src/java/org/archive/crawler/framework/CrawlScope.java
  Reformatting.
* src/java/org/archive/crawler/framework/Filter.java
  Formatting.
  (getFilterOffPosition): Added.
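
For illustration only, the override hook described in this commit could look roughly like the sketch below. This is not the committed Heritrix code: the class names (SketchFilter, PathologicalSketch, PathDepthSketch), the plain String path argument, the default off-position of true, and the "three repeats" rule are assumptions made to keep the example self-contained. Only the method name getFilterOffPosition and the described return values come from the comment above.

```java
/** Simplified stand-in for a crawler filter; not the real Filter class. */
abstract class SketchFilter {
    private boolean enabled = true;

    void setEnabled(boolean enabled) { this.enabled = enabled; }

    /** Value returned while the filter is disabled. Assumed default: true. */
    protected boolean getFilterOffPosition(String path) { return true; }

    /** The filter's real test; only consulted while the filter is enabled. */
    protected abstract boolean innerAccepts(String path);

    final boolean accepts(String path) {
        return enabled ? innerAccepts(path) : getFilterOffPosition(path);
    }
}

/** Returns true when a path segment repeats three or more times in a row. */
final class PathologicalSketch extends SketchFilter {
    @Override
    protected boolean innerAccepts(String path) {
        String[] seg = path.split("/");
        int run = 1;
        for (int i = 1; i < seg.length; i++) {
            boolean same = !seg[i].isEmpty() && seg[i].equals(seg[i - 1]);
            run = same ? run + 1 : 1;
            if (run >= 3) return true;
        }
        return false;
    }

    /** Disabled behaves like "not pathological", so an exclude OR is unaffected. */
    @Override
    protected boolean getFilterOffPosition(String path) { return false; }
}

/** Returns the configured value for shallow paths and its negation otherwise. */
final class PathDepthSketch extends SketchFilter {
    private final int maxDepth;
    private final boolean pathLessOrEqualReturn;

    PathDepthSketch(int maxDepth, boolean pathLessOrEqualReturn) {
        this.maxDepth = maxDepth;
        this.pathLessOrEqualReturn = pathLessOrEqualReturn;
    }

    @Override
    protected boolean innerAccepts(String path) {
        long depth = path.chars().filter(c -> c == '/').count();
        return depth <= maxDepth ? pathLessOrEqualReturn : !pathLessOrEqualReturn;
    }

    /** Disabled behaves like "path is within the depth limit". */
    @Override
    protected boolean getFilterOffPosition(String path) { return pathLessOrEqualReturn; }
}
```

With this shape each filter decides for itself what "switched off" means, so the base class does not need to know whether a filter is used for inclusion or exclusion.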



Date: 2005-03-02 19:47
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

Deferring consideration for now -- new scoping (filtering)
based on rules should be easier to understand, may obviate
this concern. Will reevaluate after more of that work is done.


Date: 2005-02-09 19:56
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Changed summary.


Date: 2005-02-01 22:40
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

More from Dave Skinner below (I've upped the priority):


> That's right. If your bug is not getting what you think is
> appropriate attention, then you might 'whine' on the list
> about it.


I'd like to see the philosophy of filter returns addressed.
my bug report(s) on filters were based around pre-fetch
filters. but I've now got a mid-fetch filter that needs to
return true if it is going to work *sigh*. I've not tried
having it configured but not-enabled, but I think FetchHTTP
and it would work correctly.

so let's see.... just to get my head on straight.....
pre-fetch filters return false so that the curi is not
filtered out and ignored. mid-fetch filters return true so
that normal processing of the curi continues. They are the
opposite of each other.

maybe one solution is to define that all filters return
"true" to continue "normal processing". this definition
addresses the mind set (and confusion) issues about what a
filter is supposed to do. (by extension of this, a filter
that is not enabled is assumed to return true. ie continue
"normal processing"). I think we can assume that anyone
concerned about this knows what "normal processing" is.
(this would require both a coding and documentation change
to the OR filter)

or maybe the solution is that in addition to the "enabled"
attribute (ie ATTR_ENABLED) being defined by class Filter,
to also define an attribute (something like ATTR_IGNORED)
that indicates what an *ignored* or not enabled filter would
return. I don't have a sense of the history, but was that
what ATTR_INVERTED (commented out) was maybe trying to do?

but this is much more messy and convoluted than the first
solution and does not directly address the mind mud issue.

I'm about ready to change all my code so that instead of doing

return true ; or return false ;

it does something like ;

return OKreturn ; or return ! OKreturn ;

the OKreturn value could be (or ! be) the value of
ATTR_IGNORED. that might make it possible for the same
filter to be used in multiple places in the processor flow.
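
A minimal sketch of the OKreturn idea, assuming it means "the value that lets normal processing continue at this filter's position". The class name OkReturnFilter, its constructor, and the Predicate-based test are hypothetical and not part of Heritrix; ATTR_IGNORED is only the name proposed above, modelled here as a plain boolean field.

```java
import java.util.function.Predicate;

/** Hypothetical filter whose polarity is configured instead of hard-coded. */
final class OkReturnFilter {
    // Value meaning "continue normal processing" where this filter is installed.
    // Stands in for the proposed ATTR_IGNORED setting.
    private final boolean okReturn;
    // True when the URI is one this filter wants to act on.
    private final Predicate<String> flagged;
    private boolean enabled = true;

    OkReturnFilter(boolean okReturn, Predicate<String> flagged) {
        this.okReturn = okReturn;
        this.flagged = flagged;
    }

    void setEnabled(boolean enabled) { this.enabled = enabled; }

    boolean accepts(String uri) {
        if (!enabled) {
            return okReturn;            // a disabled filter never interferes
        }
        return flagged.test(uri) ? !okReturn : okReturn;
    }

    public static void main(String[] args) {
        Predicate<String> isExe = u -> u.endsWith(".exe");
        // Pre-fetch position as described above: false means "do not filter out".
        OkReturnFilter preFetch = new OkReturnFilter(false, isExe);
        // Mid-fetch position as described above: true means "keep processing".
        OkReturnFilter midFetch = new OkReturnFilter(true, isExe);
        System.out.println(preFetch.accepts("http://example.com/a.exe"));
        System.out.println(midFetch.accepts("http://example.com/a.exe"));
    }
}
```

The same test logic is reused in both positions; only the okReturn value changes, and a disabled filter always answers with it.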

maybe we should take this public, but I doubt the average
person has a clue what the issues are. I doubt if I have a
complete picture.

if you want to attach this (or portions) to the bug report
it's OK.


Dave Skinner dave at solid dot net
High Performance Programming---assembly (lots of them),
C, java, perl
Database and Non-trivial web site implementations
Real-time and embedded systems are my specialty




Date: 2005-01-20 18:27
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

(Thanks for filing the issue Dave)

From Kris, a related comment:

"I wanted to set up a HostScope with no additional scope.
Since it comes standard with additionalScopeFocus and
transitiveFilter filters to 'filter in' additional URIs, I
figured that simply disabling them (enable set to false)
would do the trick. Seems logical, right? Err, no, not
according to Heritrix. This caused EVERYTHING to be in
scope! At least that's how it looked.

So, that probably needs to be looked at a bit.

To do what I wanted, I set the additionalScopeFocus to,
if-match-return: false, custom pattern .* (so, match all
URIs and return false on all matches, effectively turning it
off, but in a messy way). The transitiveFilter was simpler,
just set everything to zero. Still this is a messy way to do
things."
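
To make the workaround in the quote concrete, here is a rough sketch of a regex filter with an if-match-return setting. RegexFilterSketch is a hypothetical stand-in, not URIRegExpFilter itself, and it assumes the filter returns the negated setting when the pattern does not match.

```java
import java.util.regex.Pattern;

/** Hypothetical regex filter: returns ifMatchReturn on a match, its negation otherwise. */
final class RegexFilterSketch {
    private final Pattern pattern;
    private final boolean ifMatchReturn;

    RegexFilterSketch(String regex, boolean ifMatchReturn) {
        this.pattern = Pattern.compile(regex);
        this.ifMatchReturn = ifMatchReturn;
    }

    boolean accepts(String uri) {
        boolean matched = pattern.matcher(uri).matches();
        return matched ? ifMatchReturn : !ifMatchReturn;
    }

    public static void main(String[] args) {
        // Pattern .* matches every URI, so the filter always answers false
        // and never pulls an extra URI into scope.
        RegexFilterSketch off = new RegexFilterSketch(".*", false);
        System.out.println(off.accepts("http://example.com/"));
        System.out.println(off.accepts("http://other.example.org/page"));
    }
}
```

Under these assumptions the filter is still enabled but can never contribute a true, which is the effect the quote was after.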

I upped the priority (Two people have reported on it and
there is a suggested fix to try).


Date: 2005-01-19 05:43
Sender: frodobay

Logged In: YES
user_id=1197824

this is fixed with the following change to framework/Filter.java

public boolean accepts(Object o) {
    CrawlURI curi = (o instanceof CrawlURI) ? (CrawlURI) o : null;

    // Skip the evaluation if the filter is disabled
    try {
        if (!((Boolean) getAttribute(ATTR_ENABLED, curi)).booleanValue()) {
            //// return true;
            return false;
        }
    } catch (AttributeNotFoundException e) {
        logger.severe(e.getMessage());
        return true;
    }

the fix is the line with the //// on the front. returning
true at this point makes heritrix slurp all the rest of the
crawl anytime a filter is disabled at the global level.

I added the "return true" in the catch to cause the rest of
the crawl to be slurped up. If it occurs, something is
seriously wrong and it needs to be looked at.
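
A small sketch of why the disabled value matters, assuming the disabled filter sits inside an exclusion OR (any member returning true rules the URI out). ExclusionOrSketch, Entry, and the two placeholder tests are hypothetical and only model the logic, not the Heritrix classes.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

final class ExclusionOrSketch {
    /** Stand-in for one filter: an enabled flag plus its real test. */
    static final class Entry {
        final boolean enabled;
        final Predicate<String> test;

        Entry(boolean enabled, Predicate<String> test) {
            this.enabled = enabled;
            this.test = test;
        }
    }

    /** A URI is excluded as soon as any member of the OR answers true. */
    static boolean excluded(String uri, List<Entry> filters, boolean valueWhenDisabled) {
        for (Entry f : filters) {
            boolean result = f.enabled ? f.test.test(uri) : valueWhenDisabled;
            if (result) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Entry> filters = Arrays.asList(
            new Entry(false, u -> u.contains("/a/a/a/")),  // disabled placeholder test
            new Entry(true, u -> u.length() > 200));       // enabled placeholder test
        String seed = "http://example.com/";
        // Disabled filter answering true: even the seed is excluded.
        System.out.println("disabled -> true:  excluded=" + excluded(seed, filters, true));
        // Disabled filter answering false: the seed survives.
        System.out.println("disabled -> false: excluded=" + excluded(seed, filters, false));
    }
}
```

In this model a single disabled member that answers true is enough to exclude every URI, seeds included, which would leave the crawl with nothing to do.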

another issue in Filter is whether

protected boolean innerAccepts(Object o) {
    return true;
}
should be returning false. As all the filters included with
heritrix are exclusion filters, false would seem to be the
"ignore the filter" case.

see new bug report [ 1105025 ] for more about true vs false
in filters



Attached File

No Files Currently Attached

Changes ( 7 )

Field          Old Value                                              Date              By
summary        If filter in main scope disabled heritrix aborts imme  2007-03-14 00:20  karl-ia
resolution_id  None                                                   2005-03-23 22:51  stack-sf
close_date     -                                                      2005-03-23 22:51  stack-sf
status_id      Open                                                   2005-03-23 22:51  stack-sf
summary        filter problem                                         2005-02-09 19:56  stack-sf
priority       6                                                      2005-02-01 22:40  stack-sf
priority       5                                                      2005-01-20 18:27  stack-sf