Heritrix: Internet Archive Web Crawler

Tracker: Feature Requests

Decompose Postselector to Scoping and Scheduling components - ID: 1119616
Last Update: Comment added ( karl-ia )

The Postselector currently does two distinct things:
filters discovered URLs by the active Scope, then
schedules the selected URLs into the Frontier.

The first step in particular seems a good candidate for
extension, for example remembering all rejected URLs.
We don't want people to have to subclass/extend the
crucial Postselector to do this, so we could decompose
its operation into 2 processors.

This would also give operators a chance to insert their
own extra steps between Scope-filtering and scheduling.
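
To make the proposal concrete, here is a minimal sketch of the split in plain Java. It is not Heritrix code: `LinkStep`, `ScopingStep`, `SchedulingStep`, and `Chain` are hypothetical names, URLs are plain strings, and the scope is assumed to be a simple predicate.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** One step in a post-processing chain (hypothetical, not the Heritrix Processor API). */
interface LinkStep {
    List<String> process(List<String> candidates);
}

/** Step 1: keep only in-scope URLs and remember the rejected ones. */
final class ScopingStep implements LinkStep {
    private final Predicate<String> scope;
    final List<String> rejected = new ArrayList<>();

    ScopingStep(Predicate<String> scope) {
        this.scope = scope;
    }

    @Override
    public List<String> process(List<String> candidates) {
        List<String> accepted = new ArrayList<>();
        for (String url : candidates) {
            if (scope.test(url)) {
                accepted.add(url);
            } else {
                rejected.add(url); // extension point: log or store rejections
            }
        }
        return accepted;
    }
}

/** Step 2: hand whatever reaches this step to the frontier (a plain list here). */
final class SchedulingStep implements LinkStep {
    final List<String> frontier = new ArrayList<>();

    @Override
    public List<String> process(List<String> candidates) {
        frontier.addAll(candidates);
        return candidates;
    }
}

/** Runs the steps in order, feeding each step's output to the next. */
final class Chain {
    static List<String> run(List<LinkStep> steps, List<String> discovered) {
        List<String> current = discovered;
        for (LinkStep step : steps) {
            current = step.process(current);
        }
        return current;
    }
}
```

Because each step only sees the previous step's output, an operator could place an extra `LinkStep` between the scoping and scheduling steps without touching either class.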


Submitted: Gordon Mohr ( gojomo ) - 2005-02-09 21:00
Priority: 8
Status: Closed
Resolution: None
Assigned: Michael Stack
Category: multimachine
Group: 1.6.0
Visibility: Public


Comments ( 3 )

Date: 2007-03-14 01:38
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-895 -- please add further
comments at that location.


Date: 2005-06-07 23:19
Sender: stack-sf (Project Admin)

Implemented. Below is commit message. Closing.

Implement '[ 1119616 ] Decompose Postselector to Scoping and Scheduling components'
Postselector has been refactored into LinksScoper and FrontierScheduler.
* src/articles/developer_manual.xml
* src/articles/user_manual.xml
    Change Postselector references to LinksScoper and FrontierScheduler references.
* src/articles/releasenotes.xml
    Added note that Postselector is gone to 1.6.0 changes.
* src/conf/heritrix.properties
    Change Postselector to a LinksScoper reference.
* src/conf/modules/Processor.options
    Remove Postselector. Add LinksScoper and FrontierScheduler.
* src/conf/profiles/deciding-default/order.xml
* src/conf/profiles/default/order.xml
* src/conf/selftest/order.xml
    Remove Postselector. Add LinksScoper and FrontierScheduler.
    Change name of (Postselector) LinksScoper attribute from
    scope-rejected-uri-log-filters to scope-rejected-url-filters.
* src/java/org/archive/crawler/datamodel/CandidateURI.java
    Removed Postselector reference.
* src/java/org/archive/crawler/datamodel/CrawlURI.java
    (hasPrerequisiteUri): Added.
    (clearOutlinks, replaceOutlinks, outlinksSize): Added.
    (createCandidateURI): Added utility methods using current CrawlURI instance
    and passed Link, construct CandidateURI.
* src/java/org/archive/crawler/prefetch/Preselector.java
    Subclass Scoper. Formatting. Use parent class isInScope method to check scope.
* src/java/org/archive/crawler/framework/Scoper.java
* src/java/org/archive/crawler/postprocessor/FrontierScheduler.java
* src/java/org/archive/crawler/postprocessor/LinksScoper.java
    Added.
* src/java/org/archive/crawler/postprocessor/Postselector.java
    Removed.
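
The commit message notes that Preselector now subclasses Scoper and uses the parent's isInScope method. As a rough illustration of that arrangement only, the sketch below shows a shared base class owning the scope test while a pre-fetch check and a post-fetch link check both reuse it. `ScopeAware`, `PrefetchCheck`, and `OutlinkCheck` are hypothetical names, and the real Scoper signatures may well differ.

```java
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical shared base: the scope decision is made in exactly one place. */
abstract class ScopeAware {
    private final Predicate<String> scope;

    protected ScopeAware(Predicate<String> scope) {
        this.scope = scope;
    }

    /** Assumed analogue of an isInScope check; here the scope is a plain predicate. */
    protected boolean isInScope(String url) {
        return scope.test(url);
    }
}

/** Pre-fetch side: decide whether a URL about to be fetched is still wanted. */
final class PrefetchCheck extends ScopeAware {
    PrefetchCheck(Predicate<String> scope) {
        super(scope);
    }

    boolean shouldFetch(String url) {
        return isInScope(url);
    }
}

/** Post-fetch side: count how many discovered outlinks pass the same scope test. */
final class OutlinkCheck extends ScopeAware {
    OutlinkCheck(Predicate<String> scope) {
        super(scope);
    }

    int countInScope(List<String> outlinks) {
        int count = 0;
        for (String url : outlinks) {
            if (isInScope(url)) {
                count++;
            }
        }
        return count;
    }
}
```

Keeping the check in the base class means the two sides cannot drift apart in how they interpret the scope.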


Date: 2005-06-06 19:46
Sender: stack-sf (Project Admin)

Taking this. For Australia.


Attached File

No Files Currently Attached

Changes ( 6 )

Field              Old Value  Date              By
artifact_group_id  None       2005-09-23 21:08  gojomo
status_id          Open       2005-06-07 23:19  stack-sf
close_date         -          2005-06-07 23:19  stack-sf
category_id        None       2005-06-06 19:46  stack-sf
priority           5          2005-06-06 19:46  stack-sf
assigned_to        nobody     2005-06-06 19:46  stack-sf