Heritrix: Internet Archive Web Crawler

Tracker: Feature Requests

5 Add info to candidateURI before scheduling - ID: 1104916
Last Update: Comment added ( karl-ia )

From: "bergmark_d" <bergmark@C...>
Date: Sun Jan 16, 2005 12:29 pm
Subject: Postselector

Hi, all -- One nice feature of Mercator was the ability for the
CrawlURI (docBundle in Mercator) to carry history along with it. Thus
you could not only make the crawl URI pass information from one
processor to the next, but a parent URI could pass information to
child URIs.

For adaptive focused crawling, it is very helpful to be able to pass
miscellaneous information from parent pages to children.

In Heritrix, I can see where attributes could be copied into child
URIs -- in the handleLinkCollection method of the Postselector class,
for example, when the CandidateURI is formed. So I would like to
extend Postselector and override just this one method, but it is
private.

How do you all decide which methods to make protected and which
private? Maybe it would be nice to allow the application programmer
to add attributes to the CandidateURI before it goes onto the
frontier, perhaps by some callback or by factoring out a protected,
overridable method that decorates the CandidateURI.
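
[Editor's sketch, not part of the original post.] The callback option
mentioned above might look roughly like the following. Every name here
(UriRecord, LinkScheduler, CallbackDemo, the attribute keys) is a
hypothetical stand-in invented for illustration; none of it is Heritrix
API, and the real Postselector and CandidateURI classes may be shaped
quite differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

// Hypothetical stand-in for a URI record that carries free-form attributes.
class UriRecord {
    final String uri;
    final Map<String, Object> attributes = new HashMap<>();

    UriRecord(String uri) {
        this.uri = uri;
    }
}

// Hypothetical stand-in for a processor that turns outlinks into candidates.
// The decorator callback sees (parent, child) before the child is queued.
class LinkScheduler {
    private final BiConsumer<UriRecord, UriRecord> decorator;
    final List<UriRecord> frontier = new ArrayList<>(); // stand-in for a frontier

    LinkScheduler(BiConsumer<UriRecord, UriRecord> decorator) {
        this.decorator = decorator;
    }

    void handleLinks(UriRecord parent, List<String> outlinks) {
        for (String link : outlinks) {
            UriRecord child = new UriRecord(link);
            decorator.accept(parent, child); // last chance to copy parent history
            frontier.add(child);
        }
    }
}

class CallbackDemo {
    static List<UriRecord> run() {
        UriRecord parent = new UriRecord("http://example.com/");
        parent.attributes.put("topic-score", 0.8);

        // Copy one attribute from the parent onto every child.
        LinkScheduler scheduler = new LinkScheduler((p, c) ->
            c.attributes.put("parent-topic-score", p.attributes.get("topic-score")));

        scheduler.handleLinks(parent,
            Arrays.asList("http://example.com/a", "http://example.com/b"));
        return scheduler.frontier;
    }
}
```

The point of the callback shape is that the application programmer never
has to subclass the processor; the cost is one more constructor argument
to configure.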

Or perhaps there is some better way to add historical information to
candidate URIs before the parent CrawlURI is lost forever and before
the child CandidateURIs are added to the frontier?

Thanks for any thoughts/info before I go copying humongous swatches of
code out of the Postselector into my own custom Postselector.

D.


Submitted: Michael Stack ( stack-sf ) - 2005-01-19 00:38
Priority: 5
Status: Closed
Resolution: None
Assigned: Michael Stack
Category: API
Group: None
Visibility: Public


Comments ( 2 )

Date: 2007-03-14 01:38
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-882 -- please add further
comments at that location.


Date: 2005-01-19 00:43
Sender: stack-sf (Project Admin)

Closing. Did the below.

Fix for '[ 1104916 ] Add info to candidateURI before scheduling'
* src/java/org/archive/crawler/postprocessor/Postselector.java
  Formatting.
  (schedule): Changed access from private to protected so it can be
  overridden.
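
[Editor's sketch, not part of the original comment.] With a protected
scheduling method, a subclass can attach history to each child just
before it is queued. The classes below (Candidate, BasePostselector,
HistoryPostselector) and the "hops-from-relevant" attribute are
hypothetical stand-ins written for this illustration; the signature of
the real Postselector.schedule method is not shown in this thread and
may differ.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a candidate URI that remembers its parent.
class Candidate {
    final String uri;
    final Candidate parent; // null for a seed
    final Map<String, Object> attributes = new HashMap<>();

    Candidate(String uri, Candidate parent) {
        this.uri = uri;
        this.parent = parent;
    }
}

// Hypothetical stand-in for the base processor.
class BasePostselector {
    final List<Candidate> frontier = new ArrayList<>(); // stand-in for a frontier

    void process(Candidate parent, List<String> outlinks) {
        for (String link : outlinks) {
            schedule(new Candidate(link, parent));
        }
    }

    // Protected rather than private, so subclasses can intercept scheduling.
    protected void schedule(Candidate child) {
        frontier.add(child);
    }
}

// Subclass that overrides only the scheduling step.
class HistoryPostselector extends BasePostselector {
    @Override
    protected void schedule(Candidate child) {
        if (child.parent != null) {
            // Treat a missing or non-integer parent value as zero hops.
            Object hops = child.parent.attributes.get("hops-from-relevant");
            int parentHops = (hops instanceof Integer) ? (Integer) hops : 0;
            child.attributes.put("hops-from-relevant", parentHops + 1);
        }
        super.schedule(child); // keep the base behaviour
    }
}
```

Calling super.schedule at the end keeps the subclass from duplicating
the base class's queueing logic, which is the copying the original
request wanted to avoid.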



Attached File

No Files Currently Attached

Changes ( 2 )

Field       Old Value   Date               By
status_id   Open        2005-01-19 00:43   stack-sf
close_date  -           2005-01-19 00:43   stack-sf