Heritrix: Internet Archive Web Crawler

Tracker: Feature Requests

Should support depth-first search priority scheduling (patch) - ID: 1523276
Last Update: Comment added ( karl-ia )

Heritrix currently only crawls breadth-first, in that
it crawls all seeds before any discovered links.
Attached is a patch that adds an option to schedule
discovered links (up to an optional depth from a seed)
before remaining seeds to facilitate depth-first
search. It changes redirect and depth-first priorities
to HIGH, and gives pre-requisites a priority value 2 less
than their parent URI's (not sure if this is necessary).
Looking at the code that uses priority scheduling, I
believe this will work fine in any configuration, with
the possible exception of the AdaptiveRevisit frontier;
I have not looked into whether it is compatible with that.
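
To make the scheduling idea in the description concrete, here is a minimal, self-contained sketch. It is not the patch and not Heritrix code: the class, the `schedulingFor` and `crawlOrder` helpers, the numeric values of HIGH and NORMAL (lower is dequeued first), and the meaning given to `preferenceDepthHops` (0 disables the preference) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Illustrative sketch only; none of this is Heritrix code. */
class DepthPreferenceSketch {
    // Assumed directive values: a lower number is dequeued earlier.
    static final int HIGH = 1;
    static final int NORMAL = 3;

    /** A queued URI: hops from its seed, priority, and insertion order for FIFO ties. */
    static final class Entry {
        final String uri;
        final int hops;
        final int priority;
        final long order;

        Entry(String uri, int hops, int priority, long order) {
            this.uri = uri;
            this.hops = hops;
            this.priority = priority;
            this.order = order;
        }
    }

    /** Hypothetical rule: discovered links within the preference depth get HIGH. */
    static int schedulingFor(int hopsFromSeed, int preferenceDepthHops) {
        boolean preferred = hopsFromSeed > 0 && hopsFromSeed <= preferenceDepthHops;
        return preferred ? HIGH : NORMAL;
    }

    /** Crawls two seeds that each link to one page and returns the visit order. */
    static List<String> crawlOrder(int preferenceDepthHops) {
        PriorityQueue<Entry> queue = new PriorityQueue<>(
                Comparator.comparingInt((Entry e) -> e.priority)
                        .thenComparingLong(e -> e.order));
        long counter = 0;
        queue.add(new Entry("seedA", 0, NORMAL, counter++));
        queue.add(new Entry("seedB", 0, NORMAL, counter++));

        List<String> visited = new ArrayList<>();
        while (!queue.isEmpty()) {
            Entry current = queue.poll();
            visited.add(current.uri);
            if (current.hops == 0) {
                // Pretend every seed page contains exactly one outgoing link.
                int hops = current.hops + 1;
                queue.add(new Entry(current.uri + "/link", hops,
                        schedulingFor(hops, preferenceDepthHops), counter++));
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        System.out.println(crawlOrder(0)); // [seedA, seedB, seedA/link, seedB/link]
        System.out.println(crawlOrder(1)); // [seedA, seedA/link, seedB, seedB/link]
    }
}
```

With the preference disabled, both seeds are visited before any discovered link; with a depth of 1, each seed's link is visited before the next seed, which is the behaviour the request asks for.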


Submitted: Eric C. Jensen ( ecjensen ) - 2006-07-16 05:01
Priority: 5
Status: Closed
Resolution: None
Assigned: Michael Stack
Category: Configuration
Group: 1.10.0
Visibility: Public


Comments ( 7 )

Date: 2007-03-14 01:48
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-1022 -- please add further
comments at that location.


Date: 2006-08-25 18:53
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Closing. Works for Eric. See below.

Date: Tue, 22 Aug 2006 16:23:37 -0500
From: Eric <ej@ir.iit.edu>
To: Michael Stack <stack@archive.org>
Subject: Re: Does this work for you?
Message-ID: <20060822212337.GA10653@duvel.ir.iit.edu>

...I tested it by creating a new job based on the default
profile (which had the appropriate classes, so i guess it is
a windows problem) and my depth-first code worked fine.




Date: 2006-08-18 23:34
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

ping eric


Date: 2006-08-04 18:06
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Patch applied with below commit message.

Eric, would you mind testing that it works for you? If so,
make a note in here and I'll close the issue as implemented.

Thanks.

[ 1523276 ] (contrib) Support depth-first search priority scheduling
Contributed by Eric C. Jensen ecjensen at users.sourceforge.net
Reviewed and amended -- removed tabs, and had to refactor since patch
was against version 1.6 rather than HEAD 1.7. Did not test (default
settings do not change crawler behavior).
* src/java/org/archive/crawler/frontier/AbstractFrontier.java
From Eric:

That part of the patch makes sure to only set seed redirect
priorities to MEDIUM if they were previously NORMAL
(previously this was done unconditionally). This is
necessary since when the depth-first option is set, seed
redirects will be given HIGH priority before this so we
don't want to lower it to MEDIUM. As conjectured in my
initial comment, some may also want a separate option to
make redirects (or just seed redirects) relatively one
higher priority than their parent (so they're crawled
immediately rather than waiting for the rest of the
seeds).
* src/java/org/archive/crawler/postprocessor/LinksScoper.java
(ATTR_PREFERENCE_DEPTH_HOPS): Added.
(getSchedulingFor): Amended API passing CrawlURI (Used logging).



Date: 2006-08-04 14:57
Sender: ecjensen

Logged In: YES
user_id=705615

That part of the patch makes sure to only set seed redirect
priorities to MEDIUM if they were previously NORMAL
(previously this was done unconditionally). This is
necessary since when the depth-first option is set, seed
redirects will be given HIGH priority before this so we
don't want to lower it to MEDIUM. As conjectured in my
initial comment, some may also want a separate option to
make redirects (or just seed redirects) relatively one
higher priority than their parent (so they're crawled
immediately rather than waiting for the rest of the seeds).
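
The difference between the old unconditional assignment and the patched conditional one can be sketched as follows. This is a standalone illustration, not the code in AbstractFrontier: the class and method names are hypothetical, and the numeric directive values (lower means more urgent) are assumed.

```java
/** Illustrative sketch only; names and numeric values are assumptions. */
class SeedRedirectPrioritySketch {
    // Assumed ordering: a lower number is scheduled earlier.
    static final int HIGH = 1;
    static final int MEDIUM = 2;
    static final int NORMAL = 3;

    /** Behaviour described as "previously": always force MEDIUM. */
    static int unconditional(int currentDirective) {
        return MEDIUM;
    }

    /** Behaviour described for the patch: promote to MEDIUM only from NORMAL. */
    static int conditional(int currentDirective) {
        return currentDirective == NORMAL ? MEDIUM : currentDirective;
    }

    public static void main(String[] args) {
        // A seed redirect already raised to HIGH by the depth-first option:
        System.out.println(unconditional(HIGH)); // 2, i.e. demoted to MEDIUM
        System.out.println(conditional(HIGH));   // 1, i.e. HIGH is preserved
        // A seed redirect still at the default priority:
        System.out.println(conditional(NORMAL)); // 2, i.e. promoted to MEDIUM
    }
}
```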


Date: 2006-07-25 15:29
Sender: stack-sfProject Admin

Logged In: YES
user_id=924942

Good stuff Eric.

Patch looks good except, why did you do this:

+ if (curi.getSchedulingDirective() == CandidateURI.NORMAL)
+     curi.setSchedulingDirective(CandidateURI.MEDIUM);


I don't see how it improves over what was there previously.


Date: 2006-07-25 15:00
Sender: ecjensen

Logged In: YES
user_id=705615

As conjectured, promoting pre-req's to priorities 2 less
than their parent is unnecessary, 1 is fine. Fixed patch
attached.

Also, some might want redirects to be given priority 1 less
than their parent to ensure they're crawled before remaining
seeds (another aspect of depth-first), but this should
probably be a separate option.
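
A rough sketch of the "1 less than the parent" rule for pre-requisites discussed above. The helper name, the numeric directive values, and the clamping at the most urgent value are assumptions for illustration only; the thread does not say how the patch treats a parent that is already at the top priority.

```java
/** Illustrative sketch only; not taken from the patch. */
class PrerequisitePrioritySketch {
    // Assumed ordering: a lower number is scheduled earlier.
    static final int HIGHEST = 0;
    static final int HIGH = 1;
    static final int MEDIUM = 2;
    static final int NORMAL = 3;

    /**
     * Returns a priority one step more urgent than the parent's,
     * clamped so it never goes below HIGHEST (the clamp is an assumption).
     */
    static int prerequisitePriority(int parentPriority) {
        return Math.max(HIGHEST, parentPriority - 1);
    }

    public static void main(String[] args) {
        System.out.println(prerequisitePriority(NORMAL));  // 2 (MEDIUM)
        System.out.println(prerequisitePriority(HIGH));    // 0 (HIGHEST)
        System.out.println(prerequisitePriority(HIGHEST)); // 0 (clamped)
    }
}
```

The same one-step rule could back the separate redirect option suggested above, so that a redirect is crawled ahead of its parent's siblings.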


Attached File ( 1 )

Filename: depth_first_fixed.patch

Changes ( 8 )

Field              Old Value                        Date              By
close_date         -                                2006-08-25 18:53  stack-sf
status_id          Open                             2006-08-25 18:53  stack-sf
assigned_to        nobody                           2006-08-18 23:34  gojomo
artifact_group_id  1.8.0                            2006-08-04 18:06  stack-sf
category_id        None                             2006-08-04 18:06  stack-sf
File Added         186178: depth_first_fixed.patch  2006-07-25 15:00  ecjensen
File Deleted       185045:                          2006-07-25 14:56  ecjensen
File Added         185045: depth_first.patch        2006-07-16 05:01  ecjensen