Heritrix currently crawls only breadth-first, in that
it crawls all seeds before any discovered links.
The attached patch adds an option to schedule
discovered links (up to an optional depth from a seed)
before the remaining seeds, to facilitate depth-first
search. It changes the redirect and depth-first
priorities to HIGH, and gives prerequisites a priority
2 less than their parent URI's priority (I am not sure
this is necessary). Looking at the code that uses
priority scheduling, I believe this will work in any
configuration except possibly with the AdaptiveRevisit
frontier; I have not looked into whether it is
compatible with that.
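
To illustrate the ordering described above, here is a minimal, self-contained sketch. It is not the patch and does not use Heritrix classes: `DepthFirstSketch`, `Entry`, the numeric values of `HIGH` and `NORMAL`, and the convention that a smaller number is dequeued sooner are all assumptions made for the example.

```java
import java.util.PriorityQueue;

// Hypothetical stand-in for a frontier; not a Heritrix class.
class DepthFirstSketch {
    // Assumed priority levels: a smaller number is dequeued sooner.
    static final int HIGH = 1;
    static final int NORMAL = 3;

    static final class Entry implements Comparable<Entry> {
        final String uri;
        final int priority;
        final int depth;   // number of hops from the seed
        final long seq;    // insertion order, used to break ties

        Entry(String uri, int priority, int depth, long seq) {
            this.uri = uri;
            this.priority = priority;
            this.depth = depth;
            this.seq = seq;
        }

        @Override
        public int compareTo(Entry other) {
            int byPriority = Integer.compare(priority, other.priority);
            return byPriority != 0 ? byPriority : Long.compare(seq, other.seq);
        }
    }

    private final PriorityQueue<Entry> queue = new PriorityQueue<>();
    private final int maxDepth; // negative means no depth limit
    private long counter = 0;

    DepthFirstSketch(int maxDepth) {
        this.maxDepth = maxDepth;
    }

    // Seeds are scheduled at the default priority.
    void addSeed(String uri) {
        queue.add(new Entry(uri, NORMAL, 0, counter++));
    }

    // Links within the depth limit are promoted so they run before remaining seeds.
    void addDiscovered(String uri, Entry parent) {
        int depth = parent.depth + 1;
        boolean promote = maxDepth < 0 || depth <= maxDepth;
        queue.add(new Entry(uri, promote ? HIGH : NORMAL, depth, counter++));
    }

    // Prerequisites get a priority value 2 less than their parent's.
    void addPrerequisite(String uri, Entry parent) {
        queue.add(new Entry(uri, parent.priority - 2, parent.depth, counter++));
    }

    // Returns the next entry to crawl, or null if the queue is empty.
    Entry next() {
        return queue.poll();
    }
}
```

With a depth limit of 1 and two seeds, a link discovered on the first seed is returned before the second seed, while a link two hops away waits behind it. Entries that share a priority are still served in insertion order, so this shows only the "links before remaining seeds" effect, not a strict depth-first traversal.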
Michael Stack
Configuration
1.10.0
Public
Date: 2007-03-14 01:48
Date: 2006-08-25 18:53
Date: 2006-08-18 23:34
Date: 2006-08-04 18:06
Date: 2006-08-04 14:57
Date: 2006-07-25 15:29
Date: 2006-07-25 15:00

| Filename | Description | Download |
|---|---|---|
| depth_first_fixed.patch | | Download |

| Field | Old Value | Date | By |
|---|---|---|---|
| close_date | - | 2006-08-25 18:53 | stack-sf |
| status_id | Open | 2006-08-25 18:53 | stack-sf |
| assigned_to | nobody | 2006-08-18 23:34 | gojomo |
| artifact_group_id | 1.8.0 | 2006-08-04 18:06 | stack-sf |
| category_id | None | 2006-08-04 18:06 | stack-sf |
| File Added | 186178: depth_first_fixed.patch | 2006-07-25 15:00 | ecjensen |
| File Deleted | 185045: | 2006-07-25 14:56 | ecjensen |
| File Added | 185045: depth_first.patch | 2006-07-16 05:01 | ecjensen |