Heritrix: Internet Archive Web Crawler

Tracker: Feature Requests

9 Mass-add URIs to running crawl and force reconsideration - ID: 939679
Last Update: Comment added ( karl-ia )

This almost slipped through the cracks: we need a
facility to add new seeds to an in-progress (though
possibly paused) crawl.

Shouldn't be too hard, except perhaps for updating
seed-derived scopes in a clean manner.


Gordon Mohr ( gojomo ) - 2004-04-21 22:52

9

Closed

None

Gordon Mohr

None

None

Public


Comments ( 7 )

Date: 2007-03-14 01:29
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-763 -- please add further
comments at that location.


Date: 2004-07-31 00:11
Sender: nobody

Logged In: NO

Here are comments on my testing of this feature from
http://crawler.archive.org/cgi-bin/wiki.pl?CrawlTestPlan.
It works.

Ran a small crawl. Ran a new crawl. Paused it. Tried to
import a file whose path was wrong. Got a message in the UI and
an exception in heritrix_out.log about FileNotFound. Tried to
import the first crawl's recover.gz. It failed (exception in
heritrix_out.log complaining about parsing) but the UI just said
zero imported. Ungzipped the recover log and retried. It said
17 URIs were successfully imported. Started up the paused crawl.
Let it run a while. Looked in crawl.log. Saw that the
unique subset of the 17 imported URIs got crawled. This
works. Tested the crawl.log format. Imported 5 URIs. The listed items
were crawled. Noticed that if 'force revisit' was not
checked, then we got -5000s in the logs (out-of-scope).
This seems right. This feature works.


Date: 2004-07-30 04:55
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

Also added mention of capability to relevant paragraph of
user-manual. Closing.


Date: 2004-07-30 01:09
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

Implementation of [ 939679 ] Mass-add URIs to running crawl
and force reconsideration
* main.jsp, frontier.jsp
Expand option which appears for paused crawls to
offer ability to add URIs from file. File may contain 1
URI per line, or be in crawl.log or recoveryJournal formats
(for the case where it has been culled from those sources).
Operator may force URIs to be revisited even if they've
already been marked as included.
* CrawlJobHandler.java
Support method for frontier.jsp's import URI feature.
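To illustrate the three accepted input styles, here is a minimal Python sketch of pulling a URI out of one line of an import file. It is not the code in CrawlJobHandler.java. The function name, the style names, the assumption that the URI is the fourth whitespace-separated field of a crawl.log line, and the assumption that a recovery journal line is a short tag such as "F+" followed by the URI are all illustrative guesses and should be checked against real files.

```python
def extract_uri(line, style="default", journal_tags=("F+",)):
    """Return the URI carried by one import-file line, or None.

    style is a hypothetical selector:
      "default"         -> one URI per line
      "crawlLog"        -> assumed: URI is the 4th whitespace-separated field
      "recoveryJournal" -> assumed: "<tag> <uri>", tag listed in journal_tags
    """
    fields = line.split()
    if not fields:
        return None  # blank line, nothing to import
    if style == "default":
        return fields[0]
    if style == "crawlLog":
        return fields[3] if len(fields) > 3 else None
    if style == "recoveryJournal":
        if len(fields) > 1 and fields[0] in journal_tags:
            return fields[1]
        return None
    raise ValueError("unknown style: " + style)


def extract_all(lines, style="default"):
    """Collect the URIs from an iterable of lines, skipping unusable lines."""
    uris = []
    for line in lines:
        uri = extract_uri(line, style)
        if uri is not None:
            uris.append(uri)
    return uris
```

Lines that do not match the assumed layout are skipped rather than raising, so an import over such a file reports fewer (possibly zero) URIs instead of failing.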



Date: 2004-07-20 19:08
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

Most basic way to do this: from somewhere near the 'inspect
queues' option (that appears when the crawler is paused),
allow operator to supply a file, one URI per line, of URIs
to be scheduled. These should NOT be treated as true seeds,
and should be rescheduled even if alreadyIncluded. (The
existing 'force-fetch' facility might work for this; or it
might be desirable to offer another route around the
alreadyIncluded check, or edit the alreadyIncluded to remove
these items then re-add them... unsure.)
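The "route around the alreadyIncluded check" idea can be sketched with a toy frontier in Python. This is only an illustration of the behaviour under discussion, not Heritrix's frontier. The class, its attributes, and the force parameter are hypothetical names.

```python
class ToyFrontier:
    """Toy model: a queue plus a set of URIs that were already scheduled."""

    def __init__(self):
        self.already_included = set()
        self.queue = []

    def schedule(self, uri, force=False):
        """Queue uri; return True if it was queued.

        Without force, a URI seen before is silently dropped.
        With force, the already-included check is bypassed.
        """
        if uri in self.already_included and not force:
            return False
        self.already_included.add(uri)
        self.queue.append(uri)
        return True

    def import_uris(self, uris, force=False):
        """Schedule many URIs; return how many were actually queued."""
        return sum(1 for uri in uris if self.schedule(uri, force=force))
```

In this model, importing a list of earlier failures without force queues nothing for URIs the crawl has already seen, which is why some bypass is needed for retries.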

One of the most-common uses of this facility may be to retry
URIs that failed early in the crawl due to partial outages.
This suggests another fancier option: offer a way to
'resurrect' failures from the crawl.log, possibly even
restoring 'via' and link-path info. There could be an option
to resurrect-all or resurrect-by-regexp.
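A rough Python sketch of resurrect-all versus resurrect-by-regexp follows. It assumes that a crawl.log line carries the fetch status in its second whitespace-separated field and the URI in its fourth, and that a negative status marks a failure worth retrying. Both are simplifying assumptions and the function and parameter names are hypothetical. Restoring 'via' and link-path info is not attempted.

```python
import re


def resurrect_candidates(crawl_log_lines, pattern=None,
                         is_failure=lambda status: status < 0):
    """Return unique failed URIs from crawl.log lines, in first-seen order.

    pattern: optional regexp; when given, only matching URIs are returned
             (resurrect-by-regexp). When None, all failures are returned
             (resurrect-all).
    is_failure: predicate on the numeric status; the default (negative
             status means failure) is an assumption, not a documented rule.
    """
    regex = re.compile(pattern) if pattern else None
    seen = set()
    candidates = []
    for line in crawl_log_lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # not a recognisable log line
        try:
            status = int(fields[1])
        except ValueError:
            continue  # status field is not numeric
        uri = fields[3]
        if not is_failure(status):
            continue
        if regex is not None and not regex.search(uri):
            continue
        if uri not in seen:
            seen.add(uri)
            candidates.append(uri)
    return candidates
```

An operator would probably want a narrower predicate than the default, for example one that leaves out out-of-scope results such as the -5000 entries mentioned in the testing comment.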


Date: 2004-04-23 22:25
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

We do have this, if one edits the seed list, but that may
not be ideal for large adds. Also, if a group of URIs had
failed early in the crawl, but the operator wants them to be
retried in the context of the crawl, there should be a way
to add-and-force them to be treated as new.

So, I've updated the 'Summary' to reflect this more advanced
need.


Date: 2004-04-22 01:25
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

Or, can we do this by editing the seed list mid-crawl?


Attached File

No Files Currently Attached

Changes ( 7 )

Field        Old Value                            Date              By
status_id    Open                                 2004-07-30 04:55  gojomo
close_date   -                                    2004-07-30 04:55  gojomo
assigned_to  nobody                               2004-07-22 22:08  gojomo
assigned_to  gojomo                               2004-07-20 19:08  gojomo
assigned_to  nobody                               2004-07-20 17:45  gojomo
priority     5                                    2004-07-07 22:02  gojomo
summary      SM14/SM15 Add URIs to running crawl  2004-04-23 22:25  gojomo