Heritrix: Internet Archive Web Crawler

Tracker: Bugs

seeds held back/poor breadth first? - ID: 809567
Last Update: Comment added ( karl-ia )

When multiple seeds from the same site are supplied, it
appears that links discovered on the site may be
crawled before all of the seeds are. This is undesirable
(except perhaps for clearly embedded resources such as
IMG/OBJ/FRAME/etc.) -- crawling within a site should be as
breadth-first as possible. Investigate and correct.


Submitted: Gordon Mohr ( gojomo ) - 2003-09-19 23:47
Priority: 5
Status: Closed
Resolution: Fixed
Assigned: Gordon Mohr
Category: None
Group: None
Visibility: Public


Comments ( 4 )

Date: 2007-03-14 00:06
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-34 -- please add further
comments at that location.


Date: 2004-03-29 23:16
Sender: gojomo (Project Admin)

Believed fixed; scheduleHigh() of all seeds ensures they are
handled ahead of all but the most urgent (prerequisite)
discovered URIs.
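
The ordering described above can be sketched as a per-host queue with three tiers. This is only an illustration under stated assumptions: `TieredHostQueue`, its `Tier` enum, and `schedulePrerequisite()` are hypothetical names invented here, not Heritrix's Frontier API; only the name scheduleHigh() comes from this report. Prerequisites are taken first, then seeds scheduled high, then ordinary discovered URIs, FIFO within each tier.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical per-host queue; NOT the actual Heritrix Frontier. */
class TieredHostQueue {
    /** Declaration order is dequeue order: earlier tiers are served first. */
    enum Tier { PREREQUISITE, HIGH, NORMAL }

    private final Map<Tier, Deque<String>> tiers = new EnumMap<>(Tier.class);

    TieredHostQueue() {
        for (Tier t : Tier.values()) {
            tiers.put(t, new ArrayDeque<>());
        }
    }

    /** Most urgent: e.g. something that must be fetched before anything else. */
    void schedulePrerequisite(String uri) { tiers.get(Tier.PREREQUISITE).addLast(uri); }

    /** Seeds go here so they precede ordinary discovered URIs. */
    void scheduleHigh(String uri) { tiers.get(Tier.HIGH).addLast(uri); }

    /** Ordinary discovered URIs. */
    void schedule(String uri) { tiers.get(Tier.NORMAL).addLast(uri); }

    /** Returns the next URI to fetch, or null when every tier is empty. */
    String next() {
        for (Tier t : Tier.values()) {
            String uri = tiers.get(t).pollFirst();
            if (uri != null) {
                return uri;
            }
        }
        return null;
    }
}

class TieredHostQueueDemo {
    public static void main(String[] args) {
        TieredHostQueue q = new TieredHostQueue();
        q.schedule("http://example.com/seed1/link");      // discovered early
        q.scheduleHigh("http://example.com/seed1");
        q.scheduleHigh("http://example.com/seed2");
        q.schedulePrerequisite("http://example.com/robots.txt");
        for (String uri = q.next(); uri != null; uri = q.next()) {
            System.out.println(uri);
        }
    }
}
```

In this sketch the discovered link is dequeued last even though it was enqueued first, because both seeds sit in a higher tier.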


Date: 2004-02-24 21:36
Sender: gojomo (Project Admin)

Seeds are now scheduleHigh()'d... other changes pending
other Frontier refactoring.


Date: 2004-02-20 02:16
Sender: gojomo (Project Admin)

No longer the case, because embeds go to the back of the host
queue. But it could recur when embeds are fetched at high
priority again.

Experimenting with the idea that all seeds should be
scheduleHigh()'d.

However, I see that scheduleHigh() may not always have the desired
effect: the URI goes behind all other URIs already enqueued for the
host, which might be a lot.
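
One possible reading of this caveat, sketched with a single FIFO per host. `FifoHostQueue` and both `scheduleHigh...` variants are hypothetical names for illustration, not Heritrix code, and the actual queue structure at the time may have differed. If "high" scheduling still appends to the tail, a seed waits behind everything already enqueued for that host; inserting at the head avoids that wait.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical single-FIFO host queue, used only to illustrate the caveat. */
class FifoHostQueue {
    private final Deque<String> uris = new ArrayDeque<>();

    /** Ordinary scheduling: append to the tail. */
    void schedule(String uri) { uris.addLast(uri); }

    /** "High" scheduling that still appends: the URI waits behind the backlog. */
    void scheduleHighAtTail(String uri) { uris.addLast(uri); }

    /** Alternative: insert at the head so the URI is taken next. */
    void scheduleHighAtHead(String uri) { uris.addFirst(uri); }

    /** Returns the next URI, or null when the queue is empty. */
    String next() { return uris.pollFirst(); }
}

class FifoHostQueueDemo {
    public static void main(String[] args) {
        FifoHostQueue tail = new FifoHostQueue();
        tail.schedule("http://example.com/a");
        tail.schedule("http://example.com/b");
        tail.scheduleHighAtTail("http://example.com/seed");
        System.out.println("tail insert, first out: " + tail.next());

        FifoHostQueue head = new FifoHostQueue();
        head.schedule("http://example.com/a");
        head.schedule("http://example.com/b");
        head.scheduleHighAtHead("http://example.com/seed");
        System.out.println("head insert, first out: " + head.next());
    }
}
```

Head insertion has its own cost: several seeds inserted this way come out in reverse order of insertion, which is one reason a separate high-priority tier may be preferable.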


Changes ( 4 )

Field          Old Value  Date              By
status_id      Open       2004-03-29 23:16  gojomo
close_date     -          2004-03-29 23:16  gojomo
resolution_id  None       2004-03-29 23:16  gojomo
assigned_to    nobody     2004-02-17 22:38  gojomo