Heritrix: Internet Archive Web Crawler

Tracker: Bugs

5 domain scope leakage - ID: 955527
Last Update: Comment added ( karl-ia )

If we have a seed whose host matches www[0-9]*.a.com, all URLs on
any a.com host should be within domain scope, but they are not.


Submitted: Igor Ranitovic ( ia_igor ) - 2004-05-17 23:47
Priority: 5
Status: Closed
Resolution: Fixed
Assigned: Igor Ranitovic
Category: General
Group: None
Visibility: Public


Comments ( 2 )

Date: 2007-03-14 00:11
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-144 -- please add further
comments at that location.


Date: 2004-05-20 18:39
Sender: ia_igor (Project Admin)

There is another domain-scope problem that causes scope leakage
(crawling of unwanted hosts): every discovered host whose name ends
with the same string as any seed host is crawled.
For example, with the seed http://a.com, all hosts whose names end
with "a.com" are crawled, including:
www.aa.com
www.abba.com
www.baba.com

This is not desirable behavior.
The fix should ensure that only a.com, www[0-9]*.a.com, and all
other hosts that end with .a.com are crawled.
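
To make the requested rule concrete, here is a minimal sketch of a host check that anchors the suffix match on a dot. It is not the actual Heritrix fix. The class DomainScopeSketch and its methods seedDomain and inScope are hypothetical names invented for this illustration, and the sketch assumes hosts have already been extracted from the URLs.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Heritrix: it only illustrates the matching rule.
final class DomainScopeSketch {
    // Leading "www" plus optional digits and a dot, e.g. "www.", "www2.", "www10."
    private static final Pattern WWW_PREFIX = Pattern.compile("^www[0-9]*\\.");

    /** Reduce a seed host to its domain, e.g. "www2.a.com" becomes "a.com". */
    static String seedDomain(String seedHost) {
        String host = seedHost.toLowerCase(Locale.ROOT);
        String stripped = WWW_PREFIX.matcher(host).replaceFirst("");
        // Keep the original host if stripping would leave a bare label such as "com".
        return stripped.contains(".") ? stripped : host;
    }

    /** True if candidateHost is the seed domain itself or a subdomain of it. */
    static boolean inScope(String seedHost, String candidateHost) {
        String domain = seedDomain(seedHost);
        String host = candidateHost.toLowerCase(Locale.ROOT);
        // Requiring the "." before the domain rejects look-alikes such as "www.aa.com".
        return host.equals(domain) || host.endsWith("." + domain);
    }
}
```

With the seed host a.com, this check accepts a.com, www.a.com, and www2.a.com and rejects www.aa.com, www.abba.com, and www.baba.com. A plain endsWith("a.com") test would accept all of them, which is the leakage described above.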





Changes ( 4 )

Field          Old Value             Date              By
status_id      Open                  2004-06-04 17:37  stack-sf
resolution_id  None                  2004-06-04 17:37  stack-sf
summary        domain scope problem  2004-06-04 17:37  stack-sf
close_date     -                     2004-06-04 17:37  stack-sf