
Heritrix: Internet Archive Web Crawler

Tracker: Bugs

3 links to likely-embed types should be treated as embeds - ID: 791481
Last Update: Comment added ( karl-ia )

Both Mercator and HTTrack seem to get offsite embeds
not by noting in what HTML element the reference
occurred, but by looking for certain file extensions --
.gif, .jpg, etc. -- and always following those links, even
in <A HREF>s.

As a result, on "site focused" crawls, they've sometimes
retrieved hundreds or thousands more "offsite
images".

While these don't appear "inside" in-focus pages, they
are often linked as if they were on the same site --
e.g., "click here for larger image" -- and so usually should
be included in a site-focused capture.

Thus, Heritrix should have an option, by default on, to
specify a number of file-patterns which result in offsite
links being followed one hop, just like embeds.
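A minimal sketch of the proposed option, to make the request concrete. It is not Heritrix code: the class name LikelyEmbedRule, the Hop enum, and the classify method are all hypothetical, and the patterns are assumed to be regular expressions matched against the URL path. A link found in an ordinary anchor is promoted to an embed hop when the option is on and its path matches one of the configured file patterns.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical illustration only; these names are not part of Heritrix. */
final class LikelyEmbedRule {
    /** How the crawler should treat a discovered reference. */
    enum Hop { EMBED, LINK }

    private final boolean enabled;
    private final List<Pattern> patterns = new ArrayList<>();

    /**
     * @param enabled whether likely-embed promotion is on (the report asks for on by default)
     * @param regexes file patterns, assumed here to be regexes applied to the URL path
     */
    LikelyEmbedRule(boolean enabled, List<String> regexes) {
        this.enabled = enabled;
        for (String regex : regexes) {
            patterns.add(Pattern.compile(regex, Pattern.CASE_INSENSITIVE));
        }
    }

    /**
     * @param foundInEmbedElement true if the reference came from an embedding element such as IMG
     * @param url the discovered URL
     */
    Hop classify(boolean foundInEmbedElement, String url) {
        if (foundInEmbedElement) {
            return Hop.EMBED; // true embeds are unaffected by the option
        }
        if (!enabled) {
            return Hop.LINK;
        }
        String path = pathOf(url);
        for (Pattern p : patterns) {
            if (p.matcher(path).find()) {
                return Hop.EMBED; // likely-embed type: follow one hop like an embed
            }
        }
        return Hop.LINK;
    }

    /** Returns the path without query or fragment; falls back to the raw string. */
    private static String pathOf(String url) {
        try {
            String path = URI.create(url).getPath();
            return path != null ? path : url;
        } catch (IllegalArgumentException e) {
            return url;
        }
    }
}
```

With a pattern such as "\\.(gif|jpe?g)$", an offsite <A HREF> pointing at photo.JPG would be classified as an embed hop, while a link to page.html would remain an ordinary link and stay subject to the site focus.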


Gordon Mohr ( gojomo ) - 2003-08-19 20:21

3

Closed

None

Igor Ranitovic

Extraction

None

Public


Comments ( 2 )

Date: 2007-03-14 00:06
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-30 -- please add further
comments at that location.


Date: 2004-03-26 19:44
Sender: ia_igor (Project Admin)

Logged In: YES
user_id=715474

Added an additional focus filter that is configurable.
By default, the following file extensions are within the
crawl's focus:
.avi
.bmp
.doc
.gif
.jp(e)g
.mid
.mov
.mp2
.mp3
.mp4
.mpeg
.pdf
.png
.ppt
.ram
.rm
.smil
.swf
.tif(f)
.wav
.wmv
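As a rough illustration of how a default list like the one above could be turned into a single configurable matcher, here is a sketch. The class DefaultFocusExtensions and its methods are hypothetical and do not reflect how the actual filter is implemented; the entries .jp(e)g and .tif(f) are assumed to mean an optional letter and are written as jpe?g and tiff?.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical illustration only; not the filter that was added to Heritrix. */
final class DefaultFocusExtensions {
    /** The default extensions from the comment above, as regex fragments. */
    static final List<String> DEFAULTS = Arrays.asList(
            "avi", "bmp", "doc", "gif", "jpe?g", "mid", "mov",
            "mp2", "mp3", "mp4", "mpeg", "pdf", "png", "ppt",
            "ram", "rm", "smil", "swf", "tiff?", "wav", "wmv");

    /** Builds one case-insensitive pattern that matches a path ending in any listed extension. */
    static Pattern compile(List<String> extensionRegexes) {
        String alternation = String.join("|", extensionRegexes);
        return Pattern.compile("\\.(?:" + alternation + ")$", Pattern.CASE_INSENSITIVE);
    }

    /** True if the URL path ends in one of the configured extensions. */
    static boolean inFocus(Pattern pattern, String path) {
        return pattern.matcher(path).find();
    }
}
```

Passing a different list to compile, for example Arrays.asList("zip"), stands in for the configurable part: an operator could widen or narrow the set without touching the matching logic.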


Attached File

No Files Currently Attached

Changes ( 5 )

Field        Old Value  Date              By
status_id    Open       2004-03-26 19:44  ia_igor
close_date   -          2004-03-26 19:44  ia_igor
category_id  None       2004-02-17 22:38  gojomo
priority     5          2004-02-17 22:38  gojomo
assigned_to  nobody     2004-02-17 22:38  gojomo