Both Mercator and HTTrack seem to get offsite embeds
not by noting in what HTML element the reference
occurred, but by looking for certain file extensions --
.gif, .jpg, etc. -- and always following those links, even
in <A HREF>s.
As a result, on "site-focused" crawls, they've sometimes
retrieved hundreds or thousands more "offsite
images".
While these don't appear "inside" in-focus pages, they
are often linked as if they were on the same site --
e.g. "click here for larger image" -- and so usually should
be included in a site-focused capture.
Thus, Heritrix should have an option, on by default, to
specify a number of file patterns; offsite links matching
one of them would be followed one hop, just like embeds.
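
A minimal sketch of the requested rule, in Python for brevity rather than as Heritrix code. The function name `should_fetch`, its parameters, and the default pattern list are all hypothetical; the request names only .gif and .jpg, so those are the only patterns shown. "One hop" is taken to mean that the offsite link must come from a page on the focus host.

```python
import re
from urllib.parse import urlparse

# Hypothetical defaults: the request names only .gif and .jpg ("etc.").
DEFAULT_PATTERNS = [r"\.gif$", r"\.jpg$"]


def should_fetch(link_url, source_url, focus_host, is_embed,
                 patterns=DEFAULT_PATTERNS):
    """Decide whether a discovered link belongs in a site-focused crawl.

    link_url   -- the URL found on the page
    source_url -- the page the link was found on
    focus_host -- the host the crawl is focused on
    is_embed   -- True if the link came from an embedding element
                  (e.g. <IMG SRC>), False for a navigational <A HREF>
    """
    link_host = urlparse(link_url).hostname
    source_host = urlparse(source_url).hostname

    # Links that stay on the focus host are always in scope.
    if link_host == focus_host:
        return True

    # Offsite links get one hop only: the source must itself be on-site.
    if source_host != focus_host:
        return False

    # Offsite embeds are already followed one hop.
    if is_embed:
        return True

    # Proposed option: treat an offsite <A HREF> like an embed when its
    # path matches one of the configured file patterns.
    path = urlparse(link_url).path
    return any(re.search(p, path, re.IGNORECASE) for p in patterns)
```

With these assumptions, an `<A HREF>` from a page on the focus host to an offsite `big.jpg` is fetched, while an `<A HREF>` to an offsite HTML page, or any offsite link found on an already-offsite resource, is not.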
Igor Ranitovic
Extraction
Date: 2007-03-14 00:06
Date: 2004-03-26 19:44