Both Mercator and HTTrack seem to get offsite embeds
not by noting which HTML element the reference
occurred in, but by looking for certain file extensions --
.gif, .jpg, etc. -- and always following those links, even
in <A HREF>s.
As a result, on "site-focused" crawls, they've sometimes
retrieved hundreds or thousands more "offsite
images".
While these don't appear "inside" in-focus pages, they
are often linked as if they were on the same site --
e.g. "click here for larger image" -- and so usually should
be included in a site-focused capture.
Thus, Heritrix should have an option, on by default, to
specify a number of file patterns that cause matching offsite
links to be followed one hop, just like embeds.
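
A rough sketch of the proposed rule, in Python rather than Heritrix's Java and purely for illustration. All names here (`OFFSITE_FOLLOW_PATTERNS`, `matches_follow_pattern`, `should_fetch`) are hypothetical and not part of Heritrix, and the `.png` entry in the default pattern is an assumption, since only .gif and .jpg are named above.

```python
import re
from urllib.parse import urlsplit

# Hypothetical default patterns. Only .gif and .jpg are named in this
# request; .png is an assumed example of the "etc.".
OFFSITE_FOLLOW_PATTERNS = [re.compile(r"\.(gif|jpe?g|png)$", re.IGNORECASE)]


def matches_follow_pattern(url, patterns=OFFSITE_FOLLOW_PATTERNS):
    """Return True if the URL's path matches any configured file pattern."""
    path = urlsplit(url).path  # ignore query string and fragment
    return any(p.search(path) for p in patterns)


def should_fetch(link_url, focus_host, found_on_in_focus_page, is_embed,
                 patterns=OFFSITE_FOLLOW_PATTERNS):
    """Decide whether a discovered link belongs in a site-focused crawl."""
    # Links on the focus host are always in scope.
    if urlsplit(link_url).hostname == focus_host:
        return True
    # Offsite links are followed at most one hop from an in-focus page.
    if not found_on_in_focus_page:
        return False
    # Offsite embeds are followed as before; offsite <A HREF> links are
    # followed only when they match one of the configured file patterns.
    return is_embed or matches_follow_pattern(link_url, patterns)
```

With these assumptions, an `<A HREF>` from an in-focus page to `http://images.example.net/big/photo.JPG` would be fetched, while a link from the same page to `http://other.example.net/page.html` would not. Passing an empty pattern list would correspond to turning the option off.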
Igor Ranitovic
Extraction
None
Public
Date: 2007-03-14 00:06
Date: 2004-03-26 19:44

| Field | Old Value | Date | By |
|---|---|---|---|
| status_id | Open | 2004-03-26 19:44 | ia_igor |
| close_date | - | 2004-03-26 19:44 | ia_igor |
| category_id | None | 2004-02-17 22:38 | gojomo |
| priority | 5 | 2004-02-17 22:38 | gojomo |
| assigned_to | nobody | 2004-02-17 22:38 | gojomo |