Heritrix: Internet Archive Web Crawler

Tracker: Bugs

8 ExtractorCSS does not resolve relative URIs against BASE - ID: 1404316
Last Update: Comment added ( karl-ia )

If a document declares a BASE URL and has relative links within a
STYLE tag, ExtractorCSS does not resolve them against the
BASE URL.

Example: http://www.democrats.com/search/node.

In this case the bug creates an infinite URL trap.

From a glance at the code, it seems to be an easy fix: simply
change the curi.createAndAddLink call at line 152
to
curi.createAndAddLinkRelativeToBase...
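
A minimal sketch of why the choice of base matters, using only the JDK's java.net.URI rather than Heritrix's own URI classes. The class and method names, the BASE value, and the relative link are all hypothetical; only the page URL comes from this report. The loop assumes, purely for illustration, that the server answers every wrongly resolved URI with the same document, which is one way a trap of this kind could grow.

```java
import java.net.URI;

public class BaseResolutionSketch {
    // Resolve a relative link against whichever URI is chosen as the base.
    static String resolve(String base, String relative) {
        return URI.create(base).resolve(relative).toString();
    }

    public static void main(String[] args) {
        String pageUri = "http://www.democrats.com/search/node"; // page from the report
        String declaredBase = "http://www.democrats.com/";       // assumed BASE value
        String cssLink = "themes/x/style.css";                    // hypothetical relative link

        // Intended behaviour: resolve against the declared BASE.
        System.out.println("against BASE: " + resolve(declaredBase, cssLink));

        // Buggy behaviour: resolve against the page's own URI. If each result
        // is served as the same document again (an assumption), the path
        // gains another copy of the relative link on every hop.
        String current = pageUri;
        for (int hop = 1; hop <= 3; hop++) {
            current = resolve(current, cssLink);
            System.out.println("hop " + hop + ": " + current);
        }
    }
}
```

Resolving against the declared BASE yields the same URI no matter which page the STYLE block was found on, which is why it cannot feed this kind of growth.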


Igor Ranitovic ( ia_igor ) - 2006-01-12 23:07

8

Closed

None

Karl Thiessen

Extraction

1.8.0

Public


Comments ( 4 )

Date: 2007-03-14 01:04
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-534 -- please add further
comments at that location.


Date: 2006-05-05 00:07
Sender: karl-ia

A test with this bug number is in the harness; a new bug, number
1482197, has been filed against similar behaviour.

This one is verified and verified fixed. Closing.


Date: 2006-03-03 03:43
Sender: gojomo (Project Admin)

Further tests indicate that URLs in CSS files are always
resolved relative to the containing CSS file -- *not* the HTML
document (or HTML document BASE) from which the CSS was
included via "<LINK REL=StyleSheet HREF=" or "@import".

So, of the 1-4 steps listed in my previous comment, neither
(2) nor (4) makes sense.
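
The resulting rule can be sketched as a small base-selection function. This is an illustration under stated assumptions, not Heritrix code: the class, enum, and method names are hypothetical, and java.net.URI stands in for the crawler's own URI type.

```java
import java.net.URI;

public class CssBaseChoiceSketch {
    /** Where the CSS text was found (hypothetical classification). */
    enum CssSource { INLINE_IN_HTML, STANDALONE_FILE }

    /**
     * Pick the URI against which relative links in CSS are resolved.
     * ownUri is the URI of the resource being processed; declaredBase is
     * the HTML document's BASE, or null if none was declared.
     */
    static URI chooseBase(CssSource source, URI ownUri, URI declaredBase) {
        if (source == CssSource.INLINE_IN_HTML && declaredBase != null) {
            return declaredBase; // inline CSS follows the document BASE
        }
        return ownUri;           // standalone CSS: always its own location
    }

    public static void main(String[] args) {
        URI sheet = URI.create("http://example.com/css/site.css");
        URI page = URI.create("http://example.com/pages/index.html");
        URI base = URI.create("http://example.com/assets/");

        // Standalone sheet: the including page's BASE is ignored.
        System.out.println(
            chooseBase(CssSource.STANDALONE_FILE, sheet, base).resolve("img/bg.png"));
        // Inline style: the declared BASE wins over the page URI.
        System.out.println(
            chooseBase(CssSource.INLINE_IN_HTML, page, base).resolve("img/bg.png"));
    }
}
```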

(1) is implementable by Igor's original suggestion; (3) is
implementable by handling the STYLE attribute specially.
I've implemented both.

Commit comment:
Fix for [ 1404316 ] ExtractorCSS does not resolve relative
URIs against BASE
* ExtractorCSS.java
derelativize URLs inside inline <STYLE> elements with
respect to declared document BASE
* ExtractorHTML.java
handle STYLE="" attributes like other CSS content
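
A rough sketch of what treating a STYLE="" attribute value like other CSS content might involve: find each url(...) token and resolve it against the chosen base. The regular expression, class name, and method name here are assumptions for illustration and are not taken from ExtractorCSS.java or ExtractorHTML.java.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StyleAttributeSketch {
    // Matches url(...) with optional single or double quotes around the
    // target. Deliberately simplified; escapes inside CSS are not handled.
    private static final Pattern CSS_URL =
        Pattern.compile("url\\(\\s*(['\"]?)(.*?)\\1\\s*\\)", Pattern.CASE_INSENSITIVE);

    /**
     * Return every url(...) target in the CSS text, resolved against base.
     * URI.resolve throws IllegalArgumentException for a malformed target;
     * a real extractor would need to catch that.
     */
    static List<String> extractLinks(String css, URI base) {
        List<String> links = new ArrayList<>();
        Matcher m = CSS_URL.matcher(css);
        while (m.find()) {
            links.add(base.resolve(m.group(2)).toString());
        }
        return links;
    }

    public static void main(String[] args) {
        // Hypothetical value of a STYLE attribute on some HTML tag.
        String styleAttr = "background: url('img/bg.png'); list-style-image: url(bullet.gif)";
        URI base = URI.create("http://example.com/base/"); // assumed document BASE
        for (String link : extractLinks(styleAttr, base)) {
            System.out.println(link);
        }
    }
}
```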

Assigning to Karl for verification/close.


Date: 2006-01-17 01:04
Sender: gojomo (Project Admin)

The easy fix would only work for inline styles -- only then
does the current CrawlURI have a BASE.

For CSS brought in by LINK/@HREF, the current CrawlURI is
that of the CSS file, which declares no base. Only the 'via'
is available... and unlike in ExtractorJS, we're not even using that.

To carry forward the BASE, some sort of
attribute-inheritance as discussed in other issues (like
"1289245 n-hops-off decide rule / focus plus N hops
scoping") may be necessary.

Also, thinking of URIs in CSS, I believe they could also
appear inside a STYLE attribute on any HTML tag -- a case we
currently don't handle.

So at least 4 improvements are justified (roughly in order
of increasing difficulty):
(1) Use BASE when available for full inline STYLE element
sheets;
(2) Use 'via' when available as base for standalone CSS
resources;
(3) Find URIs in contents of STYLE attributes (using BASE as
above);
(4) Carry-forward the BASE of the 'via' for use in
standalone CSS resources. (ExtractorJS should probably get
the same logic: everyplace it uses 'via' it should use the
'via's BASE if available.)




Changes ( 5 )

Field              Old Value  Date              By
status_id          Open       2006-05-05 00:07  karl-ia
close_date         -          2006-05-05 00:07  karl-ia
artifact_group_id  None       2006-03-17 23:13  gojomo
assigned_to        gojomo     2006-03-03 03:43  gojomo
assigned_to        nobody     2006-01-17 01:04  gojomo