Heritrix: Internet Archive Web Crawler

Tracker: Bugs

7 Aggressive extraction of `for' attributes - ID: 1181892
Last Update: Comment added ( karl-ia )

We got a complaint from a webmaster that we are generating
a lot of bogus requests because we treat values of the
HTML FOR attribute as relative URLs.

Since the value of a FOR attribute can be a relative URI, we
should probably continue to extract them. However, to
reduce bad requests, we should probably apply the
LIKELY_URI_PATH rule to determine whether a FOR value is
likely to be a relative URL.
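
The proposed filtering could look roughly like the sketch below. It is only an illustration: the class name, method name, and pattern are hypothetical stand-ins, not the actual LIKELY_URI_PATH rule in Heritrix. The assumed heuristic is that a FOR value is worth queueing only if it has no whitespace or quotes and contains either a '/' or a dot-separated suffix.

```java
import java.util.regex.Pattern;

class ForValueFilter {
    // Hypothetical stand-in for the LIKELY_URI_PATH rule (not Heritrix's real pattern).
    // Matches values without whitespace or quotes that either contain a '/'
    // or end in a dot-separated suffix such as ".js".
    private static final Pattern LIKELY_PATH = Pattern.compile(
        "[^\\s\"']*(/[^\\s\"']*|\\.[^\\s\"'./]+)");

    // Returns true only when the FOR value plausibly names a relative path.
    static boolean looksLikeRelativeUri(String value) {
        return value != null && LIKELY_PATH.matcher(value).matches();
    }
}
```

Under this assumption, a typical label target such as "email" would be skipped, while "handler.js" or "scripts/handler" would still be extracted. An element ID that happens to contain a dot would still slip through, so a rule like this reduces bad requests rather than eliminating them.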


--- A note from a webmaster ---
However, there seems to be a problem with the spider's interpretation
of certain HTML constructs, particularly the 'for' attribute of
'label' elements. Apparently the spider interprets these attributes as
relative links and subsequently tries to retrieve those from the
server. This, of course, doesn't make sense since these attributes are
meant to refer to elements inside the document, not external resources
(see http://www.w3.org/TR/html4/interact/forms.html#edef-LABEL).



Submitted: Igor Ranitovic ( ia_igor ) - 2005-04-12 23:38
Priority: 7
Status: Closed
Resolution: Fixed
Assigned: Karl Thiessen
Category: Extraction
Group: 1.6.0
Visibility: Public


Comments ( 4 )

Date: 2007-03-14 00:22
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-395 -- please add further
comments at that location.


Date: 2005-06-09 19:49
Sender: karl-ia

Logged In: YES
user_id=1269624

Tested in harness: bug verified on previous builds, absence
of bug verified in current build (Heritrix
1.5.0-200506091041). Closing bug.


Date: 2005-06-09 17:40
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

This section of the HTML4 DTD had given me the impression
'for' could wind up being used on SCRIPT tags for some URI
purpose:

<!ATTLIST SCRIPT
  charset  %Charset;      #IMPLIED   -- char encoding of linked resource --
  type     %ContentType;  #REQUIRED  -- content type of script language --
  src      %URI;          #IMPLIED   -- URI for an external script --
  defer    (defer)        #IMPLIED   -- UA may defer execution of script --
  event    CDATA          #IMPLIED   -- reserved for possible future use --
  for      %URI;          #IMPLIED   -- reserved for possible future use --
  >

However, I don't think we've ever seen such usage, and we've
received multiple complaints about 'for' attribute values being
extracted as relative URIs. So, removing that extraction from the regexp.

Commit comment:

Fix for [ 1181892 ] Aggressive extraction of `for' attributes
* ExtractorHTML.java
remove 'for' attribute-match from extractor regexp

Considered fixed, assigning to Karl for verification.
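
For readers unfamiliar with the extractor, the effect of the change described in the commit comment can be sketched as follows. The patterns here are deliberately simplified and hypothetical; the real regexp in ExtractorHTML.java is more involved and covers other attributes. The sketch only shows that dropping 'for' from the attribute alternation stops label targets from being collected as link candidates.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AttributeLinkSketch {
    // Hypothetical "before" pattern: 'for' values are treated as link candidates.
    static final Pattern WITH_FOR = Pattern.compile(
        "(?i)\\s(href|src|for)\\s*=\\s*[\"']([^\"']*)[\"']");

    // Hypothetical "after" pattern: 'for' removed from the attribute alternation.
    static final Pattern WITHOUT_FOR = Pattern.compile(
        "(?i)\\s(href|src)\\s*=\\s*[\"']([^\"']*)[\"']");

    // Collects the quoted attribute values matched by the given pattern.
    static List<String> extract(Pattern p, String html) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(html);
        while (m.find()) {
            out.add(m.group(2));
        }
        return out;
    }
}
```

Given a fragment containing both a label with for="email" and an anchor with href="about.html", the first pattern would collect both values, while the second would collect only the href value.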


Date: 2005-04-15 01:30
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

See also '[ 1075982 ] Overeager extraction can lead to
side-effects for admins'.


Changes ( 7 )

Field              Old Value  Date              By
artifact_group_id  None       2005-09-23 18:27  gojomo
close_date         -          2005-06-09 19:49  karl-ia
status_id          Open       2005-06-09 19:49  karl-ia
resolution_id      None       2005-06-09 19:49  karl-ia
assigned_to        gojomo     2005-06-09 17:40  gojomo
priority           5          2005-06-09 17:26  gojomo
assigned_to        nobody     2005-06-09 17:26  gojomo