Heritrix: Internet Archive Web Crawler

Tracker: Bugs

3 Parsing links found between escaped quotes in JavaScript - ID: 831480
Last Update: Comment added ( karl-ia )

Heritrix does not correctly parse links found between
escaped quotes in JavaScript.

Example:
document.write("<a href=\"http://a.com/aPage.html\">
test </a><br>");

Expected result:
http://a.com/aPage.html
should be added to the list of discovered URLs.

Current result:
http://a.com/aPage.html%5C
is added to the list of discovered URLs.


Submitted: Igor Ranitovic ( ia_igor ) - 2003-10-28 01:05
Priority: 3
Status: Closed
Resolution: Fixed
Assigned To: Igor Ranitovic
Category: Extraction
Group: None
Visibility: Public


Comments ( 3 )

Date: 2007-03-14 00:06
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-38 -- please add further
comments at that location.


Date: 2004-03-26 19:58
Sender: ia_igor (Project Admin)


The fix was replaced by a new pattern that matches strings within
JavaScript code.




Date: 2004-01-07 01:51
Sender: ia_igor (Project Admin)


I made a likely fix for this problem.
The regular expression used to parse possible URLs from JavaScript
matched only candidates found between plain single or double
quotes.
I modified the pattern so that it also matches possible URLs between
escaped quotes:

(\\*"|\\*')(\.{0,2}[^+\.\n\r\s"']+[^\.\n\r\s"']*(\.[^\.\n\r\s"']+)+)(\1)
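
The effect of the modified pattern can be checked with a small sketch.
Heritrix itself is written in Java; Python's re module is used here only
for brevity, since the pattern relies on constructs both engines support.
NAIVE_PATTERN is a hypothetical stand-in for the pre-fix expression, which
is not quoted in this report: it is assumed to differ only in accepting
plain quotes as delimiters. SAMPLE is the example from the bug description,
joined onto one line.

```python
import re
from urllib.parse import quote

# Pattern from the comment above: the delimiter group accepts any number
# of backslashes before the quote, and \1 requires the same closing delimiter.
FIXED_PATTERN = re.compile(
    r'''(\\*"|\\*')(\.{0,2}[^+\.\n\r\s"']+[^\.\n\r\s"']*(\.[^\.\n\r\s"']+)+)(\1)'''
)

# ASSUMPTION: a plain-quote-only variant, used purely for comparison.
# The actual pre-fix Heritrix pattern is not given in the report.
NAIVE_PATTERN = re.compile(
    r'''("|')(\.{0,2}[^+\.\n\r\s"']+[^\.\n\r\s"']*(\.[^\.\n\r\s"']+)+)(\1)'''
)

# The JavaScript from the bug description, with literal backslashes.
SAMPLE = r'document.write("<a href=\"http://a.com/aPage.html\">test </a><br>");'


def extract(pattern, text):
    """Return the candidate URL strings (group 2) found in text."""
    return [m.group(2) for m in pattern.finditer(text)]


fixed = extract(FIXED_PATTERN, SAMPLE)
naive = extract(NAIVE_PATTERN, SAMPLE)

print(fixed)  # candidate without the trailing backslash
print(naive)  # candidate that still carries the trailing backslash
# Percent-encoding the naive candidate reproduces the reported URL.
print(quote(naive[0], safe=":/"))
```

Under these assumptions the plain-quote variant captures the backslash
that precedes the closing quote, which becomes %5C once the URL is
percent-encoded, while the modified pattern treats \" as the delimiter
and leaves the backslash out of the captured candidate.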



Attached File

No Files Currently Attached

Changes ( 4 )

Field          Old Value  Date              By
status_id      Open       2004-03-26 19:58  ia_igor
resolution_id  None       2004-03-26 19:58  ia_igor
close_date     -          2004-03-26 19:58  ia_igor
assigned_to    nobody     2004-01-07 00:53  gojomo