#17 Improper Handling of Relative Links in HTML Spider

open
nobody
None
5
2009-09-05
2009-09-05
S S
No

Relative links are not handled properly in the HTML Spider. Specifically...

Page URL:
http://blah.blahblah.com/some-thing_1/22,2341,0.html

Relative Link URL:
some-video-1.avi

becomes...
http://blah.blahblah.com/some-thing_1/22,2341,0.html/some-video-1.avi

instead of:
http://blah.blahblah.com/some-thing_1/some-video-1.avi
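
For reference, the expected result above matches what a standard URL resolver produces. A quick check, using Python's `urllib.parse.urljoin` purely as an illustration (the Spider itself may be written in another language):

```python
from urllib.parse import urljoin

# URLs taken from the report above.
page_url = "http://blah.blahblah.com/some-thing_1/22,2341,0.html"
link_url = "some-video-1.avi"

# urljoin resolves the relative link against the page URL,
# replacing the last path segment of the page URL.
resolved = urljoin(page_url, link_url)
print(resolved)
# -> http://blah.blahblah.com/some-thing_1/some-video-1.avi
```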

The desired behavior is to check whether or not the link begins with a protocol (like http://, ftp://, https://, file:///, etc.):
  - If it does, it is NOT a relative link and can be taken as-is.
  - If it does not, it IS a relative link, and you should chop off everything after the last delimiter (i.e., "/") in the page URL and append the link URL.

[pseudo-code]
page_url := the URL of the page that the link resides in
link_url := the link's URL
relative := true
delimiter := the delimiter of URL paths, "/"
protocols := [a set of protocols]
for each protocol in protocols
    if link_url.startsWith( protocol )
    then
        relative := false
        break loop
if relative = true
then
    // keep the trailing delimiter (hence the + 1) so the two parts join as ".../some-thing_1/some-video-1.avi"
    link_url := page_url.substring( 0, page_url.lastIndexOf( delimiter ) + 1 ) + link_url
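
The pseudo-code above could be sketched in runnable form roughly as follows. This is only an illustration in Python, not the Spider's actual code: the function name `resolve_link` and the contents of `PROTOCOLS` are assumptions made for the example. Note that the page URL is cut just after the last "/" so that the delimiter is kept.

```python
# Hypothetical protocol list; the real set would be whatever the Spider supports.
PROTOCOLS = ("http://", "https://", "ftp://", "file:///")
DELIMITER = "/"


def resolve_link(page_url, link_url, protocols=PROTOCOLS):
    """Return link_url unchanged if it starts with a known protocol,
    otherwise resolve it against the directory part of page_url."""
    # str.startswith accepts a tuple of prefixes, so no explicit loop is needed.
    if link_url.startswith(tuple(protocols)):
        return link_url
    # Keep everything up to and including the last delimiter of the page URL.
    base = page_url[: page_url.rfind(DELIMITER) + 1]
    return base + link_url


page = "http://blah.blahblah.com/some-thing_1/22,2341,0.html"
print(resolve_link(page, "some-video-1.avi"))
# -> http://blah.blahblah.com/some-thing_1/some-video-1.avi
```

This sketch only covers the case described in this report. Links starting with "/" or "../", and page URLs with no path at all (e.g. "http://host"), would need extra handling or a full URL resolver.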

Sorry if I stated a bunch of stuff you already knew. =)
