#39 WebCrawler needs correct handling of faulty URLs in links

closed
nobody
None
5
2008-05-23
2008-04-28
Anonymous
No

A problem has been reported:

I'm seeing a weird issue when crawling a website. I pointed the
WebCrawler at http://www.condenast.com and am getting:

java.lang.IllegalArgumentException: java.net.URISyntaxException: Illegal character in path at index 70: https://www.magazinestoresubscriptions.com/webapp/wcs/stores/servlet/' + url + '
    at org.ontoware.rdf2go.model.node.impl.URIImpl.<init>(URIImpl.java:51)
    at org.ontoware.rdf2go.model.node.impl.URIImpl.<init>(URIImpl.java:36)
    at org.ontoware.rdf2go.model.impl.AbstractModel.createURI(AbstractModel.java:139)
    at org.semanticdesktop.aperture.crawler.web.WebCrawler.processLinks(WebCrawler.java:544)
    at org.semanticdesktop.aperture.crawler.web.WebCrawler.processQueue(WebCrawler.java:320)
    at org.semanticdesktop.aperture.crawler.web.WebCrawler.crawlObjects(WebCrawler.java:135)
    at org.semanticdesktop.aperture.crawler.base.CrawlerBase.crawl(CrawlerBase.java:197)

I had a crawl depth of 2. My guess is that the crawler isn't handling
<SCRIPT> tags correctly (i.e. it should not try to extract links from
<SCRIPT> tags). Is that what you guys see?

From the looks of HTMLLinkExtractor, it seems like it should be fine
(I don't see SCRIPT among the tags that are handled), so I must be
missing something.
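
For what it's worth, here is a rough sketch of what I mean, written against the standard Swing HTML parser rather than Aperture's actual HTMLLinkExtractor (the class name and structure are purely illustrative): a callback that tracks whether it is inside a <SCRIPT> block and skips anchors found there.

    import java.io.Reader;
    import java.io.StringReader;
    import java.util.ArrayList;
    import java.util.List;
    import javax.swing.text.MutableAttributeSet;
    import javax.swing.text.html.HTML;
    import javax.swing.text.html.HTMLEditorKit;
    import javax.swing.text.html.parser.ParserDelegator;

    // Illustrative only: not Aperture code.
    public class ScriptAwareLinkExtractor extends HTMLEditorKit.ParserCallback {

        private final List<String> links = new ArrayList<String>();
        private boolean insideScript = false;

        public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
            if (tag == HTML.Tag.SCRIPT) {
                insideScript = true; // ignore everything until the matching </SCRIPT>
            } else if (tag == HTML.Tag.A && !insideScript) {
                Object href = attrs.getAttribute(HTML.Attribute.HREF);
                if (href != null) {
                    links.add(href.toString());
                }
            }
        }

        public void handleEndTag(HTML.Tag tag, int pos) {
            if (tag == HTML.Tag.SCRIPT) {
                insideScript = false;
            }
        }

        public List<String> extract(Reader html) throws Exception {
            new ParserDelegator().parse(html, this, true);
            return links;
        }

        public static void main(String[] args) throws Exception {
            String html = "<html><body>"
                + "<script>var u = 'not a link';</script>"
                + "<a href=\"http://www.condenast.com/\">ok</a>"
                + "</body></html>";
            // Only the real anchor ends up in the list; script content is skipped.
            System.out.println(new ScriptAwareLinkExtractor().extract(new StringReader(html)));
        }
    }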

The page in question is reached by clicking the "Subscribe" link on the
main landing page; the bad link then shows up on the ensuing page.

The failure seems to bubble all the way up to my app and kills the
crawl. Does Aperture have a way of either logging the failure and
continuing, or at least returning a set of failed links? I think
either normalizeLink needs more checking, or we need to handle the
exception from createURI() better.
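
For example, something along these lines would let the crawl continue: a hypothetical helper (SafeUriFactory and its methods are made up for illustration, not existing Aperture API) that wraps createURI, logs the bad link, remembers it, and returns null instead of letting the IllegalArgumentException escape.

    import java.util.LinkedHashSet;
    import java.util.Set;

    import org.ontoware.rdf2go.model.Model;
    import org.ontoware.rdf2go.model.node.URI;

    // Hypothetical sketch, not Aperture API.
    public class SafeUriFactory {

        private final Model model;
        private final Set<String> failedLinks = new LinkedHashSet<String>();

        public SafeUriFactory(Model model) {
            this.model = model;
        }

        /** Returns the URI for the link, or null if the link is not a legal URI. */
        public URI tryCreateUri(String link) {
            try {
                return model.createURI(link);
            } catch (IllegalArgumentException e) {
                // e.g. "Illegal character in path at index 70: ..."
                System.err.println("Skipping malformed link: " + link
                    + " (" + e.getMessage() + ")");
                failedLinks.add(link);
                return null;
            }
        }

        /** The links that were skipped, so the caller can report them after the crawl. */
        public Set<String> getFailedLinks() {
            return failedLinks;
        }
    }

A caller such as processLinks could then simply skip null results and fetch getFailedLinks() once the crawl finishes.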

Also, while debugging I noticed that the HTMLLinkExtractor "links"
member variable is a List, and that when crawling this site there are
duplicates in the list. Do duplicates get removed later?
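
If they are not removed anywhere, a simple order-preserving pass like this (plain java.util, nothing Aperture-specific) would drop them:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.LinkedHashSet;
    import java.util.List;

    public class LinkDedup {

        /** Removes duplicate links while keeping their first-seen order. */
        public static List<String> dedupe(List<String> links) {
            return new ArrayList<String>(new LinkedHashSet<String>(links));
        }

        public static void main(String[] args) {
            List<String> links = Arrays.asList(
                "http://www.condenast.com/subscribe",
                "http://www.condenast.com/about",
                "http://www.condenast.com/subscribe");
            System.out.println(dedupe(links)); // only the two distinct links remain
        }
    }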

The condenast website has been fixed, and the link extractor now ignores any tags that occur inside <SCRIPT></SCRIPT>. But the underlying problem may not have been fixed. This issue is to investigate the treatment of faulty URLs in links and prevent them from breaking the crawl.
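
One possible direction, sketched here under the assumption that links can be pre-validated before createURI is called (the LinkValidator class below is hypothetical, not part of Aperture), is to check each candidate with java.net.URI and drop anything that does not parse, so a single faulty link is logged and skipped rather than aborting the crawl.

    import java.net.URI;
    import java.net.URISyntaxException;

    // Hypothetical sketch, not part of Aperture.
    public class LinkValidator {

        /** Returns true if the link parses as an absolute http or https URI. */
        public static boolean isCrawlableLink(String link) {
            try {
                URI uri = new URI(link);
                String scheme = uri.getScheme();
                return scheme != null
                    && (scheme.equalsIgnoreCase("http") || scheme.equalsIgnoreCase("https"));
            } catch (URISyntaxException e) {
                // e.g. the ".../servlet/' + url + '" string from the report above
                return false;
            }
        }

        public static void main(String[] args) {
            System.out.println(isCrawlableLink("http://www.condenast.com/"));  // true
            System.out.println(isCrawlableLink(
                "https://www.magazinestoresubscriptions.com/webapp/wcs/stores/servlet/' + url + '")); // false
        }
    }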

Discussion

  • user_id=1917080

    This is now fixed.

  • Antoni Mylka
    2008-05-23

    • status: open --> closed
  • Antoni Mylka
    2008-05-23

    If it's fixed, I'm closing this issue.