A problem has been reported:
I'm seeing a weird issue when crawling a website. I pointed the
WebCrawler at http://www.condenast.com and am getting:
Illegal character in path at index 70: https://www.magazinestoresubscriptions.com/webapp/wcs/stores/servlet/'
[java] url +
[java] at org.semanticdesktop.aperture.crawler.web.WebCrawler.processLinks
[java] at org.semanticdesktop.aperture.crawler.web.WebCrawler.processQueue
[java] at org.semanticdesktop.aperture.crawler.web.WebCrawler.crawlObjects
I had a crawl depth of 2. My guess is that the crawler isn't handling
<SCRIPT> tags correctly (i.e. it should not try to extract links from
<SCRIPT> tags). Is that what you guys see?
From the looks of HTMLLinkExtractor, it seems like it should be fine,
as I don't see SCRIPT among the tags that are handled, so I must be
missing something.
The link is hit by clicking the "Subscribe" link on the main landing
page and then the bad link shows up on the ensuing page.
The failure bubbles all the way up to my app and kills the
crawl. Does Aperture have a way of either logging the failure and
continuing, or at least returning a set of failed links? I think
either normalizeLink needs more checking, or we need to handle the
exception from createURI() better.
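The "log and continue" behavior suggested above could be sketched as follows. This is a hypothetical illustration, not Aperture's actual API: the method name `filterValidLinks` and the `failed` collection are assumptions, and `java.net.URI` stands in for whatever createURI() does internally.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LinkSanitizer {

    /**
     * Returns only the candidate links that parse as valid URIs,
     * collecting the rest into {@code failed} so the caller can log
     * or report them. Sketch only; not Aperture API.
     */
    public static List<URI> filterValidLinks(List<String> candidates,
                                             List<String> failed) {
        List<URI> valid = new ArrayList<>();
        for (String link : candidates) {
            try {
                valid.add(new URI(link));
            } catch (URISyntaxException e) {
                // Record the failure instead of letting the exception
                // bubble up and kill the whole crawl.
                failed.add(link + " (" + e.getReason() + ")");
            }
        }
        return valid;
    }

    public static void main(String[] args) {
        List<String> failed = new ArrayList<>();
        List<URI> ok = filterValidLinks(
                Arrays.asList("https://example.com/a", "http://bad host/x"),
                failed);
        System.out.println(ok.size() + " valid, " + failed.size() + " failed");
        // prints "1 valid, 1 failed"
    }
}
```

The crawler could then surface the `failed` list to the application at the end of the crawl instead of aborting on the first bad link.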
Also, while debugging I noticed that HTMLLinkExtractor's "links"
member variable is a List, and that crawling this site puts
duplicates in it. Do duplicates get removed later?
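Whether duplicates are removed later is the open question above, but one way to avoid them at the source would be to collect links into a LinkedHashSet, which drops repeats while preserving first-seen order. A minimal sketch (not HTMLLinkExtractor's actual code):

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

public class LinkDedup {

    /** Drops duplicate links while keeping first-seen order. */
    public static List<String> dedupe(List<String> links) {
        return List.copyOf(new LinkedHashSet<>(links));
    }

    public static void main(String[] args) {
        List<String> links = Arrays.asList("/a", "/b", "/a", "/c", "/b");
        System.out.println(dedupe(links)); // prints "[/a, /b, /c]"
    }
}
```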
The condenast case has been fixed: the link extractor now ignores any tags that occur inside <SCRIPT> </SCRIPT>. But the underlying problem may not have been fixed. The remaining issue is to investigate the treatment of faulty URLs in links and prevent them from breaking the crawl.
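The "ignore tags inside SCRIPT" behavior can be sketched by stripping script blocks before extracting href values, so that string literals inside JavaScript are never mistaken for links. This is only an illustration of the described fix, not the actual HTMLLinkExtractor implementation:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ScriptAwareExtractor {

    // Remove <script>...</script> blocks (case-insensitive, across
    // newlines) before looking for links.
    private static final Pattern SCRIPT =
            Pattern.compile("(?is)<script\\b.*?</script>");
    private static final Pattern HREF =
            Pattern.compile("(?i)href\\s*=\\s*\"([^\"]*)\"");

    public static List<String> extractLinks(String html) {
        String withoutScripts = SCRIPT.matcher(html).replaceAll("");
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(withoutScripts);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"/subscribe\">Subscribe</a>"
                + "<script>var u = 'href=\"/servlet/bogus\"';</script>";
        System.out.println(extractLinks(html)); // prints "[/subscribe]"
    }
}
```

A real extractor would use an HTML parser rather than regexes, but the state to track is the same: anything between an opening and closing SCRIPT tag is not link-bearing markup.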