From: Antoni M. <ant...@gm...> - 2008-04-28 20:43:11
Grant Ingersoll wrote:
> I'm seeing a weird issue when crawling a website. I pointed the
> WebCrawler at http://www.condenast.com and am getting:
>
> java.lang.IllegalArgumentException: java.net.URISyntaxException:
> Illegal character in path at index 70:
> https://www.magazinestoresubscriptions.com/webapp/wcs/stores/servlet/' + url + '
>     at org.ontoware.rdf2go.model.node.impl.URIImpl.<init>(URIImpl.java:51)
>     at org.ontoware.rdf2go.model.node.impl.URIImpl.<init>(URIImpl.java:36)
>     at org.ontoware.rdf2go.model.impl.AbstractModel.createURI(AbstractModel.java:139)
>     at org.semanticdesktop.aperture.crawler.web.WebCrawler.processLinks(WebCrawler.java:544)
>     at org.semanticdesktop.aperture.crawler.web.WebCrawler.processQueue(WebCrawler.java:320)
>     at org.semanticdesktop.aperture.crawler.web.WebCrawler.crawlObjects(WebCrawler.java:135)
>     at org.semanticdesktop.aperture.crawler.base.CrawlerBase.crawl(CrawlerBase.java:197)
>
> I had a crawl depth of 2. My guess is that the crawler isn't handling
> <SCRIPT> tags correctly (i.e. it should not try to extract links from
> <SCRIPT> tags). Is that what you guys see?
>
> From the looks of HTMLLinkExtractor, it seems like it should be fine,
> as I don't see SCRIPT in the tags that are handled, so I must be
> missing something.
>
> The link is hit by clicking the "Subscribe" link on the main landing
> page, and then the bad link shows up on the ensuing page.
>
> The failure seems to bubble all the way up to my app and kills the
> crawl. Does Aperture have a way of either logging the failure and
> continuing on, or at least returning a set of failed links?
> I think either normalizeLink needs more checking, or we need to deal
> better with the exception in createURI().
>
> Also, I noticed when debugging that the HTMLLinkExtractor "links"
> member variable is a List, but I also noticed when crawling that site
> that there are duplicates in the list. Do duplicates get removed
> later?
>
> Thanks,
> Grant

1. I fixed this particular problem. The link extractor now ignores any
"tags" that appear inside <SCRIPT>. I added a unit test in
HtmlLinkExtractorTest.java.

2. I didn't touch the WebCrawler code. The issue you mention is indeed
valid. I created a SourceForge issue: <http://tinyurl.com/6oxlsy>. All
kinds of comments are welcome.

Antoni Mylka
ant...@gm...
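For anyone who hits the same crawl-killing exception before a fixed release: the defensive pattern Grant asks about (validate each extracted link, collect the bad ones instead of aborting, and drop duplicates) can be sketched in plain Java. This is only an illustration, not Aperture's actual code: java.net.URI stands in for rdf2go's createURI(), and the class and method names below are made up for the example.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class LinkFilter {

    /**
     * Returns the well-formed links in first-seen order and adds the
     * malformed ones to the supplied 'failed' list, instead of letting
     * one bad link (e.g. a JavaScript fragment like "...servlet/' + url + '")
     * kill the whole crawl. The real WebCrawler.processLinks() would need
     * an equivalent try/catch around its URI creation.
     */
    public static List<String> filterLinks(List<String> rawLinks,
                                           List<String> failed) {
        // LinkedHashSet dedupes while preserving insertion order
        Set<String> seen = new LinkedHashSet<>();
        for (String link : rawLinks) {
            try {
                new URI(link);      // throws on illegal characters
                seen.add(link);     // the Set silently drops duplicates
            } catch (URISyntaxException e) {
                failed.add(link);   // record and continue, don't rethrow
            }
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        List<String> failed = new ArrayList<>();
        List<String> good = filterLinks(List.of(
                "http://www.condenast.com/",
                "http://www.condenast.com/",                  // duplicate
                "https://example.com/servlet/' + url + '"),   // JS fragment
                failed);
        // prints "1 good, 1 failed"
        System.out.println(good.size() + " good, " + failed.size() + " failed");
    }
}
```

A Set for the extractor's "links" member would answer the duplicate question at the source; using LinkedHashSet rather than HashSet keeps the crawl order deterministic.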