From: Antoni M. <ant...@gm...> - 2008-04-28 20:43:11
Grant Ingersoll wrote:
> I'm seeing a weird issue when crawling a website. I pointed the
> WebCrawler at http://www.condenast.com and am getting:
>
> java.lang.IllegalArgumentException: java.net.URISyntaxException:
> Illegal character in path at index 70:
> https://www.magazinestoresubscriptions.com/webapp/wcs/stores/servlet/' + url + '
>     at org.ontoware.rdf2go.model.node.impl.URIImpl.<init>(URIImpl.java:51)
>     at org.ontoware.rdf2go.model.node.impl.URIImpl.<init>(URIImpl.java:36)
>     at org.ontoware.rdf2go.model.impl.AbstractModel.createURI(AbstractModel.java:139)
>     at org.semanticdesktop.aperture.crawler.web.WebCrawler.processLinks(WebCrawler.java:544)
>     at org.semanticdesktop.aperture.crawler.web.WebCrawler.processQueue(WebCrawler.java:320)
>     at org.semanticdesktop.aperture.crawler.web.WebCrawler.crawlObjects(WebCrawler.java:135)
>     at org.semanticdesktop.aperture.crawler.base.CrawlerBase.crawl(CrawlerBase.java:197)
>
> I had a crawl depth of 2. My guess is that the crawler isn't handling
> <SCRIPT> tags correctly (i.e. it should not try to extract links from
> <SCRIPT> tags). Is that what you guys see?
>
> From the looks of HTMLLinkExtractor, it seems like it should be fine,
> as I don't see SCRIPT in the tags that are handled, so I must be
> missing something.
>
> The link is hit by clicking the "Subscribe" link on the main landing
> page, and then the bad link shows up on the ensuing page.
>
> The failure seems to bubble all the way up to my app and kills the
> crawl. Does Aperture have a way of either logging the failure and
> continuing on, or at least returning a set of failed links?
> I think either normalizeLink needs more checking, or we need to deal
> better with the exception in createURI().
>
> Also, I noticed when debugging that the HTMLLinkExtractor "links"
> member variable is a List, but I also noticed when crawling that site
> that there are duplicates in the list. Do duplicates get removed
> later?
>
> Thanks,
> Grant

1. I fixed this particular problem. The link extractor now ignores any
"tags" that appear inside <SCRIPT>. I added a unit test in
HtmlLinkExtractorTest.java.

2. I didn't touch the WebCrawler code. The issue you mention is indeed
valid. I created a SourceForge issue: <http://tinyurl.com/6oxlsy>. All
kinds of comments are welcome.

Antoni Mylka
ant...@gm...
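For anyone who hits the same crawl-killing exception before a fixed release: the defensive pattern Grant asks about (validate each extracted link, collect the bad ones instead of aborting, and drop duplicates) can be sketched in plain Java. This is only an illustration, not Aperture's actual code: java.net.URI stands in for rdf2go's createURI(), and the class and method names below are made up for the example.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class LinkFilter {

    /**
     * Returns the well-formed links in first-seen order and adds the
     * malformed ones to the supplied 'failed' list, instead of letting
     * one bad link (e.g. a JavaScript fragment like "...servlet/' + url + '")
     * kill the whole crawl. The real WebCrawler.processLinks() would need
     * an equivalent try/catch around its URI creation.
     */
    public static List<String> filterLinks(List<String> rawLinks,
                                           List<String> failed) {
        // LinkedHashSet dedupes while preserving insertion order
        Set<String> seen = new LinkedHashSet<>();
        for (String link : rawLinks) {
            try {
                new URI(link);      // throws on illegal characters
                seen.add(link);     // the Set silently drops duplicates
            } catch (URISyntaxException e) {
                failed.add(link);   // record and continue, don't rethrow
            }
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        List<String> failed = new ArrayList<>();
        List<String> good = filterLinks(List.of(
                "http://www.condenast.com/",
                "http://www.condenast.com/",                  // duplicate
                "https://example.com/servlet/' + url + '"),   // JS fragment
                failed);
        // prints "1 good, 1 failed"
        System.out.println(good.size() + " good, " + failed.size() + " failed");
    }
}
```

A Set for the extractor's "links" member would answer the duplicate question at the source; using LinkedHashSet rather than HashSet keeps the crawl order deterministic.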