From: Antoni M. <ant...@gm...> - 2008-05-16 22:25:26
Grant Ingersoll wrote:
> FWIW:
> Index: src/java/org/semanticdesktop/aperture/crawler/web/WebCrawler.java
> ===================================================================
> --- src/java/org/semanticdesktop/aperture/crawler/web/WebCrawler.java (revision 1290)
> +++ src/java/org/semanticdesktop/aperture/crawler/web/WebCrawler.java (working copy)
> @@ -417,7 +417,9 @@
>              //deprecatedUrls.add(url);
>              reportDeletedDataObject(url);
>          } else {
> -            accessData.remove(url);
> +            if (accessData != null) {
> +                accessData.remove(url);
> +            }
>          }
>
>          // furthermore we should not list this object as accessed any longer; when it can be accessed normally
>
> Seems to fix this immediate issue, but I don't know if there are any
> other things associated with it.

Thanks, I applied it. I should have spotted this one.

> These lines around 168 of initialize() seem a bit funky to me, but I
> don't fully grok the AccessData stuff:
>
>     if (accessData == null) {
>         crawledUrls = new HashSet<String>(1024);
>     } else {
>         wad = new WebAccessData(accessData);
>     }

The overall problem is that the crawler needs to mark the URLs it has already crawled, to prevent it from going in circles. In 1.0.1 there was no isTouched method in AccessData, so the crawler maintained a set of all crawledUrls. Now the new AccessData interface has the isTouched method, which does exactly the same thing without the memory overhead of a potentially VERY big set. The crawler still needs to work even if there is no AccessData available, though; in that case it falls back to the previous behavior. This was tricky: I spent quite some time tracking down NPEs related to this, but apparently not enough.

I've been thinking about using some lightweight jetty-or-something container and implementing some test war apps for use in unit tests of the WebCrawler.
It wouldn't take much to write a servlet that would exercise all kinds of bizarre HTTP redirection patterns or serve some broken HTML, but that is something for the future.

Antoni Mylka
ant...@gm...
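
For illustration, the fallback described above (ask the access data whether a URL was already touched, or keep a local set when no access data is available) could be gathered behind one small helper, so that the null check lives in a single place. This is only a rough sketch and not the actual WebCrawler code: TouchStore, InMemoryTouchStore and VisitedTracker are hypothetical names invented for this example, and TouchStore is a simplified stand-in rather than Aperture's real AccessData interface.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical, simplified stand-in for an access-data store (not Aperture's API). */
interface TouchStore {
    boolean isTouched(String url);
    void touch(String url);
    void remove(String url);
}

/** Trivial in-memory store, present only so the sketch can be exercised. */
class InMemoryTouchStore implements TouchStore {
    private final Set<String> touched = new HashSet<String>();

    public boolean isTouched(String url) { return touched.contains(url); }
    public void touch(String url) { touched.add(url); }
    public void remove(String url) { touched.remove(url); }
}

/** Remembers visited URLs, delegating to a store if one exists, else to a local set. */
class VisitedTracker {
    private final TouchStore store;        // may be null
    private final Set<String> crawledUrls; // allocated only when there is no store

    VisitedTracker(TouchStore store) {
        this.store = store;
        this.crawledUrls = (store == null) ? new HashSet<String>(1024) : null;
    }

    /** Records the URL as visited; returns true only if it had not been visited before. */
    boolean markVisited(String url) {
        if (store != null) {
            if (store.isTouched(url)) {
                return false;
            }
            store.touch(url);
            return true;
        }
        return crawledUrls.add(url);
    }

    /** Forgets the URL; safe to call whether or not a store is present. */
    void forget(String url) {
        if (store != null) {
            store.remove(url);
        } else {
            crawledUrls.remove(url);
        }
    }
}
```

With a helper along these lines, the crawling code would call markVisited(url) and forget(url) without checking for null itself, which is the kind of scattered check that the patch above had to add by hand.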