#105 Redirected Web Documents Removed From AccessData in Recrawls

Milestone: 1.4.0 - bugs
Status: closed-fixed
Labels: crawlers
Priority: 5
Updated: 2009-09-25
Created: 2009-09-18
Private: No

Description:
Web documents that are reachable only via a redirect are removed from the AccessData object used by org.semanticdesktop.aperture.crawler.web.WebCrawler when they are recrawled and have not changed since the previous crawl.

Cause:
The org.semanticdesktop.aperture.crawler.web.WebCrawler.addCrawled(String) method is called for every crawled URL and marks the corresponding document as touched/crawled via org.semanticdesktop.aperture.accessor.AccessData#touch(String). However, if the URL is a redirect and the target document has not changed since the last crawl, the method is not called for the redirection target URL, so the target document is never touched.
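
To make the effect easier to follow, here is a minimal toy model of the behaviour described above. It is not Aperture code: ToyAccessData, removeUntouched() and the example.org URLs are hypothetical stand-ins, and the sketch assumes that entries left untouched at the end of a crawl are dropped from the store.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Toy model only: these classes are hypothetical stand-ins, not Aperture code.
class RedirectBugDemo {

    /** Minimal stand-in for an AccessData store: known IDs plus a "touched" mark. */
    static class ToyAccessData {
        private final Set<String> known = new LinkedHashSet<>();
        private final Set<String> touched = new LinkedHashSet<>();

        void put(String id) { known.add(id); }

        void touch(String id) {
            if (known.contains(id)) {
                touched.add(id);
            }
        }

        /** Drops every ID that was not touched during the crawl and returns the dropped IDs. */
        Set<String> removeUntouched() {
            Set<String> removed = new LinkedHashSet<>(known);
            removed.removeAll(touched);
            known.removeAll(removed);
            touched.clear();
            return removed;
        }
    }

    /** Recrawls a redirect whose target is unchanged; returns the IDs dropped afterwards. */
    static Set<String> recrawlUnchangedRedirect(boolean touchTarget) {
        ToyAccessData accessData = new ToyAccessData();
        String redirectUrl = "http://example.org/old";
        String targetUrl = "http://example.org/new";
        // State left behind by an earlier crawl (assumed for this sketch).
        accessData.put(redirectUrl);
        accessData.put(targetUrl);

        accessData.touch(redirectUrl);   // the crawled URL itself is always marked
        if (touchTarget) {
            accessData.touch(targetUrl); // the step that is missing for unchanged targets
        }
        return accessData.removeUntouched();
    }

    public static void main(String[] args) {
        System.out.println("target not touched: dropped " + recrawlUnchangedRedirect(false));
        System.out.println("target touched:     dropped " + recrawlUnchangedRedirect(true));
    }
}
```

In this model the unchanged redirection target is dropped only when nobody touches it, which mirrors the removal reported in the description.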

Solution:
The attached patch also always touches documents that are the redirection targets of crawled redirects. As a result, redirection targets are no longer removed from AccessData objects, and consequently they are not recrawled in subsequent crawls (unless their content changes).

Discussion

  • Christian Spurk

    Christian Spurk - 2009-09-22

    The previous patch introduced a regression: recrawling of changed redirection targets no longer works because these documents were already touched … The new patch, which I’ll attach in a minute, should fix the original issue without this regression: it touches the redirection target only when it is clear that the document hasn’t changed.

     
  • Christian Spurk

    Christian Spurk - 2009-09-22

    patch fixing the issue without introducing the regression mentioned in my last comment

     
  • Antoni Mylka

    Antoni Mylka - 2009-09-25

    applied in rev 2080, thanks very much

     
  • Antoni Mylka

    Antoni Mylka - 2009-09-25
    • milestone: 893322 --> 1.4.0 - bugs
    • assigned_to: nobody --> mylka
    • status: open --> closed-fixed
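
The difference between the two patches discussed in this thread can be sketched as follows. This is a hypothetical simplification, not the code applied in rev 2080: the Policy enum, the method name and the URL are invented for illustration, and the sketch assumes that a URL which has already been touched in the current crawl is skipped when the crawler reaches it.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch; names and control flow are assumptions, not the actual patch.
class RedirectTouchPolicy {

    enum Policy { ALWAYS_TOUCH_TARGET, TOUCH_ONLY_IF_UNCHANGED }

    /**
     * Simulates resolving one redirect and then visiting its target.
     * Returns true if the target's content is fetched again in this crawl.
     */
    static boolean targetIsRecrawled(Policy policy, boolean targetChanged) {
        Set<String> touched = new HashSet<>();
        String target = "http://example.org/new";

        // Step 1: the redirect is resolved and the policy decides whether to touch the target.
        if (policy == Policy.ALWAYS_TOUCH_TARGET || !targetChanged) {
            touched.add(target);
        }

        // Step 2: the target is visited. Assumption: touched URLs count as done and are skipped.
        if (touched.contains(target)) {
            return false;
        }
        touched.add(target); // normal processing touches the document after fetching it
        return true;
    }

    public static void main(String[] args) {
        for (Policy policy : Policy.values()) {
            for (boolean changed : new boolean[] {false, true}) {
                System.out.println(policy + ", changed=" + changed
                        + " -> recrawled=" + targetIsRecrawled(policy, changed));
            }
        }
    }
}
```

Under these assumptions, always touching the target keeps it in the store but also suppresses the refetch of a changed target, while touching only unchanged targets keeps both behaviours intact.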
     
