Web documents that are reachable only via a redirect are removed from AccessData objects by the org.semanticdesktop.aperture.crawler.web.WebCrawler when they are recrawled but have not changed since the previous crawl.
The org.semanticdesktop.aperture.crawler.web.WebCrawler.addCrawled(String) method is called for every crawled URL. However, if the URL is a redirect, the method is not called for the redirection target URL when the target document has not changed since the last crawl. Because addCrawled(String) is what marks documents as touched/crawled (via org.semanticdesktop.aperture.accessor.AccessData#touch(String)), unchanged redirection target documents are never touched.
The attached patch also touches documents that are the redirection targets of crawled redirects. As a result, redirection targets are no longer removed from AccessData objects and are consequently not recrawled in subsequent crawls (unless their content changes).
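
The behaviour described above can be illustrated with a small, self-contained model. This is only a sketch and not the attached patch or the Aperture code: SketchAccessData, SketchCrawler, recrawlUnchanged, finishCrawl and the touchRedirectTargets flag are hypothetical names invented for the example, and the assumption that untouched entries are dropped at the end of a crawl is a simplification of what the ticket describes.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical, simplified stand-in for an AccessData-like store (not the Aperture API). */
class SketchAccessData {
    private final Set<String> known = new HashSet<>();
    private final Set<String> touched = new HashSet<>();

    /** Stores a newly crawled document and marks it as touched. */
    void put(String url) {
        known.add(url);
        touched.add(url);
    }

    /** Marks an already known document as seen during the current crawl. */
    void touch(String url) {
        if (known.contains(url)) {
            touched.add(url);
        }
    }

    boolean isKnown(String url) {
        return known.contains(url);
    }

    /** Assumed end-of-crawl step: forget every untouched entry, then reset the marks. */
    void finishCrawl() {
        known.retainAll(touched);
        touched.clear();
    }
}

/** Hypothetical crawler that models only the recrawl of an unchanged redirect. */
class SketchCrawler {
    private final SketchAccessData accessData;
    private final Map<String, String> redirects; // redirect URL -> target URL
    private final boolean touchRedirectTargets;  // false = reported bug, true = patched behaviour

    SketchCrawler(SketchAccessData accessData, Map<String, String> redirects,
                  boolean touchRedirectTargets) {
        this.accessData = accessData;
        this.redirects = redirects;
        this.touchRedirectTargets = touchRedirectTargets;
    }

    /** Recrawls a URL whose content (and redirect target, if any) has not changed. */
    void recrawlUnchanged(String url) {
        accessData.touch(url); // the crawled URL itself is always marked
        String target = redirects.get(url);
        if (target != null && touchRedirectTargets) {
            accessData.touch(target); // the fix: mark the unchanged redirect target as well
        }
    }
}
```

In this model, if both a redirect URL and its target were stored in a first crawl, a second crawl with touchRedirectTargets set to false followed by finishCrawl() leaves only the redirect URL known, which mirrors the reported symptom. With the flag set to true, both entries survive.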