I have noticed that Aperture does not handle HTTP redirects properly during a persistent incremental crawl (that is, when the AccessData interface is implemented by ModelAccessData).
I think ModelAccessData does not implement resource "touching" correctly:
* touch / isTouched work by storing and retrieving a timestamp field in permanent storage
* but every put call also writes the timestamp field (and therefore "touches" the resource), which is inconsistent with the other AccessData implementations (e.g. AccessDataImpl, where put does not store the touch attribute, i.e. the timestamp key with the crawlIdentifier value, in its in-memory structure)
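To make the difference concrete, here is a minimal sketch of the two put behaviours. It is not Aperture code: the class TouchSketchStore, the "timestamp" key name and the putAlsoTouches flag are all hypothetical and only model the behaviour described above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for an AccessData store (not Aperture's API).
final class TouchSketchStore {
    // Assumed key name; the real key used by Aperture may differ.
    static final String TIMESTAMP_KEY = "timestamp";

    private final Map<String, Map<String, String>> entries = new HashMap<>();
    private final String crawlId;
    private final boolean putAlsoTouches;

    TouchSketchStore(String crawlId, boolean putAlsoTouches) {
        this.crawlId = crawlId;
        this.putAlsoTouches = putAlsoTouches;
    }

    // Stores one metadata value; optionally writes the timestamp as a side effect.
    void put(String id, String key, String value) {
        Map<String, String> entry = entries.computeIfAbsent(id, k -> new HashMap<>());
        entry.put(key, value);
        if (putAlsoTouches) {
            entry.put(TIMESTAMP_KEY, crawlId); // the side effect this report is about
        }
    }

    // Explicitly marks the resource as seen in the current crawl.
    void touch(String id) {
        entries.computeIfAbsent(id, k -> new HashMap<>()).put(TIMESTAMP_KEY, crawlId);
    }

    // A resource counts as touched if its timestamp matches the current crawl.
    boolean isTouched(String id) {
        Map<String, String> entry = entries.get(id);
        return entry != null && crawlId.equals(entry.get(TIMESTAMP_KEY));
    }

    public static void main(String[] args) {
        String url = "http://example.org/page";

        TouchSketchStore touching = new TouchSketchStore("crawl-1", true);
        touching.put(url, "date", "1234567890");
        System.out.println("put also touches: isTouched = " + touching.isTouched(url));

        TouchSketchStore plain = new TouchSketchStore("crawl-1", false);
        plain.put(url, "date", "1234567890");
        System.out.println("put does not touch: isTouched = " + plain.isTouched(url));
    }
}
```

With putAlsoTouches set to true the sketch behaves the way I believe ModelAccessData does; with false it behaves like AccessDataImpl as described above, where only an explicit touch marks the resource.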
There is _at least_ one use case where you do not want to "touch" a resource while putting metadata into the AccessData storage:
HTTPAccessor.get can follow redirects, and it updates the AccessData storage for the redirect target pages in its updateAccessData call (line 212).
In this case you want to add the DATE metadata for the redirect target page, but you do not want the page to count as touched, because
- a preventive check in WebCrawler.processQueue (line 350) tests whether the page has already been reported to the CrawlerHandler and, if so, ignores it
As a result of these conditions, Aperture does not report HTTP redirects.
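The sequence can be reproduced with a toy model. This is only a sketch of the two steps as I described them, not the actual HTTPAccessor or WebCrawler code; the class RedirectFlowSketch, its method and the example URL are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model of the redirect scenario; names and structure are illustrative only.
final class RedirectFlowSketch {

    // Returns true if the redirect target ends up being reported to the handler.
    static boolean redirectTargetReported(boolean putAlsoTouches) {
        Set<String> touched = new HashSet<>();          // resources touched in this crawl
        Map<String, String> dates = new HashMap<>();    // DATE metadata per resource
        String target = "http://example.org/new-location";

        // Step 1 (accessor): the redirect is followed and DATE is stored for the target.
        dates.put(target, "1234567890");
        if (putAlsoTouches) {
            touched.add(target); // side effect of put in the persistent implementation
        }

        // Step 2 (crawler): preventive check, skip anything that already looks reported.
        if (touched.contains(target)) {
            return false; // the target is silently ignored
        }
        touched.add(target);
        return true; // the target is reported to the handler
    }

    public static void main(String[] args) {
        System.out.println("put touches:        reported = " + redirectTargetReported(true));
        System.out.println("put does not touch: reported = " + redirectTargetReported(false));
    }
}
```

In this model the redirect target is only reported when storing the DATE metadata leaves the touched state alone.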
I propose removing the unwanted touching from ModelAccessData.put (see patch).