
#137 Redirect fails for initial URL

Status: open
Owner: nobody
Labels: crawlers (23)
Priority: 5
Created: 2010-12-27
Updated: 2010-12-27
Creator: Jack Krupansky
Private: No

Aperture does have logic to handle URL redirects properly, but it fails for the special case of the initial URL. I have traced through the code and can see that the redirect is being processed, but special logic then prevents the redirected page from being processed when the URL was the initial URL.

This is easy to reproduce using the webcrawler example: an initial URL of http://cnn.com stops at that first page, while an initial URL of http://www.cnn.com proceeds to crawl all links from the initial page.

Note: Many sites will be crawled fine, since they use a stealth redirect that Aperture never even sees. This bug is for sites that use a traditional 301 redirect.
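
For reference, one way to confirm which kind of redirect a site answers with is to issue a request without following redirects and inspect the raw response. The standalone Java snippet below is not Aperture code, and the 301 target shown in the comments is simply what cnn.com returned at the time:

    import java.net.HttpURLConnection;
    import java.net.URL;

    public class RedirectCheck {
        public static void main(String[] args) throws Exception {
            // Disable automatic redirect following so we see the raw
            // response that a crawler sees for the initial URL.
            HttpURLConnection conn =
                (HttpURLConnection) new URL("http://cnn.com").openConnection();
            conn.setInstanceFollowRedirects(false);
            conn.setRequestMethod("HEAD");

            // A traditional redirect shows up as a 3xx status plus a
            // Location header (e.g., 301 -> http://www.cnn.com/); a
            // stealth redirect would return 200 with the final content.
            System.out.println(conn.getResponseCode() + " -> "
                + conn.getHeaderField("Location"));
            conn.disconnect();
        }
    }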

Attached is Windows console output for both cnn.com and www.cnn.com.

I tested using 1.5, but this same problem existed in 1.4.

Discussion

  • Jack Krupansky
    2010-12-28

    Oops... let me put this issue on hold while I research it some more. It turns out that ExampleWebCrawler defaults to an include boundary that matches the original URL, which of course will not tend to match a redirected URL. If a more permissive include is given on the command line (e.g., "-include .*"), the crawl does indeed proceed beyond the initial URL's web page (see the sketch at the end of this comment).

    That said, I have seen the redirect problem in a real app... I am just trying to reproduce it with a simpler test case. I'll update this issue when I identify a more precise test-case scenario.
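
    To make the boundary mismatch concrete, here is a standalone sketch (not Aperture's actual filter code; the pattern construction is just a guess at the default behavior described above):

        import java.util.regex.Pattern;

        public class IncludeBoundaryDemo {
            public static void main(String[] args) {
                // Hypothetical default: an include pattern derived from the seed URL.
                Pattern include =
                    Pattern.compile(Pattern.quote("http://cnn.com") + ".*");

                // The seed URL matches the boundary...
                System.out.println(include.matcher("http://cnn.com").matches());      // true
                // ...but the 301 target does not, so the crawl stops there.
                System.out.println(include.matcher("http://www.cnn.com/").matches()); // false
            }
        }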

  • Jack Krupansky
    2010-12-28

    Okay, I sorted out the mystery... this bug relates to how the triplestore is used, so the "--accessDataStore" option is needed on the command line of the webcrawler example to reproduce the real problem. My full command line (on Windows) is:

    webcrawler.bat -v --accessDataStore ds -include .* http://cnn.com

    NOTE: Make sure to delete "ds" before each run; otherwise it will only fail on the first run and then succeed on subsequent runs.

    The essence of the problem is that ModelAccessData.put also adds the "timestamp" that indicates that the URL has been "touched". This happens when HttpAccessor.updateAccessData adds the "date" triple as part of getDataObjectIfModified for the redirected URL, which is in turn called from WebCrawler.processQueue. Then, when processQueue calls isCrawled for the redirected URL, "touched" is true because of that "date" side-effect statement, so the page is ignored, and the outer loop then ends since no additional pages were added.
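
    A toy model of that sequencing (plain Java; the names mirror the Aperture classes, but this is an illustration of the side effect, not the real implementation) may make it easier to follow:

        import java.util.HashMap;
        import java.util.HashSet;
        import java.util.Map;
        import java.util.Set;

        public class TouchedSideEffectDemo {
            // Stand-ins for the triplestore-backed access data.
            static Map<String, String> store = new HashMap<>();
            static Set<String> touched = new HashSet<>();

            // Models ModelAccessData.put: storing any key also marks the
            // URL as "touched" -- the side effect described above.
            static void put(String url, String key, String value) {
                store.put(url + "#" + key, value);
                touched.add(url);
            }

            // Models the isCrawled check in WebCrawler.processQueue.
            static boolean isCrawled(String url) {
                return touched.contains(url);
            }

            public static void main(String[] args) {
                String redirected = "http://www.cnn.com/";

                // getDataObjectIfModified stores the "date" triple for the
                // redirect target while fetching it...
                put(redirected, "date", "2010-12-28");

                // ...so the later isCrawled check sees the URL as already
                // visited, the page is skipped, and the crawl queue drains.
                System.out.println(isCrawled(redirected)); // true -> page ignored
            }
        }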

    I'm still trying to get my head around all of these levels of processing, so I don't have a proposed fix at this time. It may be just a few lines of code, but even that could require a fair number of use cases to be tested.