With "follow redirects until content" enabled, should a request that returns a 301 automatically continue to the redirect target when the page limit is set to 1?
At the moment, my Crawler returns the results of the redirect page, rather than following the link and returning that page:
{"crawler_datetime":"11\/09\/2013 07:37:56 am","crawler_datetime_timezone":"Sydney\/Adelaide","crawler_results":[{"http:\/\/www.apple.com.au\/":{"url":"http:\/\/www.apple.com.au\/","title":"Found","http_status_code":301,"meta_attributes":[],"response_header":{"0":"HTTP\/1.1 301 Moved Permanently","Age":"82","Date":"Fri, 08 Nov 2013 21:13:00 GMT","Connection":"Keep-Alive","Via":"NS-CACHE-9.2: 1","ETag":"\"KXMDAJBOCGQOLSLVS\"","Server":"somethingNice.","Referer":"http:\/\/apple.com\/","Location":"http:\/\/www.apple.com\/au\/","Content-type":"text\/html","Content-length":"295"},"referer_url":null,"bytes_received":295,"links":{"apple.com":{"links":[{"url":"http:\/\/www.apple.com\/au\/","link_text":"","page_code":"","is_redirect_url":true}],"links_displayed":1,"links_found":1}},"domains_found":1,"domains_displayed":1,"links_found":1,"links_displayed":1,"errors":null}}]}
I tested with a basic version of the Crawler and it does the same thing. If I increase the page limit, it will start scraping the new site.
Is this expected behaviour, or is there a way to get the initial request to crawl the redirected URL in the same request?
Reply:
OK, I see what you mean, good point!
Since it's called "setPageLimit" and not "setRequestLimit", I think you're right.
But I have to check that.
Could you maybe open another bug report for this to keep things separate?
Thanks! And thanks for the report btw!
Anonymous
Perhaps add an option for setPageRequestLimit, so you could set the number of redirects it will follow in the same request?
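To make the difference between a page limit and a request limit concrete, here is a rough toy model in Python. It is not PHPCrawl code and does not reflect how PHPCrawl is implemented: `FAKE_WEB`, `crawl`, `count_redirects` and `max_redirects` are all hypothetical names made up for this sketch, and it only walks a single redirect chain instead of following links.

```python
# Toy model only: an in-memory "web" mapping URL -> (status code, redirect target).
# The two URLs are taken from the crawler output above; everything else is assumed.
FAKE_WEB = {
    "http://www.apple.com.au/": (301, "http://www.apple.com/au/"),
    "http://www.apple.com/au/": (200, None),
}

REDIRECT_CODES = (301, 302, 303, 307, 308)


def crawl(start_url, page_limit, count_redirects, max_redirects=5):
    """Return the (url, status) responses seen before a limit is reached.

    count_redirects=True  -> every response counts against page_limit
                             (the limit behaves like a request limit).
    count_redirects=False -> only non-redirect responses count
                             (the limit behaves like a page limit).
    max_redirects caps how many redirects are followed in one chain
    (hypothetical stand-in for the option suggested above).
    """
    results = []
    counted = 0
    redirects_followed = 0
    url = start_url
    while url is not None and counted < page_limit:
        status, location = FAKE_WEB[url]
        results.append((url, status))
        if status in REDIRECT_CODES and location is not None:
            if count_redirects:
                counted += 1
            redirects_followed += 1
            if redirects_followed > max_redirects:
                break  # redirect cap reached, stop following the chain
            url = location
        else:
            counted += 1
            url = None  # content received, nothing more to follow here
    return results


if __name__ == "__main__":
    # Redirect counted as a page: only the 301 response comes back.
    print(crawl("http://www.apple.com.au/", page_limit=1, count_redirects=True))
    # Redirect not counted: the 301 is followed and the target is returned too.
    print(crawl("http://www.apple.com.au/", page_limit=1, count_redirects=False))
```

With `count_redirects=True` the model stops at the 301, which matches the output posted above. With `count_redirects=False` the same limit of 1 still reaches the redirected page, and `max_redirects` bounds how long a redirect chain may get.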
Hi adamtk,
Just to let you know: I can't answer your email. sf.net says "user unknown" and the mail could not be delivered. Strange.
Hmm, strange indeed. I will have to check the mail logs.
Did you want to send a message through http://www.siteinquisitor.com/contact ? Then I can reply with my email address.
Just moved this to a clean, new bugreport, see https://sourceforge.net/p/phpcrawl/bugs/61/