#55 Page request limit 1 doesn't redirect to 301 site

Status: closed-duplicate
Labels: None
Priority: 5
Updated: 2014-01-18
Created: 2013-11-08
Creator: adamtk
Private: No

With "follow redirects until content" enabled, should the crawler automatically follow a 301 response to its target URL when the page limit is set to 1?
At the moment, my crawler returns the result for the redirecting page itself, rather than following the redirect and returning the target page:
{"crawler_datetime":"11\/09\/2013 07:37:56 am","crawler_datetime_timezone":"Sydney\/Adelaide","crawler_results":[{"http:\/\/www.apple.com.au\/":{"url":"http:\/\/www.apple.com.au\/","title":"Found","http_status_code":301,"meta_attributes":[],"response_header":{"0":"HTTP\/1.1 301 Moved Permanently","Age":"82","Date":"Fri, 08 Nov 2013 21:13:00 GMT","Connection":"Keep-Alive","Via":"NS-CACHE-9.2: 1","ETag":"\"KXMDAJBOCGQOLSLVS\"","Server":"somethingNice.","Referer":"http:\/\/apple.com\/","Location":"http:\/\/www.apple.com\/au\/","Content-type":"text\/html","Content-length":"295"},"referer_url":null,"bytes_received":295,"links":{"apple.com":{"links":[{"url":"http:\/\/www.apple.com\/au\/","link_text":"","page_code":"","is_redirect_url":true}],"links_displayed":1,"links_found":1}},"domains_found":1,"domains_displayed":1,"links_found":1,"links_displayed":1,"errors":null}}]}
I tested with a basic crawler setup and it behaves the same way. If I increase the page limit, it starts crawling the redirect target.
Is this the expected behaviour, or is there a way to make the initial request follow the redirect and crawl the target URL within the same request?
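
Until the behaviour is settled, a result like the one above can at least be checked for a redirect that was not followed. The following is a minimal sketch in Python, assuming the output keeps the structure quoted above; the JSON is abridged to the fields used, and the function name `unresolved_redirects` is hypothetical rather than part of any crawler API.

```python
import json

# Abridged copy of the crawler output quoted above (only the fields used here).
raw = r'''{"crawler_results":[{"http:\/\/www.apple.com.au\/":{"url":"http:\/\/www.apple.com.au\/","http_status_code":301,"response_header":{"0":"HTTP\/1.1 301 Moved Permanently","Location":"http:\/\/www.apple.com\/au\/"}}}]}'''

REDIRECT_CODES = (301, 302, 303, 307, 308)


def unresolved_redirects(report):
    """Return (source_url, target_url) pairs for results that ended on a redirect."""
    pairs = []
    for entry in report.get("crawler_results", []):
        for page in entry.values():
            status = page.get("http_status_code")
            location = page.get("response_header", {}).get("Location")
            if status in REDIRECT_CODES and location:
                pairs.append((page["url"], location))
    return pairs


redirects = unresolved_redirects(json.loads(raw))
print(redirects)
```

Each returned pair names a page whose result is the redirect response itself, together with the URL that would have to be crawled to obtain the real content.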

Reply:
OK, I see what you mean, good point!
Since it's called "setPageLimit" and not "setRequestLimit", I think you are right.
But I have to check that.
Could you maybe open another bug report for this to keep things separate?
Thanks! And thanks for the report, by the way!

Discussion

  • adamtk - 2013-11-08

    Perhaps add an option for setPageRequestLimit, so you could set the number of redirects it will follow in the same request?

     
  • Anonymous - 2013-11-13

    Hi adamtk,

    Just to let you know: I can't answer your email. sf.net says "user unknown" and the mail could not be delivered. Strange.

     
  • adamtk - 2013-11-13

    mmm, strange indeed. Will have to check the mail logs.

    Did you want to send a message through http://www.siteinquisitor.com/contact ? Then I can reply with my email address.

     
  • Uwe Hunfeld - 2014-01-18

    Just moved this to a clean, new bug report, see https://sourceforge.net/p/phpcrawl/bugs/61/

     
  • Uwe Hunfeld - 2014-01-18
    • status: open --> closed-duplicate
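
The distinction raised in this thread, a limit on pages versus a limit on requests, plus the suggested separate cap on redirects followed per page, can be illustrated with a small simulation. This is only a sketch in Python under stated assumptions: `FAKE_WEB`, `fetch_page`, `crawl`, and `max_redirects` are hypothetical names invented for illustration, no network access happens, and none of this is PHPCrawl's actual API or implementation.

```python
# Hypothetical in-memory "web": URL -> (status code, redirect target or None).
FAKE_WEB = {
    "http://www.apple.com.au/": (301, "http://www.apple.com/au/"),
    "http://www.apple.com/au/": (200, None),
}

REDIRECT_CODES = (301, 302, 303, 307, 308)


def fetch_page(url, max_redirects):
    """Follow up to max_redirects redirects; return (final_url, status, requests_made)."""
    requests_made = 0
    while True:
        status, location = FAKE_WEB[url]
        requests_made += 1
        # The k-th redirect is followed only while k <= max_redirects.
        if status in REDIRECT_CODES and location and requests_made <= max_redirects:
            url = location
            continue
        return url, status, requests_made


def crawl(start_url, page_limit, max_redirects=5):
    """Count pages, not requests: a redirect chain consumes one page slot."""
    pages, total_requests = [], 0
    queue = [start_url]
    while queue and len(pages) < page_limit:
        final_url, status, n = fetch_page(queue.pop(0), max_redirects)
        total_requests += n
        pages.append((final_url, status))
    return pages, total_requests


followed = crawl("http://www.apple.com.au/", page_limit=1)
not_followed = crawl("http://www.apple.com.au/", page_limit=1, max_redirects=0)
print(followed)
print(not_followed)
```

With a page limit of 1 and redirects allowed, the single page slot is filled by the redirect target after two requests. With `max_redirects=0`, the same page limit yields the 301 response itself, which matches the behaviour reported in this ticket.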
     
