#32 Cannot crawl some websites

closed-duplicate
nobody
None
5
2012-12-17
2012-12-07
Anonymous
No

Hi, the library works well for most websites, but I cannot crawl this store: www.uncommongoods.com. The crawl always stops after 3 URLs of this store have been crawled.

Discussion

  • Uwe Hunfeld

    Uwe Hunfeld - 2012-12-08

    Hi!

    I haven't tested it yet, but did you try increasing the stream timeout and the connection timeout?
    (See the FAQ, first entry: http://phpcrawl.cuab.de/faq.html)

     
  • Anonymous

    Anonymous - 2012-12-15

    Hi, my crawler config always sets:
    $crawler->setConnectionTimeout(500);
    $crawler->setStreamTimeout(500);

    But it doesn't work.

     
  • Uwe Hunfeld

    Uwe Hunfeld - 2012-12-17

    OK, I figured it out.

    For the first page (www.uncommongoods.com), the problem is simply that an "Accept" directive is missing from the header phpcrawl sends. The hosting web server doesn't deliver anything without this header (content-length: 0 bytes).

    Please use the attached file/class as patch (simply put it in your "libs"-path of the phpcrawl package and overwrite the existing one).
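
To illustrate what "missing Accept directive" means at the header level, here is a minimal sketch. It is not the patched phpcrawl class; `buildRequestHeader()` is a hypothetical helper, and the `*/*` value and the HTTP/1.0 request line are assumptions chosen only for the example.

```php
<?php
// Hypothetical helper, NOT part of phpcrawl: builds the raw header block
// for a simple GET request, including an "Accept" directive.
function buildRequestHeader(string $host, string $path, string $accept = '*/*'): string
{
    $lines = [
        "GET {$path} HTTP/1.0",
        "Host: {$host}",
        "Accept: {$accept}",   // the directive the server reportedly insists on
        "Connection: close",
    ];

    // Each header line ends with CRLF; an empty line terminates the block.
    return implode("\r\n", $lines) . "\r\n\r\n";
}

$header = buildRequestHeader('www.uncommongoods.com', '/');
echo $header;
```

Sending the same request with and without the `Accept:` line and comparing the response bodies would be one way to confirm the behaviour described above.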

    For the second page it's a little more complicated; I don't have a solution yet.
    It starts with a request for http://www.bobbyberkhome.com/, followed by a redirect to some other sites (where an authorization cookie is sent) and then a redirect BACK to http://www.bobbyberkhome.com/. But http://www.bobbyberkhome.com/ was already crawled, so the crawler stops at this point.
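
The redirect problem can be reproduced with a small simulation. This is only a sketch of the general "skip already-seen URLs" logic, not phpcrawl's actual code; the function `crawl()`, the redirect table, and the intermediate host `auth.example` are made up for illustration.

```php
<?php
// Hypothetical simulation, NOT phpcrawl code: a crawl loop that refuses
// to request any URL it has already requested once.
function crawl(string $startUrl, array $redirects): array
{
    $seen  = [];           // URLs already requested
    $log   = [];           // what happened, in order
    $queue = [$startUrl];

    while ($queue) {
        $url = array_shift($queue);

        if (isset($seen[$url])) {
            $log[] = "skip {$url} (already crawled)";
            continue;
        }

        $seen[$url] = true;
        $log[] = "request {$url}";

        // Follow a redirect if the simulated server issues one.
        if (isset($redirects[$url])) {
            $queue[] = $redirects[$url];
        }
    }

    return $log;
}

// Assumed redirect chain: start page -> cookie-setting site -> start page.
$redirects = [
    'http://www.bobbyberkhome.com/'  => 'http://auth.example/set-cookie',
    'http://auth.example/set-cookie' => 'http://www.bobbyberkhome.com/',
];

$log = crawl('http://www.bobbyberkhome.com/', $redirects);
print_r($log);
```

The final log entry is a skip of the start URL, so the page content that would be delivered after the cookie is set is never fetched and no further links are found.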

    Have to think about a solution.

    I'm closing this bug report and will open two separate ones.

    THANKS for the report!

     
  • Uwe Hunfeld

    Uwe Hunfeld - 2012-12-17
    • status: open --> closed-duplicate
     
  • Uwe Hunfeld

    Uwe Hunfeld - 2012-12-17

    Patched class

     

