Strange results -- Documents Received

Help
2012-11-08
2013-04-09
  • Hi, thanks for sharing your script and providing support.

    I'm having a bit of a problem crawling some sites, such as http://www.brendansadventures.com and larger blogs.
    While smaller sites give a consistent figure for links followed and documents received, when I crawl the site above, for example, the results vary widely.

    Below are the results of crawls I submitted, all within 1 hour of each other:

    http://www.brendansadventures.com

    Documents received: 353
    Documents received: 282
    Documents received: 1547
    Documents received: 329
    Documents received: 323

    I've been trying to figure out why it does this, but to no avail. I've even tried turning off setPageLimit.
    Do you have any idea why it exhibits this behavior?

    My crawl settings:

    $crawler = new MyCrawler();
    $crawler->setURL($domain);
    $crawler->addContentTypeReceiveRule("#text/html#");
    //$crawler->addURLFilterRule("#(jpg|jpeg|gif|png|css|pdf)$# i");
    //$crawler->setPageLimit(10000,false);
    $crawler->addURLFilterRule("#\.(jpg|jpeg|gif|png|css|pdf)$# i");
    $crawler->enableCookieHandling(true);
    $crawler->setWorkingDirectory("/dev/shm/");
    $crawler->setUrlCacheType(PHPCrawlerUrlCacheTypes::URLCACHE_SQLITE);
    $crawler->enableAggressiveLinkSearch(false);
    //$crawler->obeyRobotsTxt(true);
    //$crawler->setConnectionTimeout(8);
    $crawler->setFollowRedirects(true);
    $crawler->setStreamTimeout(5);
    $crawler->setLinkExtractionTags(array("href"));
    $crawler->goMultiProcessed(8, PHPCrawlerMultiProcessModes::MPMODE_PARENT_EXECUTES_USERCODE);
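    (For context: MyCrawler is just a subclass of PHPCrawler with a handleDocumentInfo override; the skeleton and $domain value below are simplified placeholders rather than my exact code.)

    class MyCrawler extends PHPCrawler
    {
      // Gets called once for every document the crawler receives
      function handleDocumentInfo($DocInfo)
      {
        echo $DocInfo->url."\n";
      }
    }

    $domain = "http://www.brendansadventures.com/"; // placeholder start URL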

     
  • Hi!

    I didn't test your setup so far, but did you try increasing the timeout values (connection timeout and stream timeout)?
    And did you try lowering the number of processes the crawler uses?

    Sometimes the hosting webserver is simply too "weak" or too busy to handle the number of requests the crawler sends, and returns some "501"s; in that case, lowering the number of processes should help (see the sketch at the end of this post).

    Let me know if this works for you.
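    Untested, but the kind of change I mean, keeping the rest of your setup the same (the concrete numbers are only rough examples), would be:

    // Fewer parallel processes and more generous timeouts (example values only)
    $crawler->setConnectionTimeout(20);
    $crawler->setStreamTimeout(20);
    $crawler->goMultiProcessed(3, PHPCrawlerMultiProcessModes::MPMODE_PARENT_EXECUTES_USERCODE);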

     
  • OK, I had a try, but no luck. Reducing the number of simultaneous crawls caused the script to time out. I also tried reducing setConnectionTimeout, but this didn't have any noticeable effect.

    Let me know what else you suggest.
     
  • OK, I will take a deeper look later on.

    Do you know how many pages the site http://www.brendansadventures.com contains in all (and how many the crawler
    should have received in the end)? At least 1547, I guess.

    I've encountered some websites (and servers) that seem to have some kind of "webcrawler protection" and limit
    the number of requests within a specified period from the same IP address, but then the number of received documents
    shouldn't vary as much as it does in your case …

    I'll let you know when I find out more.

    What versions of PHP and phpcrawl are you using, by the way?

    Best Regards!

     
  • Thanks for looking into this for me.
    I'm using PHP version 5.3.14.

    I really want to use it to track the growth of a site, but on some occasions the values returned are not consistent.

    Also, regarding $report->files_received: am I correct in assuming that files_received does not include duplicates?

     
  • Hey, sorry for my late answer.

    OK, I tested your setup on the site http://www.brendansadventures.com
    The site is VERY slow from here, and by default I get a lot of "Socket-stream timed out" errors from
    the crawler.
    But when I increase the stream timeout to 20 seconds ($crawler->setStreamTimeout(20)), everything works fine over here. I didn't get a single error anymore and every page was received successfully (so far; the crawler is still running because the site really is slow, as I said, and it's at around 1000 pages now).

    Did you try to increase this value too? Maybe set it to 100 seconds or even more.

    And if you want to know what the problem is when a page couldn't be received, just insert something like this in your handleDocumentInfo-method:

    if ($DocInfo->error_occured)
    {
      echo $DocInfo->error_string;
    }

    Hope this helps you find the problem.
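    For completeness, a full handleDocumentInfo-method with that check could look roughly like this (just a sketch; a real method would of course do more than echo):

    function handleDocumentInfo($DocInfo)
    {
      // Report every document that could not be received, together with the reason
      if ($DocInfo->error_occured)
      {
        echo "Error receiving ".$DocInfo->url.": ".$DocInfo->error_string."\n";
      }
      else
      {
        echo "Received: ".$DocInfo->url."\n";
      }
    }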

     
  • … and yes, $report->files_received does NOT include duplicates; the crawler only ever receives a page/document once.
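    That counter comes from the process report the crawler returns after it has finished; reading it looks roughly like this (a sketch along the lines of the standard example):

    // After goMultiProcessed() has finished, ask the crawler for its summary
    $report = $crawler->getProcessReport();
    echo "Links followed: ".$report->links_followed."\n";
    echo "Documents received: ".$report->files_received."\n"; // unique documents only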

     

