
$crawler->setPageLimit not being respected

Help
2012-09-29
2013-04-09
  • Nobody/Anonymous

    hi,

    The setPageLimit is not being respected. It's set to 5000, but it always goes far over this limit ….

    Here is my setup below, is there something I'm doing wrong?

    $crawler = new MyCrawler();
    $crawler->setURL($base);
    $crawler->addContentTypeReceiveRule("#text/html#");
    $crawler->addURLFilterRule("#\.(jpg|jpeg|gif|png|css|pdf)$# i");
    $crawler->enableCookieHandling(true);
    $crawler->setWorkingDirectory("/dev/shm/");
    $crawler->setUrlCacheType(PHPCrawlerUrlCacheTypes::URLCACHE_SQLITE);
    $crawler->enableAggressiveLinkSearch(false);
    //$crawler->obeyRobotsTxt(true);
    //$crawler->setConnectionTimeout(8);
    $crawler->setFollowRedirects(true);
    $crawler->setStreamTimeout(5);
    $crawler->setLinkExtractionTags(array("href"));
    $crawler->setUserAgentString("Mozilla/5.0 (Windows NT 6.1; rv:12.0) Gecko/20120403211507 Firefox/12.0");
    $crawler->setPageLimit(5000); // <- not respected
    $crawler->goMultiProcessed(8, PHPCrawlerMultiProcessModes::MPMODE_PARENT_EXECUTES_USERCODE);
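
    (For reference, a minimal sketch of what the MyCrawler class used above could look like; this is an assumption based on the standard PHPCrawl examples, where the subclass just overrides handleDocumentInfo:)

    require_once("libs/PHPCrawler.class.php");

    class MyCrawler extends PHPCrawler
    {
      // Gets called for every document the crawler receives
      function handleDocumentInfo($DocInfo)
      {
        echo $DocInfo->url . " (" . $DocInfo->http_status_code . ")\n";
      }
    }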

     

    Uwe Hunfeld - 2012-10-01

    Hi!

    Strange, I just made a little test with exactly your setup and a limit of 500 pages on a random website,
    and everything works fine over here:

    Links followed: 500
    Documents received: 464
    Bytes received: 42682352 bytes

    Does this happen with every site you crawl, or just with a particular one?
    What happens if you set a smaller limit like 50? Does the crawler completely ignore
    the setPageLimit setting on your side?
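
    (A minimal way to run that check, assuming the same MyCrawler class and start URL as above; this is only a sketch with a limit of 50 and the plain single-process go(), so you can see whether the limit is only ignored in multi-process mode:)

    $crawler = new MyCrawler();
    $crawler->setURL($base);
    $crawler->addContentTypeReceiveRule("#text/html#");
    $crawler->setPageLimit(50);   // small limit for the test
    $crawler->go();               // single-process run

    $report = $crawler->getProcessReport();
    echo "Links followed: " . $report->links_followed . "\n";
    echo "Documents received: " . $report->files_received . "\n";
    echo "Bytes received: " . $report->bytes_received . " bytes\n";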

     
