Hi,
The setPageLimit is not being respected. It's set to 5000, but the crawler always goes far over that limit.
Here is my setup below; is there something I'm doing wrong?
$crawler = new MyCrawler();
$crawler->setURL($base);
$crawler->addContentTypeReceiveRule("#text/html#");
$crawler->addURLFilterRule("#\.(jpg|jpeg|gif|png|css|pdf)$# i");
$crawler->enableCookieHandling(true);
$crawler->setWorkingDirectory("/dev/shm/");
$crawler->setUrlCacheType(PHPCrawlerUrlCacheTypes::URLCACHE_SQLITE);
$crawler->enableAggressiveLinkSearch(false);
//$crawler->obeyRobotsTxt(true);
//$crawler->setConnectionTimeout(8);
$crawler->setFollowRedirects(true);
$crawler->setStreamTimeout(5);
$crawler->setLinkExtractionTags(array("href"));
$crawler->setUserAgentString("Mozilla/5.0 (Windows NT 6.1; rv:12.0) Gecko/20120403211507 Firefox/12.0");
$crawler->setPageLimit(5000); // <- not respected..
$crawler->goMultiProcessed(8, PHPCrawlerMultiProcessModes::MPMODE_PARENT_EXECUTES_USERCODE);
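
One way to see what the crawler actually counts against the limit is to log every document the handler receives. This is only a minimal sketch, assuming MyCrawler extends PHPCrawler as in the standard PHPCrawl examples; the $docs_handled counter and the echo output are purely illustrative:

class MyCrawler extends PHPCrawler
{
  public $docs_handled = 0;

  function handleDocumentInfo(PHPCrawlerDocumentInfo $DocInfo)
  {
    // Count and log every document the crawler hands back to user code
    $this->docs_handled++;
    echo $this->docs_handled . ": " . $DocInfo->url . " (" . $DocInfo->http_status_code . ")\n";
  }
}

// After the crawl, compare the crawler's own report with the configured limit
$report = $crawler->getProcessReport();
echo "Links followed:     " . $report->links_followed . "\n";
echo "Documents received: " . $report->files_received . "\n";

As far as I know, in MPMODE_PARENT_EXECUTES_USERCODE the handler runs in the parent process, so a plain member counter like this should stay consistent; in the other multi-process mode it would not be shared between the child processes.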
Hi!
Strange, I just made a little test with exactly your setup and a limit of 500 pages on a random website,
and it all works fine over here:
Links followed: 500
Documents received: 464
Bytes received: 42682352 bytes
Does this happen with all sites you are crawling, or just with a particular one?
What happens if you set a smaller limit like 50? Does the crawler completely ignore
the setPageLimit setting on your end in that case as well?
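
For that smaller-limit test, a stripped-down single-process run along these lines would also show whether the overshoot only appears together with goMultiProcessed(). This is just a sketch; example.com stands in for a real target site:

$test = new MyCrawler();
$test->setURL("http://www.example.com/");       // placeholder URL for the test
$test->addContentTypeReceiveRule("#text/html#");
$test->setPageLimit(50);                        // small limit, as suggested above
$test->go();                                    // single-process crawl instead of goMultiProcessed()

$report = $test->getProcessReport();
echo "Links followed:     " . $report->links_followed . "\n";
echo "Documents received: " . $report->files_received . "\n";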