
$crawler->setPageLimit not being respected

Help
2012-09-29
2013-04-09
  • Nobody/Anonymous

    hi,

    The setPageLimit is not being respected. It's set to 5000, but it always goes far over this limit ….

    Here is my setup below, is there something I'm doing wrong?

    $crawler = new MyCrawler();
    $crawler->setURL($base);
    $crawler->addContentTypeReceiveRule("#text/html#");
    $crawler->addURLFilterRule("#\.(jpg|jpeg|gif|png|css|pdf)$# i");
    $crawler->enableCookieHandling(true);
    $crawler->setWorkingDirectory("/dev/shm/");
    $crawler->setUrlCacheType(PHPCrawlerUrlCacheTypes::URLCACHE_SQLITE);
    $crawler->enableAggressiveLinkSearch(false);
    //$crawler->obeyRobotsTxt(true);
    //$crawler->setConnectionTimeout(8);
    $crawler->setFollowRedirects(true);
    $crawler->setStreamTimeout(5);
    $crawler->setLinkExtractionTags(array("href"));
    $crawler->setUserAgentString("Mozilla/5.0 (Windows NT 6.1; rv:12.0) Gecko/20120403211507 Firefox/12.0");
    $crawler->setPageLimit(5000); // <- not respected
    $crawler->goMultiProcessed(8, PHPCrawlerMultiProcessModes::MPMODE_PARENT_EXECUTES_USERCODE);
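
    (For reference, a minimal sketch of what the MyCrawler class used above could look like; this is an assumption based on the standard PHPCrawl examples, where the subclass just overrides handleDocumentInfo:)

    require_once("libs/PHPCrawler.class.php");

    class MyCrawler extends PHPCrawler
    {
      // Gets called for every document the crawler receives
      function handleDocumentInfo($DocInfo)
      {
        echo $DocInfo->url . " (" . $DocInfo->http_status_code . ")\n";
      }
    }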

     

    Uwe Hunfeld - 2012-10-01

    Hi!

    Strange, I just made a little test with exactly your setup and a limit of 500 pages on a random website,
    and everything works fine over here:

    Links followed: 500
    Documents received: 464
    Bytes received: 42682352 bytes

    Does this happen with every site you crawl, or just with a particular one?
    What happens if you set a smaller limit like 50? Does the crawler completely ignore
    the setPageLimit setting on your side?
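
    (A minimal way to run that check, assuming the same MyCrawler class and start URL as above; this is only a sketch with a limit of 50 and the plain single-process go(), so you can see whether the limit is only ignored in multi-process mode:)

    $crawler = new MyCrawler();
    $crawler->setURL($base);
    $crawler->addContentTypeReceiveRule("#text/html#");
    $crawler->setPageLimit(50);   // small limit for the test
    $crawler->go();               // single-process run

    $report = $crawler->getProcessReport();
    echo "Links followed: " . $report->links_followed . "\n";
    echo "Documents received: " . $report->files_received . "\n";
    echo "Bytes received: " . $report->bytes_received . " bytes\n";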

     
