Hello,
I seem to be getting the following error when I use multiple crawler objects:
"Fatal error: Call to undefined method stdClass::receivePage()"
The first $crawler object crawls fine; however, the error above is thrown when $anothercrawler->go(); is executed.
Can someone please help? Thank you.
// First crawler instance - this one works fine
$crawler = &new MyCrawler();
$crawler->setURL($url);
$crawler->setFollowMode(1);
$crawler->addReceiveContentType("/text\/html/");
$crawler->addNonFollowMatch("/\.(jpg|gif|png)$/i");
$crawler->setCookieHandling(true);
$crawler->go();

// Second crawler instance - its go() call triggers the fatal error
$anothercrawler = &new MyCrawler();
$anothercrawler->setURL($url);
$anothercrawler->setFollowMode(1);
$anothercrawler->addReceiveContentType("/text\/html/");
$anothercrawler->addNonFollowMatch("/\.(jpg|gif|png)$/i");
$anothercrawler->setCookieHandling(true);
$anothercrawler->go();
Hi alanchau,
I can confirm that issue here; this is a bug!
The original code only creates the PHPCrawlerPageRequest object when the class hasn't been loaded yet, so a second crawler instance skips the whole block and never gets its own pageRequest object. To fix it, simply change the following lines of code in the file "classes/phpcrawler.class.php" (from line 120):

Original code:

// PageRequest-class
if (!class_exists("PHPCrawlerPageRequest"))
{
  include_once($classpath."/phpcrawlerpagerequest.class.php");

  // Initiate a new PageRequestor
  $this->pageRequest = new PHPCrawlerPageRequest();
}

Fixed code:

// PageRequest-class
if (!class_exists("PHPCrawlerPageRequest"))
{
  include_once($classpath."/phpcrawlerpagerequest.class.php");
}

// Initiate a new PageRequestor
$this->pageRequest = new PHPCrawlerPageRequest();

This should work!
And thanks for the report!
Hello huni,
Thank you for your reply. Please do let me know if you accept donations; I would love to support this project.
Also, I have a quick question: if I want to crawl more than one URL in the same script, is it OK to set the URL again and then call $crawler->go(); again? I.e., are all the old variables properly destroyed? For example, something like the following:
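(The URLs here are just placeholders:)

$crawler = &new MyCrawler();
$crawler->setURL("http://www.example.com/");
$crawler->go();

// ... later, re-using the very same instance for a second site
$crawler->setURL("http://www.example.org/");
$crawler->go();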
Also, I would like to mention that I have encountered some websites that give a segmentation fault.
One of them is airsilver.net. For reference, the OS is CentOS 5.4 and the PHP version is 5.2.12. I've noticed that whenever I get a segmentation fault, the site I am crawling is in a foreign language; this may or may not be the cause.
Hello alanchau again,
Thank you very much for your willingness to donate to this project, I appreciate that!
I will enable the donate option when I release a new, up-to-date version of phpcrawl. Thanks!
I don't recommend using the same instance of the crawler class for crawling more than one URL.
Better create a new instance for every URL, so you get a really clean object, like this:
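(A quick sketch with placeholder URLs:)

foreach (array("http://www.example.com/", "http://www.example.org/") as $url)
{
  // create a fresh crawler instance for every URL ...
  $crawler = &new MyCrawler();
  $crawler->setURL($url);
  $crawler->go();

  // ... and drop it before the next run
  unset($crawler);
}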
I just did a quick test and crawled some sites from "airsilver.net"; I don't get any problems or segfaults over here (old Ubuntu 8.04.4, PHP 5.2.4). But I will do some tests in a newer environment soon.
Maybe there is a crawler setting you are using that's causing the problem?
Could you post your setup?
Thanks!
Hello,
This is the script that I am using, and I am running it from the command line:
The following is the output of the above command:
Hi alanchau,
yes, I'm able to reproduce the segfault now!
With PHP 5.2.4 (Ubuntu 8.04.4) and PHP 5.2.3 (Debian 4.0), the segfault occurs over here when phpcrawl tries to extract the contained links from the HTML source of "http://www.airsilver.net/ch27A.html".
The PCRE call (preg_match_all) used for doing that exits with a segmentation fault on that page.
It seems that this was a bug in PHP and/or in the bundled PCRE-library.
(http://bugs.php.net/bug.php?id=41796 / http://bugs.php.net/bug.php?id=45735 …)
Running the same script using PHP 5.2.10 (Ubuntu 9.10), the segfault does NOT occur anymore.
I'm not sure in which version of PHP the bug was fixed exactly, but just try to upgrade PHP to a newer version if possible.
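Apart from upgrading: on affected builds, you can often turn the hard crash into a regular preg error by lowering PCRE's recursion limit. Just a sketch - the pattern here is a generic link regex (not necessarily the exact one phpcrawl uses), and the right limit value depends on your stack size:

<?php
// Keep PCRE's recursion depth low enough for the C stack, so that a
// pathological page makes preg_match_all() fail instead of segfaulting.
ini_set("pcre.recursion_limit", "10000");

$html = file_get_contents("http://www.airsilver.net/ch27A.html");
preg_match_all("/<a[^>]*href\s*=\s*[\"']?([^\"'>\s]+)/i", $html, $matches);

if (preg_last_error() != PREG_NO_ERROR)
{
  echo "PCRE gave up instead of crashing (error code ".preg_last_error().")\n";
}
else
{
  echo "Extracted ".count($matches[1])." links without problems\n";
}
?>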
Thanks for the report again!