Hi, if at some point during the crawling procedure I call my custom function that retrieves the HTML body of a page, how do I extract the URLs from that content and insert them into the current crawling phase?
I would like to do it that way because some of the extracted links will already have been crawled, and I don't want them to be crawled again. Otherwise I would just start a new crawler instance.
I tried with:
class SBCrawler extends PHPCrawler {
    function handleDocumentInfo(PHPCrawlerDocumentInfo $p){
        $url = $p->url;
        custom_function($url, $p);
    }

    function crawl($u){
        $C = new SBCrawler();
        ....
        ....
    }
}
and then in the custom_function:
function custom_function($url){
    $body = get_file_contents($body);
    $crawler_instance->LinkFinder->findLinksInHTMLChunk($body);
    die("testing");
}
...but I get the following error:
Fatal error: Call to a member function findLinksInHTMLChunk() on a non-object in C:\xampp\htdocs\PHPCrawl_083\test.php on line 78
Hi! Sorry for my late answer!
You should try $this->LinkFinder->findLinksInHTMLChunk($body) instead of $crawler_instance->LinkFinder->findLinksInHTMLChunk($body).
It looks like $crawler_instance isn't defined anywhere in your code (something missing?).
Also, you don't need the file_get_contents() call at all: it just requests the page again, so you end up with two requests for every document. Simply pass $p->content to your custom_function.
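Putting both hints together, the handler could look something like this. This is only a sketch against PHPCrawl 0.83 as used above, and it assumes LinkFinder is accessible from inside the subclass (which the $this->LinkFinder suggestion implies):

```php
class SBCrawler extends PHPCrawler
{
    function handleDocumentInfo(PHPCrawlerDocumentInfo $p)
    {
        // $p->content already contains the page body, so no extra
        // file_get_contents() request is needed.
        // Handing the HTML to the crawler's own LinkFinder queues any
        // URLs it finds in the current crawl, and links that were
        // already crawled are skipped by the crawler itself.
        $this->LinkFinder->findLinksInHTMLChunk($p->content);

        custom_function($p->url, $p);
    }
}
```

Done this way, custom_function() no longer needs a crawler instance at all and can just work with the $p it receives.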
Just a hint.
Last edit: Uwe Hunfeld 2015-08-29