PHPCrawl / Forum / Help: Grab links from custom HTML and add them to current crawl instance

Grab links from custom HTML and add them to current crawl instance

Forum: Help

Creator: Anonymous

Created: 2015-08-16

Updated: 2015-08-29

Anonymous - 2015-08-16

Hi, if at some point during the crawl I call a custom function that retrieves the HTML body of a page, how do I extract the URLs from that content and add them to the current crawl?
I would like to do it this way because some of the extracted links will already have been crawled, and I don't want them crawled again. Otherwise I would just start a new crawler instance.

I tried with:

class SBCrawler extends PHPCrawler {

function handleDocumentInfo(PHPCrawlerDocumentInfo $p){
$url= $p->url;
custom_function($url, $p);
}

function crawl($u){
$C = new SBCrawler();
....
....
}

and then in the custom_function:

function custom_function($url){
$body = get_file_contents($body);
$crawler_instance->LinkFinder->findLinksInHTMLChunk($body); die("testing");
}

...but I get the following error:
Fatal error: Call to a member function findLinksInHTMLChunk() on a non-object in C:\xampp\htdocs\PHPCrawl_083\test.php on line 78

Uwe Hunfeld - 2015-08-29

Hi!

Sorry for my late answer!

You should try $this->LinkFinder->findLinksInHTMLChunk($body) instead of $crawler_instance->LinkFinder->findLinksInHTMLChunk($body).

It seems that $crawler_instance isn't defined in your code (is something missing?).

Also, you don't have to call get_file_contents($body); this will just request the page again, so you end up with two requests for every document.
Simply pass $p->content to your custom_function instead.

Just a hint.
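
For illustration, here is a minimal, self-contained sketch of the "extract links from an HTML string and skip the ones already seen" idea. It does not use PHPCrawl's own LinkFinder and does not add anything to the crawler's queue; it relies only on PHP's built-in DOMDocument. The function name extract_new_links and the $seen array are hypothetical, not part of PHPCrawl. Inside handleDocumentInfo you could presumably call it with $p->content as the HTML argument.

```php
<?php
// Hypothetical helper (not part of PHPCrawl): returns the href values found
// in $html, skipping duplicates and any URL that is already a key in $seen.
function extract_new_links(string $html, array $seen): array
{
    // DOMDocument::loadHTML() rejects an empty string, so bail out early.
    if (trim($html) === '') {
        return [];
    }

    // Collect parser warnings internally; real-world HTML is often malformed.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $new = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        // Skip empty hrefs, URLs seen earlier, and repeats within this page.
        if ($href === '' || isset($seen[$href]) || isset($new[$href])) {
            continue;
        }
        $new[$href] = true;
    }
    return array_keys($new);
}

// Usage sketch with made-up example URLs.
$html = '<p><a href="http://example.com/a">A</a> '
      . '<a href="http://example.com/b">B</a> '
      . '<a href="http://example.com/a">A again</a></p>';
$seen = ['http://example.com/b' => true]; // pretend this one was crawled already
$links = extract_new_links($html, $seen);
print_r($links);
```

Keeping the seen URLs as array keys makes the "already crawled?" check a simple isset() lookup. Relative links are returned as written; resolving them against the page URL is left out of this sketch.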

Last edit: Uwe Hunfeld 2015-08-29
