Grab links from custom HTML and add them to current crawl instance

Help
Anonymous
2015-08-16
2015-08-29
  • Anonymous

    Anonymous - 2015-08-16

    Hi, if somewhere during the crawl I call a custom function that retrieves the HTML body of a page, how do I extract the URLs from that content and add them to the current crawl?
    I would like to do it that way because some of the extracted links will already have been crawled, and I don't want them crawled again. Otherwise I would just start a new crawler instance.

    I tried with:

    class SBCrawler extends PHPCrawler {

        function handleDocumentInfo(PHPCrawlerDocumentInfo $p) {
            $url = $p->url;
            custom_function($url, $p);
        }
    }

    function crawl($u) {
        $C = new SBCrawler();
        ....
        ....
    }

    and then in the custom_function:

    function custom_function($url){
        $body = get_file_contents($body);
        $crawler_instance->LinkFinder->findLinksInHTMLChunk($body);
        die("testing");
    }

    ...but I get the following error:
    Fatal error: Call to a member function findLinksInHTMLChunk() on a non-object in C:\xampp\htdocs\PHPCrawl_083\test.php on line 78

     
  • Uwe Hunfeld

    Uwe Hunfeld - 2015-08-29

    Hi!

    Sorry for my late answer!

    You should try $this->LinkFinder->findLinksInHTMLChunk($body) instead of $crawler_instance->LinkFinder->findLinksInHTMLChunk($body)

    Seems like $crawler_instance isn't defined in your code (something missing?).
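To illustrate the scoping point, here is a minimal sketch using stand-in classes. `DemoCrawler`, `DemoLinkFinder`, `findLinks()` and `demo_custom_function()` are hypothetical names invented for this example and are not part of PHPCrawl; whether PHPCrawl's own LinkFinder can be reached the same way is not shown here. The sketch only demonstrates that a standalone function cannot see the crawler object unless it is passed in, which is why the call on `$crawler_instance` fails with "non-object".

```php
<?php
// Stand-in classes for illustration only; they are not PHPCrawl classes.
class DemoLinkFinder
{
    // Returns the raw href values found in the given HTML string.
    public function findLinks($html)
    {
        preg_match_all('/href="([^"]*)"/i', $html, $matches);
        return $matches[1];
    }
}

class DemoCrawler
{
    public $LinkFinder;

    public function __construct()
    {
        $this->LinkFinder = new DemoLinkFinder();
    }

    public function handleDocument($html)
    {
        // Inside a method, $this is the current object, so it can be
        // handed to an outside function as an ordinary argument.
        return demo_custom_function($html, $this);
    }
}

// A function only sees its own parameters and local variables. The crawler
// has to be passed in; otherwise $crawler would be undefined (null) here.
function demo_custom_function($html, DemoCrawler $crawler)
{
    return $crawler->LinkFinder->findLinks($html);
}
```

With this arrangement the helper receives both the document content and the object it needs, instead of relying on a variable that does not exist in its scope.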

    Also, you don't need the get_file_contents($body) call (presumably meant as file_get_contents()); it just requests the page again, so every document is fetched twice.
    Simply pass $p->content to your custom_function instead.

    Just a hint.
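As a rough sketch of the "extract links from content you already have, then skip the ones already crawled" idea from this thread: the helpers below use PHP's bundled DOM extension rather than PHPCrawl's own link finder. `extract_links()` and `filter_unseen()` are hypothetical names made up for this example, and the list of already-seen URLs is assumed to be tracked by your own code.

```php
<?php
// Hypothetical helper, not part of PHPCrawl: collects the href values of
// all <a> elements in an HTML string, in document order, without duplicates.
function extract_links($html)
{
    // DOMDocument::loadHTML() rejects empty input, so bail out early.
    if (trim($html) === '') {
        return array();
    }

    $doc = new DOMDocument();
    // Real-world markup is rarely valid; keep libxml warnings out of the output.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = array();
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        if ($href !== '') {
            $links[] = $href;
        }
    }
    return array_values(array_unique($links));
}

// Hypothetical helper: drops URLs that are already in a "seen" list.
function filter_unseen(array $links, array $seen)
{
    return array_values(array_diff($links, $seen));
}
```

Inside handleDocumentInfo() this could be called as extract_links($p->content), so no second request is made. Note that the hrefs are returned as written in the page; relative links would still need to be resolved against the page URL before being compared or crawled.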

     

    Last edit: Uwe Hunfeld 2015-08-29
