How to crawl multiple sites from a database

Help
Moin Haque
2012-11-21
2013-04-09
  • Moin Haque
    2012-11-21

    Hi there

    I have got to grips with using PHPCrawl to crawl a single website, where I hard-coded what to do with the data I get back. Now I want to crawl many websites in a row, driven by data stored in my database, e.g. the start URL, the follow rules, which tags hold which data, etc.

    Has anyone done this before, and if so, how did you go about doing it? I am thinking of running a loop through my rows, and creating an instance of the Crawler class for each website. Not sure if it will work though.

    Any thoughts will be greatly appreciated.

    Thanks

     
  • Hi!

    Do you want to crawl the URLs from your database all at the same time (in parallel), or one at a time?

     
  • Moin Haque
    2012-11-22

    I want to crawl the URLs from the database one at a time. For each URL I will use multi-process mode to speed it up, so that part is fine. I'm just not sure how to make it automatically start and finish one URL, then immediately move on to the next one in the database.

    Thanks

     
  • You can simply loop over your URLs in the database, like this (pseudocode):

    // $db->getRows() stands in for whatever database layer you use
    $rows = $db->getRows("SELECT * FROM myurls");
    foreach ($rows as $row)
    {
      // A fresh crawler instance for each site
      $crawler = new MyCrawler();

      $crawler->setURL($row["url"]);
      $crawler->addURLFilterRule($row["filter_rule"]);
      $crawler->addURLFollowRule($row["follow_rule"]);
      // ...
      $crawler->goMultiProcessed(5); // 5 processes; the next site starts when this call returns
    }
    

    Your crawler class (MyCrawler), which contains your handleDocumentInfo() method, could live in a separate file that gets included in your script.

    Just an example. Shouldn't be a problem.
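
To make the loop above a bit more concrete, here is a minimal sketch of a helper that works through the rows one site at a time. It is only an illustration under some assumptions: the function name crawlSitesSequentially and the $makeCrawler factory are hypothetical (not part of PHPCrawl), the row keys url, filter_rule and follow_rule are taken from the pseudocode, and goMultiProcessed() is the method name as I recall it from the PHPCrawl documentation, so check it against your version. The factory would normally just return a new MyCrawler(); passing it in keeps the loop testable without a live crawl.

```php
<?php
// Sketch only: crawl each database row in turn, one site after another.
// $makeCrawler is a hypothetical factory, e.g. function ($row) { return new MyCrawler(); }

function crawlSitesSequentially(iterable $rows, callable $makeCrawler, int $processes = 5): array
{
    $results = [];

    foreach ($rows as $row) {
        // A fresh crawler per site, so rules from one row never leak into the next.
        $crawler = $makeCrawler($row);
        $crawler->setURL($row['url']);

        // Rule columns are assumed to be optional; skip empty ones.
        if (!empty($row['filter_rule'])) {
            $crawler->addURLFilterRule($row['filter_rule']);
        }
        if (!empty($row['follow_rule'])) {
            $crawler->addURLFollowRule($row['follow_rule']);
        }

        try {
            // The loop moves on to the next row only after this call returns.
            $crawler->goMultiProcessed($processes);
            $results[$row['url']] = 'done';
        } catch (Exception $e) {
            // One failing site should not stop the remaining ones.
            $results[$row['url']] = 'failed: ' . $e->getMessage();
        }
    }

    return $results;
}
```

With PHPCrawl loaded and MyCrawler included from its own file, a call could look like crawlSitesSequentially($rows, function ($row) { return new MyCrawler(); }); where $rows comes from your own database query. The returned array maps each start URL to a status string, which is handy for logging which sites finished.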

     

