How to crawl multiple sites from a database

Help
Moin Haque
2012-11-21
2013-04-09
  • Moin Haque

    Moin Haque - 2012-11-21

    Hi there

    I have got to grips with using PHPCrawl to crawl one website, where I manually entered what to do with the data I get back. Now I want to crawl many websites in a row, using settings stored in my database, for example the start URL, the follow rules, and which tags hold which data.

    Has anyone done this before, and if so, how did you go about doing it? I am thinking of running a loop through my rows, and creating an instance of the Crawler class for each website. Not sure if it will work though.

    Any thoughts will be greatly appreciated.

    Thanks

     
  • Nobody/Anonymous

    Hi!

    Do you want to crawl the URLs from your database at the same time (in parallel), or one at a time?

     
  • Moin Haque

    Moin Haque - 2012-11-22

    I want to crawl the URLs from the database one at a time. For each URL I will use multi-process mode to speed it up, so that part is fine. I'm just not sure how to make it automatically start and finish one URL, then immediately move on to the next one in the database.

    Thanks

     
  • Nobody/Anonymous

    You can simply loop over your URLs in the database, like this (pseudocode):

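    // $db stands for your own database wrapper; getRows() is assumed to return an array of rows
    $rows = $db->getRows("SELECT * FROM myurls");
    foreach ($rows as $row)
    {
      // Create a fresh crawler per site so rules from one row don't carry over to the next
      $crawler = new MyCrawler();
      
      $crawler->setURL($row["url"]);
      $crawler->addURLFilterRule($row["filter_rule"]);
      $crawler->addURLFollowRule($row["follow_rule"]);
      // ...
      $crawler->goMultiProcessed(5); // crawl this site with 5 processes
    }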
    

    Your crawler class (MyCrawler), containing your handleDocumentInfo method, could live in a separate file that gets included in your script.

    Just an example. Shouldn't be a problem.
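
    A slightly fuller sketch of the same loop, in case it helps. The helper name crawlSites() and the factory argument are hypothetical, not part of PHPCrawl; the column names are the ones from the pseudocode above. It assumes that goMultiProcessed() only returns once the crawl of a site is done, which is what lets the sites run strictly one after another. Empty rule columns are skipped rather than passed on as empty patterns.

```php
<?php
/**
 * Configure and run one crawler per database row, one site after another.
 * crawlSites() is a hypothetical helper, not a PHPCrawl function.
 *
 * @param iterable $rows        Rows with "url", "filter_rule" and "follow_rule" keys.
 * @param callable $makeCrawler Returns a new crawler object, e.g. new MyCrawler().
 * @param int      $processes   Number of processes to use per site.
 * @return int Number of sites that were crawled.
 */
function crawlSites(iterable $rows, callable $makeCrawler, int $processes = 5): int
{
    $count = 0;
    foreach ($rows as $row) {
        // Fresh instance per site, so no rules leak from the previous row.
        $crawler = $makeCrawler();
        $crawler->setURL($row['url']);

        // Only add a rule when the column actually holds one.
        if (!empty($row['filter_rule'])) {
            $crawler->addURLFilterRule($row['filter_rule']);
        }
        if (!empty($row['follow_rule'])) {
            $crawler->addURLFollowRule($row['follow_rule']);
        }

        // Assumed to block until this site is finished; then the loop moves on.
        $crawler->goMultiProcessed($processes);
        $count++;
    }
    return $count;
}
```

    You would call it as crawlSites($rows, function () { return new MyCrawler(); }); where $rows is whatever your database layer returns. With PDO, the PDOStatement returned by $pdo->query(...) can be passed in directly, since it is traversable.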

     
  • Moin Haque

    Moin Haque - 2012-11-22

    Thanks for the reply, got it working now.

    Cheers

     

