Hi there
I have got to grips with using PHPCrawl to crawl one website, where I manually specified what to do with the data I get back. Now I want to crawl many websites in a row, using data I have stored in my database, e.g. the start URL, follow rules, which tags hold which data, etc.
Has anyone done this before, and if so, how did you go about it? I am thinking of running a loop through my rows and creating an instance of the Crawler class for each website. Not sure if it will work though.
Any thoughts will be greatly appreciated.
Thanks
Hi!
Do you want to crawl your URLs from the database at the same time (in parallel)? Or one at a time?
I want to crawl the URLs from the database one at a time. For each URL I will use the multi-process mode to speed it up, so that part is fine. I'm just not sure how to make it automatically start and finish one URL, then immediately move on to the next one in the database.
Thanks
You can simply loop over your URLs in the database, like this (pseudocode):
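Here is a minimal runnable sketch of that loop. The "websites" table, the "start_url" and "follow_rule" columns, and the connection details are placeholder assumptions, not from the original post; adjust them to your own schema.

<?php
// A minimal sketch, assuming a hypothetical "websites" table with
// "start_url" and "follow_rule" columns -- adjust the query and the
// connection details to your own setup.
require_once("libs/PHPCrawler.class.php");
require_once("MyCrawler.class.php"); // your subclass with handleDocumentInfo()

$db = new PDO("mysql:host=localhost;dbname=crawler", "user", "pass");

foreach ($db->query("SELECT start_url, follow_rule FROM websites") as $row)
{
    $crawler = new MyCrawler();
    $crawler->setURL($row["start_url"]);

    // Apply the per-site follow rule stored as a regex in the database
    if (!empty($row["follow_rule"]))
        $crawler->addURLFollowRule($row["follow_rule"]);

    // Crawl this site with 5 processes (goMultiProcessed() needs the
    // PHP CLI on a POSIX system); when it returns, the loop simply
    // moves on to the next URL from the database.
    $crawler->goMultiProcessed(5);
}
?>

Using require_once (rather than include) means the class files can safely pull in the PHPCrawler library themselves without a "cannot redeclare class" error.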
Your crawler-class (MyCrawler) containing your handleDocumentInfo-method could be in a separate file that gets included in your script.
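For illustration, that separate file could look something like this; the echo is just a placeholder for your own extraction logic:

<?php
// MyCrawler.class.php -- the separate file mentioned above.
require_once("libs/PHPCrawler.class.php");

class MyCrawler extends PHPCrawler
{
    // PHPCrawl calls this for every document the crawler receives
    function handleDocumentInfo(PHPCrawlerDocumentInfo $DocInfo)
    {
        // Placeholder: extract the tags you mapped in your database
        // and store the data; here we just print each URL.
        echo "Crawled: " . $DocInfo->url . "\n";
    }
}
?>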
Just an example. Shouldn't be a problem.
Thanks for the reply, got it working now.
Cheers