Hi there
I have got to grips with using PHPCrawl to crawl one website, where I manually specified what to do with the data I get back. Now I want to crawl many websites in a row, using data I have stored in my database, e.g. the start URL, follow rules, which tags hold which data, etc.
Has anyone done this before, and if so, how did you go about it? I am thinking of running a loop through my rows and creating an instance of the Crawler class for each website. Not sure if it will work though.
Any thoughts will be greatly appreciated.
Thanks
Hi!
Do you want to crawl your URLs from the database at the same time (in parallel)? Or one at a time?
I want to crawl the URLs from the database one at a time. For each URL I will use the multi-process mode to speed it up, so that part is fine. I'm just not sure how to make it automatically start and finish one URL, then immediately move on to the next one in the database.
Thanks
You can simply loop over your URLs in the database, like this (pseudocode):
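Here is a minimal runnable sketch of that loop. The "websites" table, the "start_url" and "follow_rule" columns, and the connection details are placeholder assumptions, not from the original post; adjust them to your own schema.

<?php
// A minimal sketch, assuming a hypothetical "websites" table with
// "start_url" and "follow_rule" columns -- adjust the query and the
// connection details to your own setup.
require_once("libs/PHPCrawler.class.php");
require_once("MyCrawler.class.php"); // your subclass with handleDocumentInfo()

$db = new PDO("mysql:host=localhost;dbname=crawler", "user", "pass");

foreach ($db->query("SELECT start_url, follow_rule FROM websites") as $row)
{
    $crawler = new MyCrawler();
    $crawler->setURL($row["start_url"]);

    // Apply the per-site follow rule stored as a regex in the database
    if (!empty($row["follow_rule"]))
        $crawler->addURLFollowRule($row["follow_rule"]);

    // Crawl this site with 5 processes (goMultiProcessed() needs the
    // PHP CLI on a POSIX system); when it returns, the loop simply
    // moves on to the next URL from the database.
    $crawler->goMultiProcessed(5);
}
?>

Using require_once (rather than include) means the class files can safely pull in the PHPCrawler library themselves without a "cannot redeclare class" error.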
Your crawler-class (MyCrawler) containing your handleDocumentInfo-method could be in a separate file that gets included in your script.
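For illustration, that separate file could look something like this; the echo is just a placeholder for your own extraction logic:

<?php
// MyCrawler.class.php -- the separate file mentioned above.
require_once("libs/PHPCrawler.class.php");

class MyCrawler extends PHPCrawler
{
    // PHPCrawl calls this for every document the crawler receives
    function handleDocumentInfo(PHPCrawlerDocumentInfo $DocInfo)
    {
        // Placeholder: extract the tags you mapped in your database
        // and store the data; here we just print each URL.
        echo "Crawled: " . $DocInfo->url . "\n";
    }
}
?>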
Just an example. Shouldn't be a problem.
Thanks for the reply, got it working now.
Cheers