Hi,
is it possible to run the crawler on multiple machines sharing one database?
If not: Would it be difficult to implement such a feature?
Regards,
Jim
Hi Jim,
what do you mean by "sharing one database"?
Of course you can run phpcrawl on multiple machines and let each instance put
its results into one single MySQL database running on a different server.
But that's not part of phpcrawl itself, that's part of the user-implementation.
phpcrawl just returns its results (found links, page information etc.), and it's up to you what you do with that information.
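For example, a minimal user-implementation that pushes every crawled document straight into a MySQL database on another machine could look roughly like the sketch below. It assumes phpcrawl 0.8's handleDocumentInfo() callback; the "crawl_results" table, its columns, the include path and the connection details are just placeholders for this example.

<?php
// Minimal sketch: one crawler instance writing its results into a
// MySQL database running on a different server.
require("libs/PHPCrawler.class.php"); // path is an assumption

class MyCrawler extends PHPCrawler
{
  private $db;

  public function __construct()
  {
    parent::__construct();
    // Central MySQL server (host, credentials and schema are placeholders)
    $this->db = new PDO("mysql:host=db.example.com;dbname=crawl", "user", "pass");
  }

  // Called by phpcrawl for every document it receives
  public function handleDocumentInfo($DocInfo)
  {
    $stmt = $this->db->prepare(
      "INSERT INTO crawl_results (url, http_status, content) VALUES (?, ?, ?)"
    );
    $stmt->execute(array($DocInfo->url, $DocInfo->http_status_code, $DocInfo->source));
  }
}

$crawler = new MyCrawler();
$crawler->setURL("http://www.example.com");
$crawler->go();
?>

You would run one such script per machine; the only thing they share is the database they write to.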
I have this implemented. What I have done, to make the crawler efficient, is to have each server use its own database. When a crawl completes, the crawler dumps its data into a central database running on a high-end server (lots of RAM). The crawler(s) then empty their own database, ready for the next run. I can then connect to the central database to run off reports when needed. I have had to extend the core code quite a bit, firstly to add MySQL functionality, and then the ability to archive the data it collects.
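Roughly, the post-crawl hand-off could look like the following sketch (not the actual extended code; the "crawl_results" table, its columns and the connection details are assumptions): copy the local results to the central server, then empty the local table for the next run.

<?php
// Sketch: archive a finished crawl into the central database, then
// clear the crawler's local database.
$local   = new PDO("mysql:host=localhost;dbname=crawl_local", "user", "pass");
$central = new PDO("mysql:host=db.example.com;dbname=crawl_archive", "user", "pass");

$insert = $central->prepare(
  "INSERT INTO crawl_results (url, http_status, content, crawled_at) VALUES (?, ?, ?, ?)"
);

// Stream the local rows across instead of loading them all into memory
foreach ($local->query("SELECT url, http_status, content, crawled_at FROM crawl_results") as $row) {
  $insert->execute(array($row['url'], $row['http_status'], $row['content'], $row['crawled_at']));
}

// Local data is now archived centrally; clear it for the next run
$local->exec("TRUNCATE TABLE crawl_results");
?>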
That's exactly what I need. But a few questions about this: why don't you use a single DB that gathers all the collected data from each crawling machine? It looks like you only have results once a crawling machine has finished its crawl and dumped its DB into the main DB, so you have to wait until every crawl machine has finished before you can run your reports.
How do you balance the workload? Does one machine perform one crawl job (one submitted URL), or do you distribute the crawls somehow?
The reason I use a central database to store all the collected results and individual databases for each crawler is performance. Think about it: if you have 10 servers running crawlers and only one database that they are all using, that database will grow really fast. Queries are going to take longer to run, and the constant read/write requests will slow each crawler down dramatically. Also, if the database server went offline, all crawlers would die.
On the central database I have a table of seed URLs. Each crawler reads from this table when initiated and picks a URL as a starting point. It then empties out its own database from the last run and starts crawling. Once the crawl has completed, it drops its data into the central database.
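As a rough sketch, the start of a run could look something like this (the "seed_urls" table, its columns, the locking scheme and the MyCrawler class from the earlier sketch are assumptions, not the actual code): claim a seed URL from the central table, empty the local database, then start crawling.

<?php
// Sketch: pick a seed URL from the central table and kick off a crawl.
$central = new PDO("mysql:host=db.example.com;dbname=crawl_archive", "user", "pass");
$local   = new PDO("mysql:host=localhost;dbname=crawl_local", "user", "pass");

// Claim a seed inside a transaction so two crawlers don't pick the same URL
$central->beginTransaction();
$row = $central->query(
  "SELECT id, url FROM seed_urls WHERE in_progress = 0 LIMIT 1 FOR UPDATE"
)->fetch();
if ($row === false) { $central->rollBack(); exit("No seed URLs to crawl.\n"); }
$central->exec("UPDATE seed_urls SET in_progress = 1 WHERE id = " . (int)$row['id']);
$central->commit();

// Empty the local database from the last run
$local->exec("TRUNCATE TABLE crawl_results");

// Start the crawl from the claimed seed URL (MyCrawler as sketched earlier)
$crawler = new MyCrawler();
$crawler->setURL($row['url']);
$crawler->go();
?>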