Hi,
is it possible to run the crawler on multiple machines sharing one database?
If not: Would it be difficult to implement such a feature?
Regards,
Jim
Hi Jim,
what do you mean by "sharing one database"?
Of course you can run phpcrawl on multiple machines and let each instance put
its results into one single MySQL database running on a different server.
But that's not part of phpcrawl itself, that's part of the user-implementation.
phpcrawl just returns its results (found links, page information etc.), and it's up to you what you do with that information.
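For example, a minimal user-implementation that pushes every crawled document straight into a MySQL database on another machine could look roughly like the sketch below. It assumes phpcrawl 0.8's handleDocumentInfo() callback; the "crawl_results" table, its columns, the include path and the connection details are just placeholders for this example.

<?php
// Minimal sketch: one crawler instance writing its results into a
// MySQL database running on a different server.
require("libs/PHPCrawler.class.php"); // path is an assumption

class MyCrawler extends PHPCrawler
{
  private $db;

  public function __construct()
  {
    parent::__construct();
    // Central MySQL server (host, credentials and schema are placeholders)
    $this->db = new PDO("mysql:host=db.example.com;dbname=crawl", "user", "pass");
  }

  // Called by phpcrawl for every document it receives
  public function handleDocumentInfo($DocInfo)
  {
    $stmt = $this->db->prepare(
      "INSERT INTO crawl_results (url, http_status, content) VALUES (?, ?, ?)"
    );
    $stmt->execute(array($DocInfo->url, $DocInfo->http_status_code, $DocInfo->source));
  }
}

$crawler = new MyCrawler();
$crawler->setURL("http://www.example.com");
$crawler->go();
?>

You would run one such script per machine; the only thing they share is the database they write to.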
I have this implemented. What I have done, to make the crawler efficient, is to have each server use its own database. When a crawl completes, the crawler dumps its data into a central database running on a high-end server (lots of RAM). The crawler(s) then empty their own database, ready for the next run. I can then connect to the central database to run off reports when needed. I have had to extend the core code quite a bit, firstly to add MySQL functionality, and then the ability to archive the data it collects.
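Roughly, the post-crawl hand-off could look like the following sketch (not the actual extended code; the "crawl_results" table, its columns and the connection details are assumptions): copy the local results to the central server, then empty the local table for the next run.

<?php
// Sketch: archive a finished crawl into the central database, then
// clear the crawler's local database.
$local   = new PDO("mysql:host=localhost;dbname=crawl_local", "user", "pass");
$central = new PDO("mysql:host=db.example.com;dbname=crawl_archive", "user", "pass");

$insert = $central->prepare(
  "INSERT INTO crawl_results (url, http_status, content, crawled_at) VALUES (?, ?, ?, ?)"
);

// Stream the local rows across instead of loading them all into memory
foreach ($local->query("SELECT url, http_status, content, crawled_at FROM crawl_results") as $row) {
  $insert->execute(array($row['url'], $row['http_status'], $row['content'], $row['crawled_at']));
}

// Local data is now archived centrally; clear it for the next run
$local->exec("TRUNCATE TABLE crawl_results");
?>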
That's exactly what I need. But a few questions about this: why don't you use a single DB that gathers all the collected data from each crawling machine? It looks like you only have results once a crawling machine has finished its crawl and dumped its DB into the main DB, so you have to wait until every crawl machine has finished before you can run your reports.
How do you balance the workload? Does one machine perform one crawl job (one submitted URL), or do you distribute the crawls somehow?
The reason I use a central database to store all the collected results and individual databases for each crawler is performance. Think about it: if you have 10 servers running crawlers and only one database that they are all using, that database will grow really fast. Queries are going to take longer to run, and the constant read/write requests will slow each crawler down dramatically. Also, if the database server went offline, all crawlers would die.
On the central database I have a table of seed URLs. Each crawler reads from this table when initiated and picks a URL as a starting point. It then empties out its own database from the last run and starts crawling. Once the crawl has completed, it drops its data into the central database.
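As a rough sketch, the start of a run could look something like this (the "seed_urls" table, its columns, the locking scheme and the MyCrawler class from the earlier sketch are assumptions, not the actual code): claim a seed URL from the central table, empty the local database, then start crawling.

<?php
// Sketch: pick a seed URL from the central table and kick off a crawl.
$central = new PDO("mysql:host=db.example.com;dbname=crawl_archive", "user", "pass");
$local   = new PDO("mysql:host=localhost;dbname=crawl_local", "user", "pass");

// Claim a seed inside a transaction so two crawlers don't pick the same URL
$central->beginTransaction();
$row = $central->query(
  "SELECT id, url FROM seed_urls WHERE in_progress = 0 LIMIT 1 FOR UPDATE"
)->fetch();
if ($row === false) { $central->rollBack(); exit("No seed URLs to crawl.\n"); }
$central->exec("UPDATE seed_urls SET in_progress = 1 WHERE id = " . (int)$row['id']);
$central->commit();

// Empty the local database from the last run
$local->exec("TRUNCATE TABLE crawl_results");

// Start the crawl from the claimed seed URL (MyCrawler as sketched earlier)
$crawler = new MyCrawler();
$crawler->setURL($row['url']);
$crawler->go();
?>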