Run on multiple machines

  • Anonymous

    Anonymous - 2013-12-01

    Hi,

    Is it possible to run the crawler on multiple machines sharing one database?
    If not, would it be difficult to implement such a feature?

    Regards,
    Jim

  • Anonymous

    Anonymous - 2013-12-02

    Hi Jim,

    what do you mean by "sharing one database"?

    Of course you can use phpcrawl on multiple machines and let each instance put
    its results into one single MySQL database running on a different server.

    But that's not part of phpcrawl itself, that's part of the user implementation.
    phpcrawl just returns its results (found links, page information etc.), and it's up to you what you do with that information.
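
    Just to illustrate what such a user implementation could look like (a sketch only, assuming phpcrawl 0.8's handleDocumentInfo() callback; the database host, credentials and the "pages" table are made-up placeholders):

    <?php
    // Sketch only: every crawler machine writes its results straight into one
    // shared MySQL database. Host, credentials and the "pages" table are
    // placeholders, not anything phpcrawl ships with.
    require("libs/PHPCrawler.class.php");

    class SharedDbCrawler extends PHPCrawler
    {
      private $db;

      function __construct()
      {
        parent::__construct();
        // Central MySQL server reachable from every crawler machine
        $this->db = new PDO("mysql:host=central-db.example;dbname=crawl", "user", "pass");
      }

      // phpcrawl calls this for every received document
      function handleDocumentInfo($DocInfo)
      {
        $stmt = $this->db->prepare(
          "INSERT INTO pages (url, http_status, content_type) VALUES (?, ?, ?)");
        $stmt->execute(array($DocInfo->url,
                             $DocInfo->http_status_code,
                             $DocInfo->content_type));
      }
    }

    $crawler = new SharedDbCrawler();
    $crawler->setURL("www.example.com");
    $crawler->addContentTypeReceiveRule("#text/html#");
    $crawler->go();
    ?>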

  • Pan European

    Pan European - 2013-12-02

    I have this implemented. What I have done, to make the crawler efficient, is to have each server use its own database. When the crawl completes it dumps its data into a central database running on a high-end server (lots of RAM). The crawler(s) then empty their own database, ready for the next run. I can then connect to the central database to run off reports when needed. I have had to extend the core code quite a bit, firstly to add MySQL functionality and secondly to give it the ability to archive the data it collects.
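
    A rough sketch of that "dump into the central database, then empty the local one" step, done with plain PDO; the hosts, credentials and the "pages" table layout are placeholder assumptions, not the actual extended code:

    <?php
    // Copy the local crawler results into the central database, then empty the
    // local table so it is ready for the next run.
    $local   = new PDO("mysql:host=localhost;dbname=crawl_local", "user", "pass");
    $central = new PDO("mysql:host=central-db.example;dbname=crawl_archive", "user", "pass");

    $insert = $central->prepare(
      "INSERT INTO pages (url, http_status, content_type, crawled_at) VALUES (?, ?, ?, ?)");

    // Move this run's rows over in one transaction
    $central->beginTransaction();
    foreach ($local->query("SELECT url, http_status, content_type, crawled_at FROM pages") as $row) {
      $insert->execute(array($row["url"], $row["http_status"],
                             $row["content_type"], $row["crawled_at"]));
    }
    $central->commit();

    // Local database is now empty and ready for the next crawl
    $local->exec("TRUNCATE TABLE pages");
    ?>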

    • Anonymous

      Anonymous - 2013-12-02

      That's exactly what I need. But a few questions about this: why don't you use a single DB that gathers all the collected data from each crawling machine? It looks like you only have results once a crawling machine has finished its crawl and dumped its DB into the main DB, so you have to wait until each crawl machine has finished before you can run your reports.

      How do you balance the workload? Does one machine perform one crawl job (one submitted URL), or do you distribute the crawls somehow?

  • Pan European

    Pan European - 2013-12-03

    The reason I use a central database to store all the collected results and individual databases for each crawler is performance. Think about it: if you have 10 servers running crawlers and only 1 database that they are all using, that database will grow really fast. Queries are going to take longer to run, and lots of read/write requests are going to be performed, slowing each crawler down dramatically. Also, if the database server went offline, all crawlers would die.

    On the central database I have a table of seed URLs. Each crawler will read from this table when initiated and pick a URL as a starting point. It then empties out its own database from the last run and starts crawling. Once the crawl has completed, it drops its data into the central database.
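
    For illustration only, the seed-table handshake could look roughly like this, assuming an InnoDB "seeds" table with id, url and claimed_by columns in the central database (all table names, columns and credentials here are made up):

    <?php
    require("libs/PHPCrawler.class.php");

    // Crawler subclass that stores each received page into the local database
    class LocalDbCrawler extends PHPCrawler
    {
      public $local_db;

      function handleDocumentInfo($DocInfo)
      {
        $stmt = $this->local_db->prepare(
          "INSERT INTO pages (url, http_status, content_type, crawled_at) VALUES (?, ?, ?, NOW())");
        $stmt->execute(array($DocInfo->url, $DocInfo->http_status_code, $DocInfo->content_type));
      }
    }

    $central = new PDO("mysql:host=central-db.example;dbname=crawl_archive", "user", "pass");
    $local   = new PDO("mysql:host=localhost;dbname=crawl_local", "user", "pass");
    $me      = gethostname();

    // Claim one unclaimed seed inside a transaction so two crawlers starting at
    // the same time never pick the same URL (FOR UPDATE needs InnoDB row locks)
    $central->beginTransaction();
    $seed = $central->query(
      "SELECT id, url FROM seeds WHERE claimed_by IS NULL ORDER BY id LIMIT 1 FOR UPDATE")->fetch();
    if ($seed) {
      $claim = $central->prepare("UPDATE seeds SET claimed_by = ? WHERE id = ?");
      $claim->execute(array($me, $seed["id"]));
    }
    $central->commit();

    if ($seed) {
      // Empty out the last run's data, then start crawling the claimed seed URL
      $local->exec("TRUNCATE TABLE pages");

      $crawler = new LocalDbCrawler();
      $crawler->local_db = $local;
      $crawler->setURL($seed["url"]);
      $crawler->go();
    }
    ?>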

