Hello,
Please give me some advice on why multiprocessing is working so slowly.
Example: when I start multiprocess mode with 20 processes, the first 10-30 seconds are okay: all 20 processes start and work fine (each process takes about 30-40% CPU time, and htop shows about 55-75% CPU usage per core), and the crawl results grow fast in the MySQL database. But after a few seconds everything slows down: each process takes only 1-5% CPU, and htop shows only 0-2% CPU usage per core.
In the first few seconds the results grow at 50-100 results/sec; after that, only 3-8 results/sec. If I leave it running for 10 hours, I end up with about 25,000-30,000 total results, slowly growing.
The network connection, system, Nginx, PHP, and MySQL settings are all tuned for heavy usage.
I use these settings for PHPCrawl:
$C->setFollowMode(0);
$C->setWorkingDirectory("/dev/shm/");
$C->setUrlCacheType(PHPCrawlerUrlCacheTypes::URLCACHE_SQLITE);
$C->goMultiProcessed(20, PHPCrawlerMultiProcessModes::MPMODE_CHILDS_EXECUTES_USERCODE);
Basic information about my server:
Debian 7 x64
PHP5.4.39-FPM
Nginx
12 CPU cores (2× i3970k)
64 GB RAM; 19 GB allocated to /dev/shm/ (other applications use about 2 GB of the 64 GB)
1Gbps Network, dedicated
Please tell me your ideas about what the problem is, or how I can check the program's execution step by step.
Thank you
John
Last edit: John Hamilton 2015-06-10
Hi John,
Just a first question/suggestion: did you try to run your crawler setup without your usercode (MySQL inserts and further processing)? Maybe it's just the MySQL database you are using that's slowing it all down.
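One way to isolate the database is to run the same configuration with the usercode stubbed out. A minimal sketch, assuming the standard PHPCrawl override API; `BenchmarkCrawler` and the example URL are illustrative:

```php
<?php
// Sketch: same crawler setup, but with the usercode (MySQL inserts)
// replaced by an empty handler, to test whether the database is the
// bottleneck.

class BenchmarkCrawler extends PHPCrawler
{
    function handleDocumentInfo($DocInfo)
    {
        // Intentionally empty: no MySQL inserts, no further processing.
        // If throughput stays high with this stub, the usercode/database
        // is the bottleneck; if it still collapses after a few seconds,
        // the crawler setup or the target server is.
    }
}

$C = new BenchmarkCrawler();
$C->setURL("http://www.example.com/"); // placeholder target
$C->setFollowMode(0);
$C->setWorkingDirectory("/dev/shm/");
$C->setUrlCacheType(PHPCrawlerUrlCacheTypes::URLCACHE_SQLITE);
$C->goMultiProcessed(20, PHPCrawlerMultiProcessModes::MPMODE_CHILDS_EXECUTES_USERCODE);
```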
Also, did you try it with a different website (hosted on a different webserver)? Some webservers just "go down" when they get many simultaneous requests.
You can also check this by looking at the benchmark properties in the process report (see the benchmark section here: http://phpcrawl.cuab.de/classreferences/PHPCrawlerProcessReport/overview.html) and/or the benchmark properties of the DocumentInfo object of every request (see the benchmark section here: http://phpcrawl.cuab.de/classreferences/PHPCrawlerDocumentInfo/overview.html).
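Reading those benchmark figures could look like the sketch below. Property names are taken from the linked class references (connect/response/transfer times on the DocumentInfo, totals on the process report); please verify them against your PHPCrawl version:

```php
<?php
// Sketch: print the per-request and overall benchmark numbers to see
// where the time is going. TimingCrawler is illustrative.

class TimingCrawler extends PHPCrawler
{
    function handleDocumentInfo($DocInfo)
    {
        // Per-request benchmarks: if these stay small while overall
        // throughput drops, the slowdown is not in the HTTP transfers.
        echo $DocInfo->url,
             " connect=",  $DocInfo->server_connect_time,  "s",
             " response=", $DocInfo->server_response_time, "s",
             " transfer=", $DocInfo->data_transfer_time,   "s\n";
    }
}

$C = new TimingCrawler();
$C->setURL("http://www.example.com/"); // placeholder target
$C->go();

// Overall benchmarks for the whole run.
$report = $C->getProcessReport();
echo "Pages received: ", $report->files_received,
     " in ", $report->process_runtime, "s\n";
```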
Please give that a try to exclude these factors.
Best Regards!
Hi Uwe,
Everything looks okay, but it's slow.
I have no more ideas. :(
I changed URLCACHE_SQLITE back to URLCACHE_MEMORY, but nothing changed.
I got a new error:
Warning: PDO::query(): SQLSTATE[HY000]: General error: 1 near "s": syntax error in /home/xxxxxxxxxx/crawler/PHPCrawl/libs/CookieCache/PHPCrawlerSQLiteCookieCache.class.php on line 83
Fatal error: Call to a member function fetchAll() on a non-object in /home/xxxxxxxxxx/crawler/PHPCrawl/libs/CookieCache/PHPCrawlerSQLiteCookieCache.class.php on line 84
Warning: get_headers(): This function may only be used against URLs in /home/xxxxxxxxx/crawler/crawl.php on line 15
In crawl.php, line 15 is: get_headers(HOST."/crawler/runCrawl.php");
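A `near "s": syntax error` from SQLite usually means a value containing an apostrophe (e.g. from `it's`) was concatenated straight into the SQL string. As a generic illustration of that failure mode and the usual fix (this is plain PDO/SQLite code, not the actual PHPCrawlerSQLiteCookieCache source):

```php
<?php
// Illustration: an apostrophe in the data breaks string-concatenated SQL;
// a prepared statement passes the value out-of-band, so quoting inside
// the data cannot corrupt the query. Table/column names are made up.

$db = new PDO("sqlite::memory:");
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec("CREATE TABLE cookies (name TEXT, value TEXT)");

$value = "it's broken"; // an apostrophe in the data

// Concatenating $value into the SQL would produce:
//   INSERT INTO cookies (name, value) VALUES ('session', 'it's broken')
// and fail with: General error: 1 near "s": syntax error
// (the same shape as the warning above).

// A prepared statement avoids the problem:
$stmt = $db->prepare("INSERT INTO cookies (name, value) VALUES (?, ?)");
$stmt->execute(array("session", $value));

$row = $db->query("SELECT value FROM cookies")->fetch(PDO::FETCH_ASSOC);
echo $row["value"], "\n"; // prints: it's broken
```

The follow-on `Call to a member function fetchAll() on a non-object` on the next line is consistent with this: `PDO::query()` returned `false` because of the syntax error, and the code then called a method on that `false`.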
I turned off the cookie cache ($C->enableCookieHandling(FALSE);) and now it works, but slowly.
I'm running 10 processes together now; I also checked with 200, and it looks like the same result after a few hours of work.
Any ideas, please?
Or could you take a quick look at my server over TeamViewer, please?
Best Regards,
John
Hi John,
So you tested the two points I posted (running without usercode, trying different servers)?
If you send me (restricted) SSH access to your server (by email), I may take a look these days (no promises, much to do here).
I will then upload a CLEAN install of PHPCrawl and run some benchmarks to see if something is going wrong, BUT PLEASE understand that I can't give support for user implementations; that's just impossible.