
Multiprocess too slow

2015-06-10
2015-06-15
  • John Hamilton

    John Hamilton - 2015-06-10

    Hello,

    Please give me some advice: why is multiprocessing working so slowly?
    Example: if I start multiprocess mode with 20 processes, the first 10-30 seconds are okay: all 20 processes start and work fine (each process takes about 30-40% CPU time, and "htop" shows about 55-75% CPU usage on each core), and the crawl results grow fast in the MySQL database. But after a few seconds everything slows down: each process takes only 1-5% CPU, and "htop" shows only 0-2% CPU usage per core.
    For the first few seconds the results grow by 50-100 results/sec, afterwards by only 3-8 results/sec. If I leave it working for 10 hours, I end up with about 25,000-30,000 total results, slowly growing.
    The network connection, system, Nginx, PHP, and MySQL settings are all tuned for heavy usage.

    I use these settings for PHPCrawl:
    setFollowMode(0);
    setWorkingDirectory("/dev/shm/");
    setUrlCacheType(PHPCrawlerUrlCacheTypes::URLCACHE_SQLITE);
    goMultiProcessed(20, PHPCrawlerMultiProcessModes::MPMODE_CHILDS_EXECUTES_USERCODE);

    Basic information from my server:
    Debian 7 x64
    PHP5.4.39-FPM
    Nginx
    12 CPU Core (2*i3970k)
    64GB RAM -> 19GB /dev/shm/ (from 64GB about 2GB use other applications)
    1Gbps Network, dedicated

    Please tell me what you think the problem is, or how I can check the program's execution step by step.

    Thank you
    John

     

    Last edit: John Hamilton 2015-06-10
  • Uwe Hunfeld

    Uwe Hunfeld - 2015-06-11

    Hi John,

    just a first question/suggestion: did you try to run your crawler setup without your user code (MySQL inserts and further processing)? Maybe it's just the MySQL database you are using that's slowing it all down.
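    The "no user code" test could look roughly like the sketch below: a crawler whose handleDocumentInfo() override does nothing, so any remaining slowdown cannot come from MySQL inserts or other post-processing. The class/option names follow John's snippet and the PHPCrawl docs; the include path and start URL are placeholders.

```php
<?php
// Minimal sketch: same multiprocess settings as before, but with an
// intentionally empty handleDocumentInfo(), so no database work is done.
require_once("PHPCrawl/libs/PHPCrawler.class.php");   // adjust path

class BareCrawler extends PHPCrawler
{
    public function handleDocumentInfo($DocumentInfo)
    {
        // Intentionally empty: no MySQL inserts, no further processing.
    }
}

$crawler = new BareCrawler();
$crawler->setURL("http://www.example.com/");          // placeholder target
$crawler->setFollowMode(0);
$crawler->setWorkingDirectory("/dev/shm/");
$crawler->setUrlCacheType(PHPCrawlerUrlCacheTypes::URLCACHE_SQLITE);
$crawler->goMultiProcessed(20, PHPCrawlerMultiProcessModes::MPMODE_CHILDS_EXECUTES_USERCODE);
```

    If throughput still collapses with this empty handler, the bottleneck is inside the crawl itself (network, target server, URL cache); if it stays fast, the user code is the suspect.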

    Also, did you try it with a different website (hosted on a different webserver)? Some webservers just "go down" when it comes to many simultaneous requests.
    You can also check this by looking at the benchmark properties in the process report (see the benchmark section here: http://phpcrawl.cuab.de/classreferences/PHPCrawlerProcessReport/overview.html) and/or the benchmark properties of the DocumentInfo object of every request (see the benchmark section here: http://phpcrawl.cuab.de/classreferences/PHPCrawlerDocumentInfo/overview.html)

    Please give that a try to exclude these factors.
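    Reading those benchmark figures could look roughly like this sketch. The property names used here (server_connect_time, server_response_time, data_transfer_time, links_followed, process_runtime) are taken from the linked class references; double-check them against your PHPCrawl version before relying on the output.

```php
<?php
// Rough sketch: print per-request timings and the overall process report,
// to see whether the target server (not the crawler) is the bottleneck.
require_once("PHPCrawl/libs/PHPCrawler.class.php");   // adjust path

class TimingCrawler extends PHPCrawler
{
    public function handleDocumentInfo($DocumentInfo)
    {
        // If these per-request timings grow while throughput drops,
        // the remote server is likely throttling or overloaded.
        echo $DocumentInfo->url, " ",
             "connect: ",  $DocumentInfo->server_connect_time, "s ",
             "response: ", $DocumentInfo->server_response_time, "s ",
             "transfer: ", $DocumentInfo->data_transfer_time, "s\n";
    }
}

$crawler = new TimingCrawler();
$crawler->setURL("http://www.example.com/");          // placeholder target
$crawler->go();

// Summary figures after the crawl has finished.
$report = $crawler->getProcessReport();
echo "Links followed: ", $report->links_followed, "\n";
echo "Runtime: ",        $report->process_runtime, " sec\n";
```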

    Best Regards!

     
  • John Hamilton

    John Hamilton - 2015-06-14

    Hi Uwe,

    Everything looks okay, but slow.
    I have no more ideas. :(
    I changed URLCACHE_SQLITE back to URLCACHE_MEMORY, but nothing changed.

    I get a new error:
    Warning: PDO::query(): SQLSTATE[HY000]: General error: 1 near "s": syntax error in /home/xxxxxxxxxx/crawler/PHPCrawl/libs/CookieCache/PHPCrawlerSQLiteCookieCache.class.php on line 83
    Fatal error: Call to a member function fetchAll() on a non-object in /home/xxxxxxxxxx/crawler/PHPCrawl/libs/CookieCache/PHPCrawlerSQLiteCookieCache.class.php on line 84
    Warning: get_headers(): This function may only be used against URLs in /home/xxxxxxxxx/crawler/crawl.php on line 15
    In crawl.php, line 15 is --> get_headers(HOST."/crawler/runCrawl.php");
    I turned off the CookieCache ($C->enableCookieHandling(FALSE);) and now it works, but slowly.
    I'm running 10 processes together now; I also checked with 200, and it looks like the same result after a few hours of work.

    Any ideas, please?

    Or could you take a quick look at my server over TeamViewer, please?

    Best Regards,
    John

     
  • Uwe Hunfeld

    Uwe Hunfeld - 2015-06-15

    Hi John,

    so did you test the two points I posted (running without user code, trying different servers)?

    If you send me (restricted) SSH access to your server (by email), I may take a look in the next few days (no promises, much to do here).

    I will then upload a CLEAN install of phpcrawl, do some benchmarks, and see if something is going wrong, BUT PLEASE understand that I can't give support for user implementations; that's just impossible.

     
