PHP Memory Limit Error WITH SQLite Caching On

Help
josepilove
2012-10-18
2013-04-09
  • josepilove
    2012-10-18

    So I have this set:
    $crawler->setUrlCacheType(PHPCrawlerUrlCacheTypes::URLCACHE_SQLITE);

    but I am still getting an error where PHP runs out of memory. It seems like the SQLite cache isn't being used. I am crawling a huge site, so I need the URL cache to stay out of PHP's memory.

    Any advice? What am I missing?

     
  • Hi!

    Could you please post your complete crawler-setup?
    Then I'll take a look at it.

    And what's the error message you get?

    Thx!

     
  • josepilove
    2012-10-18

    <?php
    // Crawl php.net and append one CSV row per document (status code, depth, relative URL).
    $url = "http://php.net";
    include("libs/PHPCrawler.class.php");

    class MyCrawler extends PHPCrawler
    {
        function handleDocumentInfo($DocInfo)
        {
            $url = "http://php.net";
            $fp = fopen('text.csv', 'a');

            // Strip the base URL and derive the page "depth" from the number of slashes.
            $url_ = str_replace($url, '', $DocInfo->url);
            $level = substr_count($url_, '/');
            $level = $level - 1;

            // Build the CSV row: empty cell, HTTP status, depth, padding cells, relative URL.
            $input = array();
            array_push($input, '');
            array_push($input, $DocInfo->http_status_code);
            array_push($input, $level + 1);
            for ($i = 0; $i <= $level; $i++) {
                array_push($input, '');
            }
            array_push($input, $url_);
            fputcsv($fp, $input);
            fclose($fp); // explicitly close the file handle after writing the row

            echo $url_."\n";
            flush();
        }
    }

    $crawler = new MyCrawler();
    $crawler->setURL($url);
    $crawler->addURLFilterRule("#\.(jpg|jpeg|gif|png)$# i");
    $crawler->enableCookieHandling(true);
    $crawler->setUrlCacheType(PHPCrawlerUrlCacheTypes::URLCACHE_SQLITE);

    if (isset($url)) {
        echo "Crawling ".$url." this might take a while...";
        $crawler->go();
    }

    $report = $crawler->getProcessReport();

    if (PHP_SAPI == "cli") $lb = "\n";
    else $lb = "<br />";

    echo "Summary:".$lb;
    echo "Links followed: ".$report->links_followed.$lb;
    echo "Documents received: ".$report->files_received.$lb;
    echo "Bytes received: ".$report->bytes_received." bytes".$lb;
    echo "Process runtime: ".$report->process_runtime." sec".$lb;
    ?>
    
     
  • josepilove
    2012-10-18

    Error: PHP Fatal error: Allowed memory size of 512008042 bytes exhausted.

     
  • Your setup looks ok, that's strange.

    I'll take a closer look at the problem and the crawler's insides tomorrow (I guess), I have no time right now.
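
    One thing you could check in the meantime: whether the SQLite cache file actually gets written to disk. A rough sketch (assuming setWorkingDirectory() and setPageLimit() behave as documented; the path is just an example):

    <?php
    include("libs/PHPCrawler.class.php");

    // Minimal test crawler that does nothing with the received documents.
    class CacheTestCrawler extends PHPCrawler
    {
        function handleDocumentInfo($DocInfo) { }
    }

    $crawler = new CacheTestCrawler();
    $crawler->setURL("http://php.net");
    $crawler->setUrlCacheType(PHPCrawlerUrlCacheTypes::URLCACHE_SQLITE);

    // Assumption: setWorkingDirectory() points the crawler's temporary data
    // (including the SQLite URL cache) at this example directory (create it first).
    $crawler->setWorkingDirectory("/tmp/phpcrawl_test/");

    $crawler->setPageLimit(10); // a short test run is enough
    $crawler->go();
    ?>

    If that directory stays empty after the run, the SQLite cache really isn't being used.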

    Best regards!

     
  • Hi josepilove,

    Ok, I just took a closer look at your problem and indeed detected a (small) memory leak in phpcrawl (with SQLite caching enabled).
    But it really doesn't eat that much memory over here (Ubuntu, PHP 5.3.10), so how did you get it to claim 512 MB of memory?
    When does your script reach that limit, like after what number of crawled pages?

    And could you do me a favour and add "echo memory_get_usage()" to your handleDocumentInfo-method and post the output of it for the first 20 pages or so?
    So I can compare it with my results and see if there's an even bigger leak under OSX (I used php.net for testing, btw).

    public function handleDocumentInfo($DocInfo)
    {
        // ...
        echo memory_get_usage()."\n"; // print current memory usage after each document
    }
    

    Thanks!

     
  • josepilove
    2012-10-22

    Thanks!

    Here is the output for the first 20.

    3123424
    /downloads.php
    3236680
    /docs.php
    2661648
    /FAQ.php
    2461944
    /support.php
    2463528
    /mailing-lists.php
    2799408
    /license
    2540448
    /sites.php
    2487152
    /conferences/
    3268248
    /my.php
    3143432
    /tut.php
    2403152
    /usage.php
    2425816
    /thanks.php
    2586776
    /feed.atom
    2483056
    /submit-event.php
    2476184
    /cal.php?id=5394
    3268808
    /cal.php?id=5435
    3937744
    /cal.php?id=4648
    3935424
    /cal.php?id=409
    3935680
    /cal.php?id=384
    3937456
    /cal.php?id=3075
    3940528
    /cal.php?id=3653

     
  • Hi!

    Looks similar over here. Are you using PHP 5.3.1x?
    This is a difficult one, it seems that something changed since PHP 5.3.x regarding memory management (garbage collection) that's causing the leak.
    I tried to find the culprit in phpcrawl, but no luck so far. It's a little difficult since there's no real memory profiler for PHP applications out there, but I'm gonna find it and I'll let you know.

    Thanks for the report by the way!

     
  • josepilove
    2012-10-22

    I'm using PHP 5.3.15.

    Again, thanks so much. Let me know if there is anything else I can do to help.

     
  • Hey josepilove,

    I think I got it (took me the whole damn night ;) )

    Unfortunately there seems to be a bug (a memory leak) in PHP's stream_socket_client() function (which phpcrawl uses since version 0.81), as described here: https://bugs.php.net/bug.php?id=61371

    I don't know if, and as of what version, they fixed it (the report is a little confusing on that).

    Could you please run the test-script from the mentioned bug report on your machine to see if your version of PHP is affected by the leak?
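
    From memory, the test there boils down to opening connections in a loop and watching memory_get_usage(); roughly something like this (just a sketch, not the exact script from the report, host and iteration counts are examples):

    <?php
    // Open and close a connection repeatedly and print the memory usage now and then.
    // If stream_socket_client() leaks, the reported memory keeps growing.
    for ($i = 0; $i < 100; $i++)
    {
        $context = stream_context_create();
        $socket = @stream_socket_client("tcp://www.php.net:80", $errno, $errstr, 5,
                                        STREAM_CLIENT_CONNECT, $context);
        if ($socket !== false) fclose($socket);

        if ($i % 20 == 0)
            echo "memory: ".round(memory_get_usage() / 1024)."kb\n";
    }
    ?>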

    Thank you very much for your help and the report in general!!

    PS: You may use phpcrawl 0.80 until the problem gets solved (somehow), it doesn't use that leaking function.

     
  • josepilove
    2012-10-23

    Here is the output of that test script:

    memory: 618kb
    memory: 619kb
    memory: 619kb
    memory: 619kb
    memory: 619kb

    I'm going to give 0.80 a try.

    Thanks for all of your help!

     
    Hm, seems that your version of PHP is not affected by this leak.
    Let me know if v 0.80 works for you.

     
  • josepilove
    2012-10-24

    Still running into the memory limit error.

    I am trying to crawl surgery.org, not php.net. I still only get through about 600-800 pages.

     
  • Shit ;)

    Looks like another memory leak somewhere (or a problem with SQLite on your system).
    It's really strange that this occurs after 600-800 pages, it never happened here and I never heard about
    anything similar before.

    I'll take a closer look at it in a few days, I'm away for some days now.
    Hopefully I can detect something when testing the crawler on that site.

     
  • josepilove
    2012-10-26

    Last night I tried the test interface and it just finished. Here is the output:

    Process finished!
    Links followed: 26761
    Kb received: 1229073
    Data throughput (kb/s): 30
    Files received: 14847
    Time in sec: 40616.46
    Max memory-usage in KB: 340224.00

     
  • Hi,

    Sorry again for my (really) late answer, but I just couldn't find anything regarding another memory leak.

    BUT: I just noticed that you let the crawler receive EVERY type of document in your script.
    Maybe along the way the crawler tries to receive a huge file into local memory and hits the memory limit with that.

    You should add $crawler->addContentTypeReceiveRule("#text/html#"); to your setup, this lets the crawler
    receive ONLY HTML documents.
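
    Roughly like this (just a sketch of the relevant directives, adapted from your setup above, with surgery.org as the start URL):

    <?php
    include("libs/PHPCrawler.class.php");

    // MyCrawler is the crawler class from your first post.
    $crawler = new MyCrawler();
    $crawler->setURL("http://surgery.org");
    $crawler->addURLFilterRule("#\.(jpg|jpeg|gif|png)$# i");
    $crawler->enableCookieHandling(true);
    $crawler->setUrlCacheType(PHPCrawlerUrlCacheTypes::URLCACHE_SQLITE);

    // New: only receive documents served as text/html, so the crawler
    // doesn't pull large binary files into PHP memory.
    $crawler->addContentTypeReceiveRule("#text/html#");

    $crawler->go();
    ?>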

    Do you know the exact URL the crawler tried to process before it reached the memory-limit?

     

