So I have this set:
$crawler->setUrlCacheType(PHPCrawlerUrlCacheTypes::URLCACHE_SQLITE);
but I am still getting an error where PHP runs out of memory. It seems like SQLite isn't being used… I am crawling a huge site and I need to keep the URL cache out of PHP's memory.
Any advice? What am I missing?
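The rest of my script is basically the stock example, roughly like this (trimmed down, my handleDocumentInfo() code left out):

<?php
// Stripped-down sketch of my setup (the include path may differ)
include("libs/PHPCrawler.class.php");

class MyCrawler extends PHPCrawler
{
  public function handleDocumentInfo($DocInfo)
  {
    // ... process the received page here ...
  }
}

$crawler = new MyCrawler();
$crawler->setURL("www.example.com"); // placeholder, the real target is a huge site
$crawler->setUrlCacheType(PHPCrawlerUrlCacheTypes::URLCACHE_SQLITE);
$crawler->go();
?>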
Hi!
Could you please post your complete crawler-setup?
Then I'll take a look at it.
And what's the error message you get?
Thanks!
Error: PHP Fatal error: Allowed memory size of 512008042 bytes exhausted.
Your setup looks OK, that's strange.
I'll take a closer look at the problem and the internals tomorrow (I guess); I have no time right now.
Best regards!
Thanks,
FWIW, I am running this on OS X 10.8.2.
Hi josepilove,
OK, I just took a closer look at your problem and indeed detected a (small) memory leak in phpcrawl (with SQLite caching enabled).
But it really doesn't eat that much memory over here (Ubuntu, PHP 5.3.10), so how did you get it to claim 512 MB of memory?
When does your script reach that limit, i.e. after how many crawled pages?
And could you do me a favour and add "echo memory_get_usage()" to your handleDocumentInfo() method and post its output for the first 20 pages or so?
That way I can compare it with my results and see whether there's an even bigger leak under OS X (I used "php.net" for testing, by the way).
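Like this:

public function handleDocumentInfo($DocInfo)
{
  // ... your existing handling code ...

  // $lb is just your line-break ("\n" on the CLI, "<br />" in a browser)
  echo memory_get_usage().$lb;
}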
Thanks!
Thanks!
Here is the output for the first 20.
3123424   /downloads.php
3236680   /docs.php
2661648   /FAQ.php
2461944   /support.php
2463528   /mailing-lists.php
2799408   /license
2540448   /sites.php
2487152   /conferences/
3268248   /my.php
3143432   /tut.php
2403152   /usage.php
2425816   /thanks.php
2586776   /feed.atom
2483056   /submit-event.php
2476184   /cal.php?id=5394
3268808   /cal.php?id=5435
3937744   /cal.php?id=4648
3935424   /cal.php?id=409
3935680   /cal.php?id=384
3937456   /cal.php?id=3075
3940528   /cal.php?id=3653
Hi!
Looks similar over here. Are you using PHP 5.3.1x?
This is a difficult one; it seems that something changed since PHP 5.3.x regarding memory management (garbage collection) that's causing the leak.
I tried to find the culprit in phpcrawl, but no luck so far. It's a little difficult, as I said, since there's no real memory profiler for PHP applications out there, but I'm going to find it and I'll let you know.
Thanks for the report by the way!
I'm using PHP 5.3.15.
Again, thanks so much. Let me know if there is anything else I can do to help.
Hey josepilove,
I think I got it (took me the whole damn night ;) ).
Unfortunately there seems to be a bug (memory leak) in PHP's stream_socket_client() function (which phpcrawl uses since version 0.81), as described here: https://bugs.php.net/bug.php?id=61371
I don't know if and since what version they fixed it (the report is a little confusing about that).
Could you please run the test script from the mentioned bug report on your machine to see if your version of PHP is affected by the leak?
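If you can't spot the test script in that bug report right away, it boils down to something like this (just a rough sketch, NOT the exact script from the report):

<?php
// Rough sketch: open and close a socket connection a few hundred times
// and watch the memory usage (the exact script is in the bug report).
for ($i = 0; $i < 5; $i++)
{
  for ($j = 0; $j < 100; $j++)
  {
    $socket = @stream_socket_client("tcp://www.example.com:80", $errno, $errstr, 5);
    if ($socket) fclose($socket);
  }
  echo "memory: ".round(memory_get_usage(true) / 1024)."kb\n";
}
?>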
Thank you very much for your help and the report in general!
PS: You may use phpcrawl 0.80 until the problem gets solved (somehow); it doesn't use that leaking function.
Here is the output of that test script:
memory: 618kb
memory: 619kb
memory: 619kb
memory: 619kb
memory: 619kb
I'm going to give 0.80 a try.
Thanks for all of your help!
Hm, seems that your version of PHP is not affected by this leak.
Let me know if v0.80 works for you.
Still running into the memory-limit error.
I am trying to crawl surgery.org, not php.net. I still only get through about 600-800 pages.
Shit ;)
Looks like another memory leak somewhere (or a problem with SQLite on your system).
It's really strange that this occurs after 600-800 pages; that never happened here, and I've never heard about anything similar before.
I'll take a closer look at it in a few days; I'm away for some days now.
Hopefully I can detect something when testing the crawler on that site.
Last night I tried the test interface and it just finished. Here is the output:
Process finished!
Links followed: 26761
KB received: 1229073
Data throughput (KB/s): 30
Files received: 14847
Time in sec: 40616.46
Max memory usage in KB: 340224.00
Hi,
sorry again for my (really) late answer, but I just couldn't find anything regarding another memory leak.
BUT: I just noticed that you let the crawler receive EVERY type of document in your script.
Maybe somewhere along the way the crawler tries to receive a huge file into local memory and hits the memory limit with that.
You should add $crawler->addContentTypeReceiveRule("#text/html#"); to your directives; this lets the crawler receive ONLY HTML documents.
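I.e. just add it next to the setUrlCacheType() call, something like this (only a sketch of what I mean):

// Only receive documents with content-type "text/html"
// (so huge binary files don't get pulled into memory).
$crawler->setUrlCacheType(PHPCrawlerUrlCacheTypes::URLCACHE_SQLITE);
$crawler->addContentTypeReceiveRule("#text/html#");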
Do you know the exact URL the crawler tried to process before it reached the memory-limit?