I am not sure if this is a bug:
Trying to unset() the crawler object after retrieving the URLs does not free the memory, even after calling gc_collect_cycles().
Tested against CentOS, PHP 5.4.4
echo"<br />Memory before creating crawler: ".number_format(memory_get_usage(true));//Runningcrawler$crawler=newPHPCrawler;$crawler->setURL('http://www.dmoz.org/');$crawler->setPageLimit(30);$crawler->go();echo"<br />Memory after running crawler: ".number_format(memory_get_usage(true));//clearingmemoryunset($crawler);gc_collect_cycles();echo"<br />Memory after destroying crawler: ".number_format(memory_get_usage(true));
result:
Memory before creating crawler: 524,288
Memory after running crawler: 5,242,880
Memory after destroying crawler: 5,242,880
I can confirm the memory leak. It's easily reproducible when overriding handleDocumentInfo() and doing nothing other than measuring the memory usage during runtime.
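A minimal sketch of such a test (my reconstruction; the class name, page limit and crawled URL are just placeholders):

// require("libs/PHPCrawler.class.php"); // adjust to your PHPCrawl path

class MemoryTestCrawler extends PHPCrawler
{
    // Do nothing except print the current memory usage for
    // every document the crawler hands over.
    function handleDocumentInfo(PHPCrawlerDocumentInfo $DocInfo)
    {
        echo number_format(memory_get_usage(true))." bytes\n";
    }
}

$crawler = new MemoryTestCrawler();
$crawler->setURL('http://www.dmoz.org/');
$crawler->setPageLimit(50);
$crawler->go();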
The effect stays the same whether or not gc_enable() is called.
I've just done a quick search over the lib source without digging into the code and found that there are no unset methods used in the classes, nor any calls to unset(). From my experience, calls to unset() in conjunction with gc_enable() work more or less reliably. As far as I know, PHP's "garbage collector" isn't able to recycle/destroy objects that have no reference left, which happens when the "mother" object is destroyed without destroying its "child" objects first. That could be another source for the memory leak.
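To illustrate what I mean (a generic sketch of my own, not code from the lib): a "mother" object can explicitly drop its "child" objects in a destructor, so no orphaned references are left behind:

class Child
{
    public $data;
    function __construct() { $this->data = str_repeat('x', 1000000); }
}

class Mother
{
    protected $child;

    function __construct() { $this->child = new Child(); }

    // Destroy child objects explicitly before the mother object
    // itself goes away.
    function __destruct() { unset($this->child); }
}

$m = new Mother();
unset($m);
gc_collect_cycles();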
-Frank
Hi kurtubba,
is it maybe related to this bug: http://sourceforge.net/tracker/?func=detail&aid=3579699&group_id=89439&atid=590146 ?
I'll add a comment to that bug report and will take a look soon.
Thanks for the report!
Hi Frank!
Thanks for your report and your suggestions!
What crawler-setup did you use for your test?
If you use the crawler out of the box without changing the URL-cache-type, it's pretty normal that the memory usage rises with every request because of all the links, cookies and other data that get added to the internal cache for every crawled page.
So for a test you should set $crawler->setUrlCacheType(PHPCrawlerUrlCacheTypes::URLCACHE_SQLITE).
Maybe you did.
Also, I think you should call gc_collect_cycles() before you call memory_get_usage().
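Put together, a sketch of the suggested test setup (based on the snippet from the first post):

$crawler = new PHPCrawler;
$crawler->setURL('http://www.dmoz.org/');
// Keep the URL-cache on disk instead of in memory:
$crawler->setUrlCacheType(PHPCrawlerUrlCacheTypes::URLCACHE_SQLITE);
$crawler->setPageLimit(30);
$crawler->go();

unset($crawler);
gc_collect_cycles(); // collect cycles BEFORE measuring
echo "Memory after destroying crawler: ".number_format(memory_get_usage(true));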
I can confirm too that there is a memory leak (depending on the PHP version and the OS used).
But it's pretty hard to track down. For instance, a lot of people using MacOS/iOS report a heavy memory leak, while others don't seem to have any problems, or maybe just a small leak.
The next version of phpcrawl will get released next week, I think, including some memory-usage tweaks and a fix for the mentioned leak in PHP's stream_socket_client() function. Hopefully it will do the job.
@Gadelkareem (first post): If you use $crawler = null instead of unset($crawler), do you still get the same result?
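I.e. something like this (sketch):

$crawler = null; // instead of unset($crawler)
gc_collect_cycles();
echo "Memory after nulling crawler: ".number_format(memory_get_usage(true));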