I am not sure if this is a bug:
Calling unset() on the crawler object after retrieving the URLs does not release the memory, even after calling gc_collect_cycles().
Tested on CentOS with PHP 5.4.4.
echo "<br />Memory before creating crawler: " . number_format(memory_get_usage(true));
$crawler = new PHPCrawler;
$crawler->setURL("http://www.example.com/"); // placeholder URL
$crawler->go();
echo "<br />Memory after running crawler: " . number_format(memory_get_usage(true));
unset($crawler);
gc_collect_cycles();
echo "<br />Memory after destroying crawler: " . number_format(memory_get_usage(true));
Memory before creating crawler: 524,288
Memory after running crawler: 5,242,880
Memory after destroying crawler: 5,242,880
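For background on why gc_collect_cycles() matters here: PHP's reference counting frees an object as soon as its last reference disappears, but objects that reference each other in a cycle survive unset() and are only reclaimed by the cycle collector. A minimal standalone sketch (plain PHP, not using PHPCrawl) illustrating this:

```php
<?php
// Two objects referencing each other form a cycle that plain
// reference counting cannot free once the outside references are gone.
class Node
{
    public $peer;    // reference to the other Node
    public $payload; // some memory to make the effect visible

    public function __construct()
    {
        $this->payload = str_repeat('x', 1024 * 1024); // ~1 MB
    }
}

$a = new Node();
$b = new Node();
$a->peer = $b;
$b->peer = $a;      // cycle: $a -> $b -> $a

unset($a, $b);      // refcounts stay above zero because of the cycle
$freedCycles = gc_collect_cycles(); // cycle collector breaks the cycle

echo "Collected by cycle collector: " . $freedCycles . "\n";
```

If a library keeps such cycles internally, unset() alone will not lower memory_get_usage() until the collector actually runs.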
Is it maybe related to this bug: http://sourceforge.net/tracker/?func=detail&aid=3579699&group_id=89439&atid=590146 ?
I will add a comment to that bug report and will take a look soon.
Thanks for the report!
I can confirm the memory leak. It's easily reproducible when overriding handleDocumentInfo() and doing nothing other than measuring the memory usage during runtime:
echo "Memory Usage: " . memory_get_usage() . "\n";
The effect stays the same no matter if gc_enable() is called or not.
I've just done a quick search over the lib source without digging into the code and found that there are no unset methods defined in the classes, nor any calls to unset(). In my experience, calls to unset() in combination with gc_enable() work more or less reliably. As far as I know, PHP's "garbage collector" isn't able to recycle/destroy objects that have no reference left, which happens when the "mother" object is destroyed without destroying its "child" objects first. That could be another source of the memory leak.
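A common workaround for the mother/child pattern described above is an explicit cleanup method that unsets the child objects before the parent itself is dropped, rather than relying on the cycle collector. A generic sketch (not PHPCrawl code; class and method names are made up for illustration):

```php
<?php
// Sketch: a "mother" object that explicitly releases its
// "child" objects instead of waiting for the garbage collector.
class Mother
{
    public $children = array();

    public function __construct($n)
    {
        for ($i = 0; $i < $n; $i++) {
            $this->children[] = new stdClass();
        }
    }

    // Explicit cleanup: drop every child reference so their memory
    // can be reclaimed immediately by plain reference counting.
    public function cleanup()
    {
        foreach (array_keys($this->children) as $k) {
            unset($this->children[$k]);
        }
    }
}

$m = new Mother(100);
$m->cleanup();
$remaining = count($m->children);
echo "Children left after cleanup: " . $remaining . "\n";
unset($m);
```

Calling such a cleanup method before unset() on the top-level object avoids depending on when (or whether) gc_collect_cycles() runs.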
Thanks for your report and your suggestions!
What crawler setup did you use for your test?
If you use the crawler out of the box without changing the URL-cache type, it's pretty normal that the memory usage rises with every request, because all the links, cookies and other data get added to the internal cache for every crawled page.
So for a test you should set $crawler->setUrlCacheType(PHPCrawlerUrlCacheTypes::URLCACHE_SQLITE).
Maybe you did.
Also, I think you should call gc_collect_cycles() before you call memory_get_usage().
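Putting those suggestions together, a test script along these lines might look like the following. This is a hypothetical sketch, not code from the project: the include path, URL, and page limit are placeholders, and it assumes a PHPCrawl installation providing PHPCrawler, setURL(), setUrlCacheType(), and go() as discussed in this thread.

```php
<?php
// Hypothetical test setup; adjust the require path to your installation.
require_once("libs/PHPCrawler.class.php");

class MemoryTestCrawler extends PHPCrawler
{
    function handleDocumentInfo($DocInfo)
    {
        // Do nothing except report memory, as in the test described above.
        gc_collect_cycles(); // collect cycles before measuring
        echo "Memory usage: " . memory_get_usage() . "\n";
    }
}

$crawler = new MemoryTestCrawler();
$crawler->setURL("http://www.example.com/"); // placeholder URL
// Keep the URL-cache on disk instead of in memory:
$crawler->setUrlCacheType(PHPCrawlerUrlCacheTypes::URLCACHE_SQLITE);
$crawler->go();
```

With the SQLite URL-cache, the per-page cache growth described above should stay on disk, so any remaining growth in memory_get_usage() points at an actual leak.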
I can confirm too that there is a memory leak (depending on the PHP version used and the OS).
But it's pretty hard to track down. For instance, a lot of people using MacOS/iOS report a heavy memory leak, while others don't seem to have any problems, or only a small one.
The next version of phpcrawl will get released next week, I think, including some memory-usage tweaks and a fix for the mentioned leak in PHP's stream_client_connect() function. Hopefully that will do the job.
@Gadelkareem (first post): If you use $crawler = null instead of unset($crawler), do you still get the same result?
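For what it's worth, $var = null and unset($var) both release the variable's reference, so a destructor fires either way once no references remain. A small self-contained sketch (plain PHP, not PHPCrawl) showing the two forms behave the same in the simple case:

```php
<?php
// Count how many objects are actually destroyed via each release style.
class Probe
{
    public static $destroyed = 0;

    public function __destruct()
    {
        self::$destroyed++; // runs when the last reference is released
    }
}

$a = new Probe();
$a = null;          // assigning null drops the reference

$b = new Probe();
unset($b);          // unset() drops the reference the same way

echo "Destroyed: " . Probe::$destroyed . "\n";
```

If the two forms give different memory readings with the crawler, that would point at references still held somewhere inside the object graph rather than at unset() itself.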