There seem to be some memory leaks. Take this config as an example: it stops at about 4000 symbols with an out-of-memory exception. This is just one sample; I have some scrapers running with 2 GB of RAM that still run out of memory. I think the leak is in the XML processing (html-to-xml, xpath, etc.).
<config>
    <var-def name="sectors">
        <html-to-xml>
            <http url="http://biz.yahoo.com/ic/ind_index.html"/>
        </html-to-xml>
    </var-def>
    <loop item="industryLink" index="i" filter="unique">
        <list>
            <xpath expression="//a[contains(@href,'industryindex')]/@href">
                <var name="sectors"/>
            </xpath>
        </list>
        <body>
            <var-def name="industry" overwrite="true">
                <xpath expression="//a[@href='${industryLink}']/../../../preceding-sibling::tr[1]/td/font/b/text()">
                    <var name="sectors"/>
                </xpath>
            </var-def>
            <var-def name="subindustry" overwrite="true">
                <xpath expression="//a[@href='${industryLink}']/text()">
                    <var name="sectors"/>
                </xpath>
            </var-def>
            <var-def name="index">
                <regexp>
                    <regexp-pattern>s=([^=]+)$</regexp-pattern>
                    <regexp-source>
                        <xpath expression="//a[contains(@href, '*http://finance.yahoo.com/q?s=^')]/@href">
                            <html-to-xml>
                                <http url="${industryLink}"/>
                            </html-to-xml>
                        </xpath>
                    </regexp-source>
                    <regexp-result><template>${_1}</template></regexp-result>
                </regexp>
            </var-def>
            <var-def name="companiesLink" overwrite="true">
                <regexp>
                    <regexp-pattern>(\d+)\.html$</regexp-pattern>
                    <regexp-source><var name="industryLink"/></regexp-source>
                    <regexp-result><template>http://biz.yahoo.com/ic/${_1}_cl_all.html</template></regexp-result>
                </regexp>
            </var-def>
            <var-def name="companies">
                <html-to-xml>
                    <http url="${companiesLink}"/>
                </html-to-xml>
            </var-def>
            <loop item="companyLink" index="i" filter="unique">
                <list>
                    <xpath expression="//a[contains(@href,'http://finance.yahoo.com/q?s=')]/@href">
                        <var name="companies"/>
                    </xpath>
                </list>
                <body>
                    <var-def name="company" overwrite="true">
                        <regexp>
                            <regexp-pattern>s=([^=]+)$</regexp-pattern>
                            <regexp-source><var name="companyLink"/></regexp-source>
                            <regexp-result><template>${_1}</template></regexp-result>
                        </regexp>
                    </var-def>
                    <!--print text="${industry}, ${subindustry}, ${index}, ${company}"/ -->
                </body>
            </loop>
        </body>
    </loop>
</config>
I'm not sure whether there is a problem in your code, but I had similar out-of-memory exceptions when looping over large amounts of data. I fixed that by enclosing everything in <empty> blocks wherever possible and by using <template> instead of <var>.
For example I had something like:
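The snippet posted at this point did not survive, so what follows is only a hypothetical sketch of the kind of construct meant, not the original code. It is modeled on the inner `companyLink` loop of the config above and assumes the `companies` variable is defined as it is there. The loop collects whatever its body returns, and the value is passed along with `<var>`:

```xml
<!-- Hypothetical sketch, not the original snippet.
     Assumes "companies" is defined as in the config above. -->
<loop item="companyLink" filter="unique">
    <list>
        <xpath expression="//a[contains(@href,'http://finance.yahoo.com/q?s=')]/@href">
            <var name="companies"/>
        </xpath>
    </list>
    <body>
        <!-- Whatever the body returns is collected by the loop. -->
        <var-def name="company" overwrite="true">
            <regexp>
                <regexp-pattern>s=([^=]+)$</regexp-pattern>
                <!-- <var> returns the stored variable as it is. -->
                <regexp-source><var name="companyLink"/></regexp-source>
                <regexp-result><template>${_1}</template></regexp-result>
            </regexp>
        </var-def>
    </body>
</loop>
```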
(It was more complicated than that, but that's enough for this example.)
Constructs like this used tons of memory because every <var> internally created variables (of type XmlNodeVariable) that were much bigger than I needed.
I've changed that to something like this:
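The changed snippet is missing here as well, so this is a hypothetical sketch of the approach described, not the original code. It reuses the inner `companyLink` loop of the config above and assumes `companies` is defined as it is there. The body is wrapped in `<empty>` so the loop has nothing to collect, and `<template>` supplies the value as text in place of `<var>`:

```xml
<!-- Hypothetical sketch, not the original snippet.
     Assumes "companies" is defined as in the config above. -->
<loop item="companyLink" filter="unique">
    <list>
        <xpath expression="//a[contains(@href,'http://finance.yahoo.com/q?s=')]/@href">
            <var name="companies"/>
        </xpath>
    </list>
    <body>
        <!-- <empty> runs its body but returns an empty value,
             so nothing accumulates across iterations. -->
        <empty>
            <var-def name="company" overwrite="true">
                <regexp>
                    <regexp-pattern>s=([^=]+)$</regexp-pattern>
                    <!-- <template> evaluates ${companyLink} and yields text. -->
                    <regexp-source><template>${companyLink}</template></regexp-source>
                    <regexp-result><template>${_1}</template></regexp-result>
                </regexp>
            </var-def>
        </empty>
    </body>
</loop>
```

The `<var name="companies"/>` inside `<list>` is left alone in this sketch, because `<xpath>` needs the XML content itself and not a text rendering of it.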
Memory usage dropped 4-5x.
Hope this helps you.
Last edit: Josip Maslać 2013-04-27