Hello,
I have been successfully using Web-Harvest for scraping for some time.
I recently discovered an issue when scraping multiple pages and processing them with XPath expressions.
I did some basic profiling, and apparently the class
org.webharvest.runtime.variables.NodeVariable is the one whose memory use grows with every downloaded page, eventually causing an OutOfMemory exception in the scraping process.
Maybe there is a way to fix this issue.
Kind regards,
Mile
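For anyone who wants to reproduce this kind of observation without a full profiler, here is a rough sketch that logs used heap after each page. It uses only the standard java.lang.Runtime API. The class name HeapLog and the loop in main are hypothetical placeholders, not part of Web-Harvest, and the numbers are only approximate because the garbage collector may not have run yet.

```java
public class HeapLog {

    // Approximate heap currently in use by this JVM, in bytes.
    public static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    // Build a one-line report for a given page index (hypothetical format).
    public static String format(int pageIndex, long usedBytes) {
        return "page " + pageIndex + ": " + (usedBytes / (1024 * 1024)) + " MB used";
    }

    public static void main(String[] args) {
        for (int page = 1; page <= 3; page++) {
            // Placeholder: download and process one page here.
            System.out.println(format(page, usedHeapBytes()));
        }
    }
}
```

If the reported value keeps climbing from page to page and never drops, that points to something holding on to per-page data.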
Mile,
I had some OutOfMemory exceptions too. Please see this post: "Way to ignore tags?" http://sourceforge.net/forum/forum.php?thread_id=1797536&forum_id=591299
See Vladimir's suggestions. If you are using Java 1.4, try upgrading to 1.6 (it helped me).
Steve
Hi Steve,
I have read the thread and I am using Java 1.6 rev 2. It is with this VM that I have the problem.
The workaround I used was to divide the scraping into batches.
Kind regards,
Mile
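In case it helps others, the batching workaround mentioned above could look roughly like the sketch below. This is only an illustration under assumptions: BatchRunner, partition, runInBatches and the example URLs are made-up names, the code needs Java 8 or later for the lambda, and the callback is a placeholder for whatever starts one scraping run. The idea is that each batch gets its own run, so anything accumulated during the previous batch can be garbage collected once that run is finished.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class BatchRunner {

    // Split a list into consecutive chunks of at most batchSize elements.
    public static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < items.size(); i += batchSize) {
            int end = Math.min(i + batchSize, items.size());
            batches.add(new ArrayList<>(items.subList(i, end)));
        }
        return batches;
    }

    // Hand each batch to the callback in order. The callback is a placeholder
    // for one complete scraping run (assumption: a fresh scraper per batch).
    public static void runInBatches(List<String> urls, int batchSize,
                                    Consumer<List<String>> scrapeBatch) {
        for (List<String> batch : partition(urls, batchSize)) {
            scrapeBatch.accept(batch);
        }
    }

    public static void main(String[] args) {
        List<String> urls = new ArrayList<>();
        for (int i = 1; i <= 7; i++) {
            urls.add("http://example.com/page" + i); // hypothetical URLs
        }
        runInBatches(urls, 3,
                batch -> System.out.println("Scraping " + batch.size() + " pages: " + batch));
    }
}
```

The batch size is something to tune by trial: small enough that one run stays within the heap, large enough that the per-run startup cost does not dominate.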