Performance?

2007-04-22
2012-09-04
  • Marcin Okraszewski

    Can anyone estimate the overhead of Web-Harvest compared to the same functionality written in Java?

    I was running a simple harvest - two XPaths on every page of a site, plus an XPath to pick all "//a/@href" elements. It turned out that the Web-Harvest process loaded my Sempron 2800+ at 50-60%! I wonder how much it would be if I wrote the same harvesting function myself in Java. Has anyone done this kind of comparison?

    BTW, there is a bug in the crawling example. The set of pages to visit is overwritten after every page, so only the links from the last page are kept and the crawl just walks a single path from the root down to a leaf. To fix it:

    1. Replace newLinks.add(fullLink) with unvisited.add(fullLink);
    2. Replace SetContextVar("unvisitedVar", newLinks) with SetContextVar("unvisitedVar", unvisited);

    I'm not sure whether step 2 could be skipped - is unvisitedVar a copy of unvisited, or a reference to it?
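    The effect of the fix can be sketched in plain Java (a minimal stand-in crawler, not Web-Harvest's actual code; the page names and the site map below are hypothetical):

    ```java
    import java.util.*;

    public class CrawlQueueSketch {
        // Hypothetical page -> outgoing-links map standing in for fetched pages.
        static Map<String, List<String>> site = Map.of(
            "root", List.of("p1", "p2"),
            "p1", List.of("p3"),
            "p2", List.of(),
            "p3", List.of());

        public static void main(String[] args) {
            Set<String> visited = new LinkedHashSet<>();
            Set<String> unvisited = new LinkedHashSet<>(List.of("root"));
            while (!unvisited.isEmpty()) {
                String page = unvisited.iterator().next();
                unvisited.remove(page);
                visited.add(page);
                for (String link : site.get(page)) {
                    // The buggy version built a fresh newLinks set per page and then
                    // overwrote the context variable with it, dropping links gathered
                    // from earlier pages. Accumulating into the shared set fixes it.
                    if (!visited.contains(link)) {
                        unvisited.add(link);
                    }
                }
            }
            System.out.println(visited);
        }
    }
    ```

    With the shared set, every reachable page (root, p1, p2, p3) gets visited; with the per-page overwrite, p2 would be lost as soon as p1's links replace the set.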

    Thanks,
    Marcin

    • Marcin Okraszewski

      I don't have good news :( I made a test with equivalent functionality written in Java. The program was not only noticeably faster, but its CPU utilization was also at 5-15% (most of the time around 8-10%), while with Web-Harvest it was mostly 50-60%.

      I didn't pay much attention to making the Java implementation efficient. In particular, XPath expressions were created in every loop iteration, as is done in Web-Harvest. The links were also extracted with the "//a/@href" XPath. I also tried Document.getElementsByTagName("a") and it had about the same CPU utilization, maybe a bit more stable.
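      For reference, extracting links with the standard Java 6 XPath API looks roughly like this (a self-contained sketch; the sample HTML is made up, and a real page would first need to be cleaned into well-formed XML, e.g. with HtmlCleaner):

      ```java
      import javax.xml.parsers.DocumentBuilderFactory;
      import javax.xml.xpath.*;
      import org.w3c.dom.*;
      import java.io.ByteArrayInputStream;

      public class XPathLinks {
          public static void main(String[] args) throws Exception {
              String html = "<html><body><a href=\"a.html\">A</a>"
                          + "<a href=\"b.html\">B</a></body></html>";
              Document doc = DocumentBuilderFactory.newInstance()
                      .newDocumentBuilder()
                      .parse(new ByteArrayInputStream(html.getBytes("UTF-8")));
              // Compiled once here; re-creating the expression for every page,
              // as in the test described above, adds avoidable overhead.
              XPathExpression links = XPathFactory.newInstance()
                      .newXPath().compile("//a/@href");
              NodeList hrefs = (NodeList) links.evaluate(doc, XPathConstants.NODESET);
              for (int i = 0; i < hrefs.getLength(); i++) {
                  System.out.println(hrefs.item(i).getNodeValue());
              }
          }
      }
      ```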

      I really value the power of Web-Harvest - it is a really impressive tool. But for my rather simple use case performance plays an important part, so I will need to implement the harvesting on my own :( That's a real pity; I liked the power of Web-Harvest and the ease of writing complex harvesting rules.

      Test environment:
      CPU: AMD Sempron 2800+
      Inet connection: 1.5 Mbps
      Java 1.6 u1 on Linux.
      Default memory settings (no -Xmx or something like this).

      Some differences:
      HTTP client: Apache HttpClient for Web-Harvest; for my Java test it was URL.openStream().
      HTML to XML: HtmlCleaner for both; my Java test used 1.13, Web-Harvest uses an older version.
      XPath: the Java test used the standard Java 6.0 XPath engine; Web-Harvest uses something else.

      I know this is not a really serious test, rather an overview. But I hope it might be useful to someone.

      Best regards,
      Marcin Okraszewski

      • Vladimir Nikic - 2007-04-24

        Thanks for your analysis. I would appreciate it if you could send me the part of your configuration that does the XPath, so that I can check whether performance could be improved. Basically, Web-Harvest works as a pipeline - one processor sends its result to the next, in most cases as a string - that's why the parsing is done every time.

        Regards, Vladimir.

    • Marcin Okraszewski

      Just one hint: as I looked into the code, it looks like the XML document is converted to string form and parsed again several times. I suppose this might be where some of the power is going, apart from the interpretation of the configuration itself.
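      The cost of such round-trips can be illustrated with a small sketch (not Web-Harvest's actual pipeline code - just standard JAXP serialize-and-reparse in a loop, standing in for stages that pass documents as strings):

      ```java
      import javax.xml.parsers.DocumentBuilderFactory;
      import javax.xml.transform.TransformerFactory;
      import javax.xml.transform.dom.DOMSource;
      import javax.xml.transform.stream.StreamResult;
      import org.w3c.dom.Document;
      import java.io.ByteArrayInputStream;
      import java.io.StringWriter;

      public class ReparseCost {
          // Serialize a DOM document back to its XML string form.
          static String serialize(Document doc) throws Exception {
              StringWriter out = new StringWriter();
              TransformerFactory.newInstance().newTransformer()
                      .transform(new DOMSource(doc), new StreamResult(out));
              return out.toString();
          }

          static Document parse(String xml) throws Exception {
              return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                      .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
          }

          public static void main(String[] args) throws Exception {
              Document doc = parse("<root><item>x</item></root>");
              long start = System.nanoTime();
              // A string-based pipeline forces every stage to serialize and
              // re-parse; passing the DOM object directly would skip all of this.
              for (int i = 0; i < 1000; i++) {
                  doc = parse(serialize(doc));
              }
              long elapsedMs = (System.nanoTime() - start) / 1_000_000;
              System.out.println("1000 serialize+parse round-trips: " + elapsedMs + " ms");
          }
      }
      ```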
