WebHarvest - web data extraction tool News

Status: Beta

Brought to you by: rbala, vnikic

Inception phase for 2.2 version

After a short break, work on the new 2.2 version is about to resume.
We do not plan to add any new features; instead we will concentrate on bug fixes, small improvements to existing features, and major changes to the system architecture.
To cut a long story short, version 2.2 will cover everything that was not delivered with version 2.1.
Depending on the number of defects reported against 2.1, we may decide to release a 2.1.1 bug-fix version.
However, since most of us have new jobs, the work will proceed more slowly.
In the coming days I am going to prepare a draft overview of the changes planned for 2.2 and of what actually changed in version 2.1.
It's nice to be back...

Posted on 2013-03-03

Web-Harvest 1.0 released

A GUI is introduced.
The html-to-xml processor exposes attributes for controlling the cleaner's behaviour.
More scripting languages and features are supported.
Access to HttpClient at runtime is supported.
A number of other improvements and fixes.

Posted on 2007-10-17

SVN support is added.

The latest source can be checked out from https://web-harvest.svn.sourceforge.net/svnroot/web-harvest
The source can be browsed at http://web-harvest.svn.sourceforge.net/viewvc/web-harvest/

Posted on 2007-04-16

Web-Harvest 0.5 released

* The html-to-xml parser has changed: HtmlCleaner is used instead of TagSoup.
* A script processor is introduced.
* The template processor is now based on BeanShell instead of OGNL.
* Types are introduced in XQuery parameters.
* A few new constructors are added to the ScraperConfiguration class.
* The file and include processors now support both relative and absolute paths.
* Web-Harvest variables are case-sensitive from this version on.

Posted on 2007-01-16

Web-Harvest 0.3 released

HTTP authentication supported: two new optional attributes, username and password, added to the http processor.
URL encoding bug fixed: the special character # is no longer encoded.
HTML cleanup fixed: default attributes are no longer created if they don't exist in the original XML.
Examples adjusted; all are functional again.

Posted on 2006-10-27

Web-Harvest 0.261 released

A minor bug that caused a ClassNotFound exception on the command line is now fixed.

Posted on 2006-10-12

Web-Harvest 0.26 released

URL encoding bug fixed.

Posted on 2006-09-28

Web-Harvest 0.25 released

Support for HTTP proxy credentials added.

Posted on 2006-09-22

Web-Harvest 0.24 released

Bug with relative redirection URLs fixed.

Posted on 2006-09-13

Web-Harvest 0.23 released

Support for HTTPS pages with self-signed certificates added.

Posted on 2006-09-07

Web-Harvest 0.22 released

Support for HTTP proxies added.

Posted on 2006-09-06

Web-Harvest 0.21 released

Circular redirection in HTTP client enabled.

Posted on 2006-09-06

Web-Harvest licence changed to BSD

Web-Harvest licence changed to BSD.

Posted on 2006-09-04

Web-Harvest v0.2 released

Web-Harvest is an open source web data extraction tool written in Java. It offers a way to collect desired web pages and extract useful data from them. To do that, it leverages well-established techniques and technologies for text/XML manipulation such as XSLT, XQuery and regular expressions.

Visit Web-Harvest home page: http://web-harvest.sourceforge.net

Posted on 2006-08-31