After a short break, work on the new 2.2 version is going to continue.
We plan not to add any new features but to concentrate on bug fixes, small improvements to existing features, and major changes in the system architecture.
To cut a long story short, version 2.2 will cover everything that was not delivered with version 2.1.
Depending on the number of defects reported against 2.1, we may decide to release a 2.1.1 bug-fix version.
However, since most of us have new jobs, the work will be slower.
In the following days I'm going to prepare a draft overview of the changes that will be introduced in 2.2 and of what actually changed in version 2.1.
It's nice to be back...
* The HTML-to-XML parser has changed - HtmlCleaner is used instead of TagSoup.
* A script processor is introduced.
* The template processor is now based on BeanShell instead of OGNL.
* Types are introduced in XQuery parameters.
* A few new constructors are added to the ScraperConfiguration class.
* The file and include processors now support both relative and absolute paths.
* Web-Harvest variables are case-sensitive as of this version.
HTTP authentication is supported - two new optional attributes, username and password, are added to the http processor.
URL encoding bug fixed: the special character # is no longer encoded.
HTML cleanup fixed - default attributes are no longer created if they don't exist in the original XML.
Examples have been adjusted and are all functional again.
A minor bug that caused a ClassNotFound exception on the command line is now fixed.
Support for HTTPS pages with self-signed certificates added.
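
To illustrate why the # fix in the list above matters, here is a minimal Java sketch using only the standard library. It is not Web-Harvest's own code; the class name, the method names and the example.com URL are made up for illustration. It contrasts encoding a whole URL string, which escapes # as %23 and so destroys the fragment separator, with building the URL from its components, which leaves # in place.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLEncoder;

// Hypothetical illustration class; not part of Web-Harvest.
public class FragmentEncodingSketch {

    // Encodes the entire URL as if it were a single form value.
    // Every reserved character is escaped, including '#', so the
    // result can no longer be used as a URL with a fragment.
    static String encodeWholeUrl(String url) throws UnsupportedEncodingException {
        return URLEncoder.encode(url, "UTF-8");
    }

    // Builds the URL from separate components. java.net.URI quotes
    // only the characters that are illegal inside each component
    // (the space in the path here) and inserts '#' itself as the
    // fragment separator.
    static String encodeByComponent(String scheme, String authority,
                                    String path, String query, String fragment)
            throws URISyntaxException {
        return new URI(scheme, authority, path, query, fragment).toASCIIString();
    }

    public static void main(String[] args) throws Exception {
        // '#' becomes %23 and is lost as a separator
        System.out.println(encodeWholeUrl("http://example.com/my page#top"));
        // '#' is kept, only the space is escaped
        System.out.println(encodeByComponent("http", "example.com", "/my page", null, "top"));
    }
}
```

The second call keeps the part after # as a fragment, which is the behaviour the fix describes; how Web-Harvest achieves this internally is not shown here.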
Web-Harvest is an Open Source Web Data Extraction tool written in Java. It offers a way to collect desired Web pages and extract useful data from them. To do that, it leverages well-established techniques and technologies for text/XML manipulation such as XSLT, XQuery and Regular Expressions.
Visit Web-Harvest home page: http://web-harvest.sourceforge.net