WebHarvest ships with its own Saxon build. It seems to be v.9, but I
wasn't able to find an identical build on the Saxon web site or in any public
Maven repo. What I did for myself was take the saxon9.jar available in the
WebHarvest2 sources and manually install it into my local Maven repo.
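The exact command did not survive in this post. A minimal sketch of such a manual install, assuming the jar sits at lib/saxon9.jar in the unpacked WebHarvest2 project (an assumed path) and reusing the coordinates net.sf.saxon:saxon:9 from the saxon dependency in the pom.xml in this post, might look like this:

```shell
# Coordinates copied from the saxon <dependency> in the pom.xml in this post.
GROUP_ID=net.sf.saxon
ARTIFACT_ID=saxon
VERSION=9
# Assumed location of the bundled jar; adjust to wherever you unpacked the sources.
SAXON_JAR=lib/saxon9.jar

# Only call Maven when both the tool and the jar are actually present.
if command -v mvn >/dev/null 2>&1 && [ -f "$SAXON_JAR" ]; then
  # install:install-file copies a plain jar into the local repository (~/.m2).
  mvn install:install-file \
    -Dfile="$SAXON_JAR" \
    -DgroupId="$GROUP_ID" \
    -DartifactId="$ARTIFACT_ID" \
    -Dversion="$VERSION" \
    -Dpackaging=jar
else
  echo "mvn or $SAXON_JAR not found; nothing installed" >&2
fi
```

The coordinates only have to match whatever the pom.xml declares, so if you pick a different version string here, use the same one in the dependency.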
Don't try to build WebHarvest2 from sources using the pom.xml from trunk; it's
far from complete. I wrote my own based on that one and I'd love to
share it, but unfortunately I haven't received any reply from the developers.
The project seems to be abandoned :(
Here's my pom.xml to build WebHarvest2 from svn trunk:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>net.sourceforge.web-harvest</groupId>
  <artifactId>web-harvest</artifactId>
  <version>2.0.0-SNAPSHOT</version>
  <description>Open Source Web Data Extraction tool written in Java. It offers a way to collect desired Web pages and
    extract useful data from them.
  </description>
  <url>http://web-harvest.sourceforge.net/</url>
  <inceptionYear>2006</inceptionYear>
  <developers>
    <developer>
      <id>vnikic</id>
      <name>Vladimir Nikic</name>
      <roles>
        <role>Project Admin</role>
        <role>Developer</role>
      </roles>
    </developer>
  </developers>
  <licenses>
    <license>
      <name>BSD License</name>
      <url>http://www.opensource.org/licenses/bsd-license.php</url>
      <distribution>repo</distribution>
      <comments>OWNER = Vladimir Nikic
        YEAR = 2006-2007
      </comments>
    </license>
  </licenses>
  <scm>
    <url>http://web-harvest.svn.sourceforge.net/</url>
  </scm>
  <build>
    <sourceDirectory>src</sourceDirectory>
    <resources>
      <resource>
        <directory>src</directory>
        <includes>
          <include>org/webharvest/gui/resources/**/*</include>
        </includes>
      </resource>
      <resource>
        <directory>licences</directory>
        <targetPath>META-INF</targetPath>
        <includes>
          <include>**/*</include>
        </includes>
      </resource>
    </resources>
    <plugins>
      <plugin>
        <artifactId>maven-jar-plugin</artifactId>
        <configuration>
          <archive>
            <manifestFile>config/MANIFEST.MF</manifestFile>
          </archive>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <configuration>
          <source>1.5</source>
          <target>1.6</target>
          <encoding>UTF-8</encoding>
          <optimize>true</optimize>
          <excludes>
            <exclude>Test.java</exclude>
          </excludes>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-source-plugin</artifactId>
        <executions>
          <execution>
            <id>attach-sources</id>
            <goals>
              <goal>jar</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
  <dependencies>
    <dependency>
      <groupId>net.sourceforge.htmlcleaner</groupId>
      <artifactId>htmlcleaner</artifactId>
      <version>2.1</version>
    </dependency>
    <dependency>
      <groupId>org.beanshell</groupId>
      <artifactId>bsh</artifactId>
      <version>2.0b4</version>
    </dependency>
    <dependency>
      <groupId>commons-codec</groupId>
      <artifactId>commons-codec</artifactId>
      <version>1.4</version>
    </dependency>
    <dependency>
      <groupId>commons-collections</groupId>
      <artifactId>commons-collections</artifactId>
      <version>3.2.1</version>
    </dependency>
    <dependency>
      <groupId>commons-httpclient</groupId>
      <artifactId>commons-httpclient</artifactId>
      <version>3.1</version>
    </dependency>
    <dependency>
      <groupId>commons-logging</groupId>
      <artifactId>commons-logging</artifactId>
      <version>1.1.1</version>
    </dependency>
    <dependency>
      <groupId>org.apache.commons</groupId>
      <artifactId>commons-email</artifactId>
      <version>1.2</version>
    </dependency>
    <dependency>
      <groupId>commons-net</groupId>
      <artifactId>commons-net</artifactId>
      <version>2.0</version>
    </dependency>
    <dependency>
      <groupId>commons-cli</groupId>
      <artifactId>commons-cli</artifactId>
      <version>1.2</version>
    </dependency>
    <dependency>
      <groupId>log4j</groupId>
      <artifactId>log4j</artifactId>
      <version>1.2.16</version>
    </dependency>
    <dependency>
      <groupId>org.codehaus.groovy</groupId>
      <artifactId>groovy-all</artifactId>
      <version>1.7.4</version>
    </dependency>
    <dependency>
      <groupId>rhino</groupId>
      <artifactId>js</artifactId>
      <version>1.7R2</version>
    </dependency>
    <dependency>
      <groupId>jboss</groupId>
      <artifactId>jnet</artifactId>
      <version>3.2.1</version>
    </dependency>
    <dependency>
      <groupId>net.sf.saxon</groupId>
      <artifactId>saxon</artifactId>
      <version>9</version>
    </dependency>
  </dependencies>
  <repositories>
    <repository>
      <id>ibiblio.org</id>
      <name>ibiblio</name>
      <url>http://mirrors.ibiblio.org/pub/mirrors/maven2/</url>
    </repository>
  </repositories>
</project>
I am not the main developer, but I have been using WH for some time. I created my
own version of the code, which is much more efficient than the one shipped with
the main release. When I sent it to the developer for inclusion in the next
release, it was pretty much ignored. It is an open source project, and as such you
are free to make any changes in your own repo, and that could work for you. As
for your question, I believe you can use the latest Saxon jar from the Saxon project
(not the one bundled with WH). In fact it is much better and fixes some bugs that I
experienced earlier.
Regards,
Ed
After adding a lot of manual Maven installs (and indeed going for the latest
version of Saxon: 9), my error has changed to another one, at runtime:
java.lang.NoClassDefFoundError: net/sf/saxon/trans/XPathException
This occurs before the HTML load.
Did you use the exact saxon9.jar from the WebHarvest2 sources? As I said, any
other Saxon builds available on their web site or in public Maven repos will not
work. You need to use that particular saxon9.jar located in the /lib directory
here: http://web-harvest.sourceforge.net/download/webharvest2b1-project.zip
I have now noticed another dependency in my pom.xml, on saxon-xom. I removed it,
and now I get further again.
The new issue is: Exception in thread "main" java.lang.NoSuchMethodError:
org.htmlcleaner.HtmlCleaner: method <init>()V not found, occurring after the
HTML read.
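Errors like NoSuchMethodError and NoClassDefFoundError usually point to a different version of a library on the classpath than the one the code was compiled against. One way to check which Saxon and HtmlCleaner artifacts a project really pulls in is Maven's dependency tree. The sketch below assumes it is run from the directory containing your pom.xml; the filter simply lists the groupIds that appear in the poms in this thread:

```shell
# groupIds to filter on, taken from the pom.xml files quoted in this thread.
INCLUDES="net.sf.saxon,org.htmlcleaner,net.sourceforge.htmlcleaner"

# Only call Maven when the tool and a pom.xml are actually present.
if command -v mvn >/dev/null 2>&1 && [ -f pom.xml ]; then
  # Prints the resolved dependency tree, restricted to the artifacts above,
  # so stray entries such as saxon-xom or an old htmlcleaner stand out.
  mvn dependency:tree -Dincludes="$INCLUDES"
else
  echo "mvn or pom.xml not found; nothing to inspect" >&2
fi
```

If more than one Saxon or HtmlCleaner artifact shows up, removing or excluding the unwanted one is the usual next step.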
I have a config XML that runs in the GUI (webharvest_all_2.jar).
However, in my test application, I get a NoSuchMethodError:
net.sf.saxon.query.QueryResult.serialize.
This is on the line:
v=(Variable)(scraper.getContext().get("OpportunityItems"));
What am I doing wrong? Are there important differences between the GUI and the Java libs?
I use Eclipse, Maven, and WebHarvest 2.
NB, my pom.xml is:
<dependencies>
  <dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>3.8.1</version>
    <scope>test</scope>
  </dependency>
  <dependency>
    <groupId>org.webharvest.wso2</groupId>
    <artifactId>webharvest-core</artifactId>
    <version>2.0</version>
    <type>jar</type>
    <scope>compile</scope>
  </dependency>
  <dependency>
    <groupId>net.sf.saxon</groupId>
    <artifactId>saxon-xom</artifactId>
    <version>8.7</version>
  </dependency>
  <dependency>
    <groupId>org.htmlcleaner</groupId>
    <artifactId>htmlcleaner</artifactId>
    <version>1.55</version>
  </dependency>
  <dependency>
    <groupId>bsh</groupId>
    <artifactId>bsh</artifactId>
    <version>1.3.0</version>
  </dependency>
  <dependency>
    <groupId>commons-httpclient</groupId>
    <artifactId>commons-httpclient</artifactId>
    <version>3.1</version>
  </dependency>
  <dependency>
    <groupId>log4j</groupId>
    <artifactId>log4j</artifactId>
    <version>1.2.15</version>
  </dependency>
</dependencies>
<repositories>
  <repository>
    <id>org.webharvest.wso2</id>
    <name>Web Harvest Core</name>
    <url>http://dist.wso2.org/maven2/</url>
    <snapshots><enabled>false</enabled></snapshots>
  </repository>
</repositories>
Logging is working and gives no errors.
Thanks already
This is indeed the one I am using.
Please look at the pom.xml from my first post here and use it as a
reference. HtmlCleaner should be version 2.1.
The POM in trunk is fixed.
To build Web-Harvest use one of the following Maven commands:
to build without external dependencies (handy for embedding into other
projects)
-OR-
with all dependencies (to run stand-alone)