Hi
I'm trying to parse the following HTML, but it seems to crash Web-Harvest every time.
<html>
<head>
</head>
<body>
<b>DNS Checker Home Page</b>
<table border="2">
<tr><td>Col1</td><td>Col2</td></tr>
<tr><td>100</td><td>101</td></tr>
<tr><td>200</td><td>201</td></tr>
</table>
<br>
</body>
</html>
The script file I'm using is:
<?xml version="1.0" encoding="UTF-8"?>
<config charset="ISO-8859-1">
</config>
It fails with the following errors due to the XPath expression:
[] 2006-10-19 10:12:23,828 () BaseProcessor INFO HtmlToXmlProcessor processor executed in 203ms. [main]
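java.lang.reflect.InvocationTargetException
IWAV0052E Invocation Target Exception creating com.dnsapp.service.webparser.ParseTestPage
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.eclipse.ve.internal.java.vce.launcher.remotevm.JavaBeansLauncher.main(JavaBeansLauncher.java:79)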
Caused by: java.lang.NullPointerException
at net.sf.saxon.event.ReceivingContentHandler.getNameCode(ReceivingContentHandler.java:299)
at net.sf.saxon.event.ReceivingContentHandler.startElement(ReceivingContentHandler.java:234)
at org.gjt.xpp.sax2.Driver.parseSubTree(Driver.java:362)
I did some investigation by taking out the above XPath expression, so that it creates the out.xml file, detailed below.
What I've managed to conclude so far is that the <html-xml> construct produces XML output which contains colspan and rowspan attributes
(note these were not in the original HTML file). If I take out these attributes (e.g. colspan="1"), it seems to parse the HTML using the XPath expression OK.
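The manual clean-up described above (removing colspan="1" and rowspan="1" from the generated XML by hand) could be scripted as a stopgap. The following is only a minimal sketch using the JDK's regex classes. The class name DefaultSpanStripper and the method strip are hypothetical and are not part of Web-Harvest or TagSoup, and the sketch does not explain or fix the underlying NullPointerException.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Web-Harvest: removes the default-valued
// colspan="1" and rowspan="1" attributes from an XML string.
public class DefaultSpanStripper {

    // Matches leading whitespace followed by colspan="1" or rowspan="1".
    // Values other than exactly "1" (e.g. colspan="2" or colspan="10") are left alone.
    private static final Pattern DEFAULT_SPAN =
            Pattern.compile("\\s+(?:colspan|rowspan)=\"1\"");

    public static String strip(String xml) {
        return DEFAULT_SPAN.matcher(xml).replaceAll("");
    }

    public static void main(String[] args) {
        // One row taken from the generated XML shown in this post.
        String row = "<tr><td colspan=\"1\" rowspan=\"1\">Col1</td>"
                   + "<td colspan=\"1\" rowspan=\"1\">Col2</td></tr>";
        System.out.println(strip(row));
    }
}
```

A regex is used here only because the generated attributes have a fixed, predictable form; for arbitrary XML a real parser would be the safer choice.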
I guess my questions are:
a) Why does <html-xml> add colspan="1" and rowspan="1" when they are not actually in the original HTML file?
b) Why does removing the colspan and rowspan attributes seem to fix the problem?
c) Is there a way of stopping it from generating these colspan and rowspan attributes in the parsed HTML?
Thanks
Bill
XML produced by the <html-xml> expression
<?xml version="1.0" standalone="yes"?>
<html version="-//W3C//DTD HTML 4.01 Transitional//EN"><head></head><body>
<b>DNS Checker Home Page</b>
<table border="2"><tr><td colspan="1" rowspan="1">Col1</td><td colspan="1" rowspan="1">Col2</td></tr><tr><td colspan="1" rowspan="1">100</td><td colspan="1" rowspan="1">101</td></tr><tr><td colspan="1" rowspan="1">200</td><td colspan="1" rowspan="1">201</td></tr></table>
<br clear="none"></br></body></html>
I've already received some complaints about HTML cleaning. The problem is in TagSoup, the dependent library responsible for transforming HTML to XML. I'll probably replace it with something else (perhaps JTidy). I expect to fix it within the next 5-10 days.
I wondered whether a fix is available for this yet.
Many Thanks
Bill
Fixed now in Web-Harvest 0.3
Excellent!
Bill