WebHarvest - web data extraction tool / Discussion / Help: Cannot parse simple HTML page

I'm trying to parse the following HTML, but it seems to crash Web-Harvest every time.

<b>DNS Checker Home Page</b>
<table border="2">
<tr><td>Col1</td><td>Col2</td></tr>
<tr><td>100</td><td>101</td></tr>
<tr><td>200</td><td>201</td></tr>

</table>

The script file I'm using is:

<?xml version="1.0" encoding="UTF-8"?>

<config>

<include path="functions.xml"/>

<file action="write" path="out.xml">
<xpath expression="//body">
 <html-to-xml>
        <http url="http://localhost:8080/dns/app/ping/"/>
 </html-to-xml>
</xpath>
</file>

</config>

It fails with the following errors, which are caused by the xpath expression:

[] 2006-10-19 10:12:23,828 () BaseProcessor INFO HtmlToXmlProcessor processor executed in 203ms. [main]
java.lang.reflect.InvocationTargetException
IWAV0052E Invocation Target Exception creating com.dnsapp.service.webparser.ParseTestPage

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.eclipse.ve.internal.java.vce.launcher.remotevm.JavaBeansLauncher.main(JavaBeansLauncher.java:79)

Caused by: java.lang.NullPointerException
at net.sf.saxon.event.ReceivingContentHandler.getNameCode(ReceivingContentHandler.java:299)
at net.sf.saxon.event.ReceivingContentHandler.startElement(ReceivingContentHandler.java:234)
at org.gjt.xpp.sax2.Driver.parseSubTree(Driver.java:362)

I investigated by taking out the xpath expression above, so that the script just writes the out.xml file (its contents are shown at the end of this post).

What I've managed to conclude so far is that the <html-to-xml> construct produces XML output which contains colspan and rowspan attributes (note that these were not in the original HTML file). If I take out these attributes (e.g. colspan="1"), the xpath expression seems to parse the HTML OK.
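
For reference, the manual clean-up described above can be sketched in plain Java. This is only a rough illustration of stripping the default-valued attributes from the generated XML text before it is handed to xpath; the class and method names (SpanAttributeStripper, stripDefaultSpans) are made up for this post and are not part of Web-Harvest, and it is not a fix for the underlying NullPointerException.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Web-Harvest: removes colspan="1" and
// rowspan="1" attributes from XML text, mirroring the manual edit above.
public class SpanAttributeStripper {

    // Matches colspan="1" or rowspan="1" together with the whitespace before it.
    private static final Pattern DEFAULT_SPAN =
            Pattern.compile("\\s+(?:colspan|rowspan)=\"1\"");

    public static String stripDefaultSpans(String xml) {
        return DEFAULT_SPAN.matcher(xml).replaceAll("");
    }

    public static void main(String[] args) {
        String row = "<tr><td colspan=\"1\" rowspan=\"1\">Col1</td></tr>";
        // Attributes with any value other than "1" are left untouched.
        System.out.println(stripDefaultSpans(row));
    }
}
```

A regular expression is used here only because the attributes always appear in the same literal form in the output shown below; it would not be a safe way to edit arbitrary XML.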

I guess my questions are:

a) Why does <html-to-xml> add colspan="1" and rowspan="1" when they are not actually in the original HTML file?
b) Why does removing the colspan and rowspan attributes seem to fix the problem?
c) Is there a way of stopping it from generating these colspan and rowspan attributes in the parsed HTML?

Thanks

Bill

XML produced by the <html-to-xml> construct:

<?xml version="1.0" standalone="yes"?>

<b>DNS Checker Home Page</b>
<table border="2"><tr><td colspan="1" rowspan="1">Col1</td><td colspan="1" rowspan="1">Col2</td></tr><tr><td colspan="1" rowspan="1">100</td><td colspan="1" rowspan="1">101</td></tr><tr><td colspan="1" rowspan="1">200</td><td colspan="1" rowspan="1">201</td></tr></table>
