Cannot parse simple HTML page

Help
Bill
2006-10-19
2012-09-04
  • Bill

    Bill - 2006-10-19

    Hi

    I'm trying to parse the following HTML, but it seems to crash web-harvest every time.

    <html>
    <head>
    </head>
    <body>

    <b>DNS Checker Home Page</b>
    <table border="2">
    <tr><td>Col1</td><td>Col2</td></tr>
    <tr><td>100</td><td>101</td></tr>
    <tr><td>200</td><td>201</td></tr>

    </table>

    <br>
    </body>
    </html>

    the script file I'm using is

    <?xml version="1.0" encoding="UTF-8"?>

    <config charset="ISO-8859-1">

    <include path="functions.xml"/>

    <file action="write" path="out.xml">
        <xpath expression="//body">
            <html-to-xml>
                <http url="http://localhost:8080/dns/app/ping/"/>
            </html-to-xml>
        </xpath>
    </file>
    

    </config>

    It fails with the following errors, apparently because of the xpath expression

    [] 2006-10-19 10:12:23,828 () BaseProcessor INFO HtmlToXmlProcessor processor executed in 203ms. [main]
    java.lang.reflect.InvocationTargetException
    IWAV0052E Invocation Target Exception creating com.dnsapp.service.webparser.ParseTestPage

    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at org.eclipse.ve.internal.java.vce.launcher.remotevm.JavaBeansLauncher.main(JavaBeansLauncher.java:79)
    

    Caused by: java.lang.NullPointerException
    at net.sf.saxon.event.ReceivingContentHandler.getNameCode(ReceivingContentHandler.java:299)
    at net.sf.saxon.event.ReceivingContentHandler.startElement(ReceivingContentHandler.java:234)
    at org.gjt.xpp.sax2.Driver.parseSubTree(Driver.java:362)

    I investigated by removing the xpath expression above, so that the out.xml file gets created; its contents are shown at the end of this post.

    What I've concluded so far is that the <html-to-xml> element produces XML output that contains colspan and rowspan attributes
    (note that these were not in the original HTML file). If I remove these attributes (e.g. colspan="1"), the xpath expression
    seems to parse the output OK.
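
    For illustration, the manual workaround described above can be sketched outside Web-Harvest. This is a minimal sketch using only the Python standard library, not Web-Harvest's own mechanism: the sample string and the helper name strip_span_attrs are hypothetical, and the sketch does not reproduce the Saxon NullPointerException. It only shows removing the colspan="1" and rowspan="1" attributes and then selecting the body.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, shortened from the cleaned XML quoted in this post.
CLEANED = (
    '<html><head></head><body>'
    '<b>DNS Checker Home Page</b>'
    '<table border="2">'
    '<tr><td colspan="1" rowspan="1">Col1</td>'
    '<td colspan="1" rowspan="1">Col2</td></tr>'
    '<tr><td colspan="1" rowspan="1">100</td>'
    '<td colspan="1" rowspan="1">101</td></tr>'
    '</table><br clear="none"></br></body></html>'
)


def strip_span_attrs(root):
    """Drop colspan/rowspan attributes whose value is "1" from every <td>."""
    for td in root.iter("td"):
        for name in ("colspan", "rowspan"):
            if td.get(name) == "1":
                del td.attrib[name]
    return root


root = strip_span_attrs(ET.fromstring(CLEANED))
body = root.find("body")  # direct child of <html>, same node as //body here
cells = [td.text for td in body.iter("td")]
print(cells)
```

    Only attributes with the value "1" are removed, so a real colspan="2" in other pages would be left alone.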

    I guess my questions are

    a) Why does <html-to-xml> add colspan="1" and rowspan="1" when they are not in the original HTML file?
    b) Why does removing the colspan and rowspan attributes seem to fix the problem?
    c) Is there a way to stop it from generating these colspan and rowspan attributes in the parsed HTML?

    Thanks

    Bill

    XML produced by the <html-to-xml> element

    <?xml version="1.0" standalone="yes"?>

    <html version="-//W3C//DTD HTML 4.01 Transitional//EN"><head></head><body>

    <b>DNS Checker Home Page</b>
    <table border="2">
    <tr><td colspan="1" rowspan="1">Col1</td><td colspan="1" rowspan="1">Col2</td></tr>
    <tr><td colspan="1" rowspan="1">100</td><td colspan="1" rowspan="1">101</td></tr>
    <tr><td colspan="1" rowspan="1">200</td><td colspan="1" rowspan="1">201</td></tr>
    </table>

    <br clear="none"></br></body></html>

     
    • Vladimir Nikic

      Vladimir Nikic - 2006-10-19

      I have already received some complaints about HTML cleaning. The problem is in TagSoup, the dependent library responsible for transforming HTML to XML. I'll probably replace it with something else (perhaps JTidy). I expect to fix it in the next 5-10 days.

       
    • Bill

      Bill - 2006-10-26

      Wondered if a fix was available for this.
      Many Thanks

      Bill

       
    • Vladimir Nikic

      Vladimir Nikic - 2006-10-27

      Fixed now in Web-Harvest 0.3

       
    • Bill

      Bill - 2006-10-27

      Excellent!
      Bill

       
