
#1825 HtmlPage.asText() outputs html code

Milestone: 2.23
Status: closed
Owner: RBRi
Labels: None
Priority: 1
Updated: 2016-09-30
Created: 2016-09-23
Private: No

Test code:

        // Needs com.gargoylesoftware.htmlunit.WebClient and com.gargoylesoftware.htmlunit.html.HtmlPage
        WebClient client = new WebClient();
        client.getOptions().setThrowExceptionOnScriptError(false);
        HtmlPage p = client.getPage("http://news.sznews.com/content/2016-09/21/content_13889421.htm");
        logger.info(p.asText());

This worked correctly in HtmlUnit 2.14.

Discussion

  • RBRi

    RBRi - 2016-09-23

    Did a quick analysis. The DOM tree of the page contains a text node with that HTML content as its text. This might be the result of an HTML parser problem or, more likely, some JavaScript problem.

    If you are able to point to the problematic code, I will try to fix it. But the page is far too complex and I'm not able to read/understand all these characters :-).
    Sorry, our time is limited; you will have to help us a bit with this problem.

     
  • Rural Hunter

    Rural Hunter - 2016-09-23

    OK, I will try to dig deeper. Thanks.

     
  • Rural Hunter

    Rural Hunter - 2016-09-27

    I reproduced the problem with this minimal HTML:

    <html>
        <head>
            <title>aaaaaaa</title>
        </head>
        <body>
            aaaaa
            <div>
                <p><iframe /></p>
            </div>
        </body>
    </html>
    

    It's the self-closed iframe node that causes the problem.
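
    A minimal sketch of why a self-closed iframe can swallow the markup that follows it. It assumes the parser follows the HTML parsing rules, where the trailing slash in `<iframe />` is ignored and everything up to a matching `</iframe>` (or the end of the input) is kept as raw text inside the iframe. The class and method names are hypothetical, the code is not taken from HtmlUnit, and it only handles lowercase tag names.

```java
class IframeRawTextDemo {

    /**
     * Returns the raw text that would end up inside the first iframe
     * element, or null if the markup contains no iframe start tag.
     * Assumption: tag names are lowercase, as in the test page above.
     */
    static String iframeRawText(String html) {
        int open = html.indexOf("<iframe");
        if (open < 0) {
            return null;
        }
        // End of the start tag; a "/" before ">" does not close the element.
        int tagEnd = html.indexOf('>', open);
        if (tagEnd < 0) {
            return null;
        }
        int contentStart = tagEnd + 1;
        // Without an explicit end tag, the raw text runs to the end of the input.
        int close = html.indexOf("</iframe", contentStart);
        if (close < 0) {
            close = html.length();
        }
        return html.substring(contentStart, close);
    }

    public static void main(String[] args) {
        String selfClosed = "<body>aaaaa<div><p><iframe /></p></div></body>";
        String explicit = "<body>aaaaa<div><p><iframe></iframe></p></div></body>";
        System.out.println("[" + iframeRawText(selfClosed) + "]"); // prints [</p></div></body>]
        System.out.println("[" + iframeRawText(explicit) + "]");   // prints []
    }
}
```

    With the self-closed form, the closing tags of the surrounding elements become text inside the iframe. A text dump that includes that text node would then show them as literal HTML, while writing `<iframe></iframe>` leaves the iframe empty.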

     
  • RBRi

    RBRi - 2016-09-27

    I have added a simple test case for this, and it looks like the browsers fail in the same way. Please verify.

     
  • RBRi

    RBRi - 2016-09-27
    • status: open --> pending
    • assigned_to: RBRi
     
  • Rural Hunter

    Rural Hunter - 2016-09-29

    I don't understand what you mean by 'browsers are failing in the same way'. I opened the test page in Firefox and it shows the text 'aaaaa' and an empty iframe. In HtmlUnit, the asText() method outputs HTML code:

    aaaaa
    </p> </div> </body> </html>
    
     
  • RBRi

    RBRi - 2016-09-29

    Try this with your real browser; I hope this helps.
    And please have a look at commit 13004.

    <html>
    <head>
    <title>t</title>
    </head>
    <body>
      abc
      <div>
        def
        <p>
          ghi
          <iframe />
          jkl
        </p>
        mno
      </div>
      pqr
    </body>
    </html>
    
     
  • Rural Hunter

    Rural Hunter - 2016-09-29

    I tried your test case too and the result is the same for me. My point is that with a real browser I don't see any HTML code. With asText() in HtmlUnit, I see the HTML code:

    abc
    def
    ghi jkl </p> mno </div> pqr </body> </html>
    
     
    • RBRi

      RBRi - 2016-09-29

      Ah OK, now I get your point (hopefully). I will have a look.

       
  • RBRi

    RBRi - 2016-09-29

    I think this is fixed now. Sorry for the long journey until I got your point.

     
  • Rural Hunter

    Rural Hunter - 2016-09-30

    Thanks. I confirm it's fixed. Sorry I didn't make my point clear enough.

     
  • RBRi

    RBRi - 2016-09-30
    • status: pending --> closed
     
