
#427 Incorrectly parsed text from page

1.9
closed
None
5
2012-10-21
2007-02-12
No

Hi!
I have some problems getting text from a page.

  1. The symbols ① (\u2460) and ② (\u2461) in a table cell are parsed as �@ and �A.
    The text is read with tableRow.getCell(1).getLastChild().

  2. The symbol "－" (\uff0d) in the text of an option tag is parsed as "-": option.getLastChild().getNodeValue()

  3. If the href of an anchor ends with a tab character, the tab is converted to a space.
    Ex: bla
    Get the href attribute: anchor.getHrefAttribute()
    The value will be: "http://blablabla.com/test.html "

Is it possible to fix these problems?

Discussion

  • Marc Guillemot - 2007-03-02

    Logged In: YES
    user_id=402164
    Originator: NO

    Can you provide a minimal HTML example, as well as the encoding of the page?
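The page encoding matters because the reported output looks like a decoding mismatch. One possible explanation, which is an assumption and not confirmed anywhere in this thread: in Windows-31J (CP932), ① and ② are the byte pairs 0x87 0x40 and 0x87 0x41. A decoder that does not recognise the lead byte 0x87 emits a replacement character and then reads the second byte as ASCII '@' or 'A'. The sketch below reproduces that effect with hard-coded bytes, using US-ASCII as a stand-in for the mismatched decoder. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of HtmlUnit.
public class MojibakeDemo {

    // Assumed byte pairs for the two symbols in Windows-31J (CP932).
    static final byte[] CIRCLED_ONE = { (byte) 0x87, 0x40 };
    static final byte[] CIRCLED_TWO = { (byte) 0x87, 0x41 };

    // Decodes bytes with a charset that cannot interpret the lead byte 0x87.
    // US-ASCII stands in for whatever decoder was actually used.
    static String decodeWithWrongCharset(byte[] bytes) {
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String one = decodeWithWrongCharset(CIRCLED_ONE);
        String two = decodeWithWrongCharset(CIRCLED_TWO);
        // Print code points rather than glyphs so the console encoding does not matter.
        one.codePoints().forEach(cp -> System.out.println(Integer.toHexString(cp)));
        two.codePoints().forEach(cp -> System.out.println(Integer.toHexString(cp)));
    }
}
```

Running main prints fffd, 40, fffd, 41, that is, U+FFFD followed by '@' and U+FFFD followed by 'A', which matches the shape of the �@ and �A in the report.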

     
  • Ahmed Ashour - 2007-04-08

    Logged In: YES
    user_id=950730
    Originator: NO

    Dear Alexey,

    It seems to me that HtmlUnit works fine regarding points #1 and #2. I'm not sure whether an href is allowed to contain a tab character ('\t').

    Below are two methods, one to generate the HTML and the other to test HtmlUnit.
    The output is:
    2460
    2461
    FF0D

    Which is correct.

    I believe you should check the values with String.codePointAt(), because what you see when printing a character from String.charAt() depends on the encoding of the OutputStream (e.g. System.out).

    Please advise if you have any issue.

    Ahmed Ashour

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.OutputStreamWriter;
    import java.io.Writer;

    // Writes a UTF-8 test page containing the three characters from the report.
    public static void generateCharacters() throws IOException {
        Writer writer = new OutputStreamWriter( new FileOutputStream( "somewhere.html" ), "UTF-8" );
        try {
            writer.write( "<html>\n" );
            writer.write( "<head>\n" );
            writer.write( "<META http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">" );
            writer.write( "</head>\n" );
            writer.write( "<body>\n" );
            writer.write( "<table id=myTable>\n" );
            writer.write( "<tr>\n" );
            writer.write( "<td>\u2460 \u2461</td>\n" );
            writer.write( "</tr>\n" );
            writer.write( "</table>\n" );
            writer.write( "<select id=mySelect>\n" );
            writer.write( "<option>" + "\uff0d" + "</option>\n" );
            writer.write( "</select>\n" );
            writer.write( "</body>\n" );
            writer.write( "</html>\n" );
        } finally {
            // Close the file even if a write fails.
            writer.close();
        }
    }
    

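    import com.gargoylesoftware.htmlunit.CollectingAlertHandler;
    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;
    import com.gargoylesoftware.htmlunit.html.HtmlSelect;
    import com.gargoylesoftware.htmlunit.html.HtmlTable;

    // Loads the generated page and prints the code points HtmlUnit parsed.
    private static void testCharacters( String url ) throws Exception {
        WebClient client = new WebClient();
        CollectingAlertHandler collectingAlertHandler = new CollectingAlertHandler();
        client.setAlertHandler( collectingAlertHandler );
        HtmlPage page1 = (HtmlPage)client.getPage( url );
        HtmlTable table = (HtmlTable)page1.getHtmlElementById( "myTable" );
        String s = table.getRow( 0 ).getCell( 0 ).getLastChild().toString();
        // The cell text is "\u2460 \u2461": index 0 is the first symbol, index 2 the second.
        System.out.println( Integer.toHexString( s.codePointAt( 0 ) ) );
        System.out.println( Integer.toHexString( s.codePointAt( 2 ) ) );
        HtmlSelect select = (HtmlSelect)page1.getHtmlElementById( "mySelect" );

        // Upper-cased so that \uff0d prints as FF0D.
        System.out.println( Integer.toHexString( select.getOption(0).getLastChild().getNodeValue().codePointAt( 0 ) ).toUpperCase() );
    }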
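
The advice above to compare code points rather than printed characters can be illustrated without HtmlUnit. The following is a minimal sketch with hypothetical class and method names: the code point of a character in a String is fixed, whereas the bytes written to an output stream depend on that stream's charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of HtmlUnit.
public class CodePointDemo {

    // Returns the code point at the given index as upper-case hex.
    static String hexCodePoint(String s, int index) {
        return Integer.toHexString(s.codePointAt(index)).toUpperCase();
    }

    // Encodes the text the way a stream using the given charset would.
    static byte[] encodedAs(String s, Charset charset) {
        return s.getBytes(charset);
    }

    public static void main(String[] args) {
        String cell = "\u2460 \u2461";
        System.out.println(hexCodePoint(cell, 0));      // 2460
        System.out.println(hexCodePoint(cell, 2));      // 2461
        System.out.println(hexCodePoint("\uff0d", 0));  // FF0D

        // A charset that lacks the character substitutes '?' (0x3F) on output.
        byte[] ascii = encodedAs("\u2460", StandardCharsets.US_ASCII);
        System.out.println(ascii.length + " byte(s), first = 0x" + Integer.toHexString(ascii[0]));
    }
}
```

The last line prints "1 byte(s), first = 0x3f": the String still holds U+2460, but an ASCII-encoded stream can only show a question mark, which is why printed output alone is not a reliable check.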
    
     
  • Marc Guillemot - 2007-04-09

    Logged In: YES
    user_id=402164
    Originator: NO

    Regarding #3: Firefox seems to trim spaces and tabs from href. Alexey, is that what you expected?

     
  • Daniel Gredler - 2007-04-20

    Logged In: YES
    user_id=1109422
    Originator: NO

    Ahmed seems to have addressed points 1 and 2; I've committed a fix for point 3 (HtmlUnit now behaves like other browsers and trims whitespace off of the href attribute). I'm closing this as fixed, but feel free to reopen or create a new bug report if there's something we've missed.
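
The behaviour described for point 3 amounts to stripping leading and trailing whitespace from the attribute value. A minimal sketch follows. It is not the committed HtmlUnit change, and the class name, method names, and exact set of characters treated as whitespace are assumptions for illustration.

```java
// Hypothetical demo class; not the actual HtmlUnit fix.
public class HrefTrimDemo {

    // Assumed whitespace set: space, tab, line feed, form feed, carriage return.
    static boolean isHtmlWhitespace(char c) {
        return c == ' ' || c == '\t' || c == '\n' || c == '\f' || c == '\r';
    }

    // Removes leading and trailing whitespace; inner characters are left untouched.
    static String trimHref(String href) {
        int start = 0;
        int end = href.length();
        while (start < end && isHtmlWhitespace(href.charAt(start))) {
            start++;
        }
        while (end > start && isHtmlWhitespace(href.charAt(end - 1))) {
            end--;
        }
        return href.substring(start, end);
    }

    public static void main(String[] args) {
        // The brackets make any leftover trailing whitespace visible.
        System.out.println("[" + trimHref("http://blablabla.com/test.html\t") + "]");
    }
}
```

With the href from the report, this prints [http://blablabla.com/test.html] with no trailing space or tab.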

     
