#427 Incorrectly parsed text from page

Milestone: 1.9

Status: closed

Owner: Daniel Gredler

Labels: None

Priority: 5

Updated: 2012-10-21

Created: 2007-02-12

Creator: Alexey Sadykov

Private: No

Hi!
I have some problems getting text from a page:
1. The symbols ① (\u2460) and ② (\u2461) in a table cell are parsed as �@ and �A.
   The text is read with tableRow.getCell(1).getLastChild().
2. The symbol "－" (\uff0d) in the text of an option tag is parsed as "-".
   The text is read with option.getLastChild().getNodeValue().
3. If the href of an anchor ends with tab characters, those characters are converted to spaces.
   Ex: bla
   The href attribute is read with anchor.getHrefAttribute().
   The value will be: "http://blablabla.com/test.html "

Is it possible to fix these problems?

Discussion

Marc Guillemot - 2007-03-02

Logged In: YES
user_id=402164
Originator: NO

Can you provide a minimal HTML example as well as the encoding of the page?
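
The page encoding matters because a charset mismatch can produce output of exactly this shape. The sketch below illustrates one possible cause, which is an assumption and is not confirmed anywhere in this report: the page is served in windows-31j (Microsoft's Shift_JIS variant), where ① and ② are the byte pairs 0x87 0x40 and 0x87 0x41, and those bytes are decoded with a different charset (UTF-8 here). The class name MojibakeSketch and its members are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeSketch {
    // windows-31j bytes for ① and ②, hardcoded so the sketch does not
    // depend on that charset being installed in the running JVM.
    static final byte[] CIRCLED_ONE = { (byte) 0x87, 0x40 };
    static final byte[] CIRCLED_TWO = { (byte) 0x87, 0x41 };

    // Decodes the bytes with the wrong charset (UTF-8). The byte 0x87 is not
    // a valid UTF-8 lead byte, so it becomes U+FFFD; the second byte is
    // plain ASCII and survives as '@' or 'A'.
    static String decodeAsUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(decodeAsUtf8(CIRCLED_ONE).equals("\uFFFD@")); // prints "true"
        System.out.println(decodeAsUtf8(CIRCLED_TWO).equals("\uFFFDA")); // prints "true"
    }
}
```

If the real page shows this pattern, the declared charset (HTTP header or META tag) and the actual bytes of the page are worth comparing first.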

Logged In: YES
user_id=950730
Originator: NO

Dear Alexey,

It seems to me that HtmlUnit works fine for points #1 and #2. I am not sure whether an href is allowed to contain a tab character ('\t').

Below are two methods: one generates the HTML, and the other tests HtmlUnit.
The output is:
2460
2461
FF0D

Which is correct.

I believe you should check the values with String.codePointAt() and print them as numbers, because printing the characters themselves (e.g. the result of String.charAt()) depends on the encoding of the output stream (e.g. System.out).

Please advise if you have any issue.

Ahmed Ashour

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;

import com.gargoylesoftware.htmlunit.CollectingAlertHandler;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlSelect;
import com.gargoylesoftware.htmlunit.html.HtmlTable;

// Writes a UTF-8 test page containing the three characters from the report.
public static void generateCharacters() throws IOException {
    Writer writer = new OutputStreamWriter( new FileOutputStream( "somewhere.html" ), "UTF-8" );
    writer.write( "<html>\n" );
    writer.write( "<head>\n" );
    writer.write( "<META http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">" );
    writer.write( "</head>\n" );
    writer.write( "<body>\n" );
    writer.write( "<table id=myTable>\n" );
    writer.write( "<tr>\n" );
    writer.write( "<td>\u2460 \u2461</td>\n" );
    writer.write( "</tr>\n" );
    writer.write( "</table>\n" );
    writer.write( "<select id=mySelect>\n" );
    writer.write( "<option>" + "\uff0d" + "</option>\n" );
    writer.write( "</select>\n" );
    writer.write( "</body>\n" );
    writer.write( "</html>\n" );
    writer.close();
}

// Loads the page and prints the code points as hex, so the result
// does not depend on the console encoding.
private static void testCharacters( String url ) throws Exception {
    WebClient client = new WebClient();
    CollectingAlertHandler collectingAlertHandler = new CollectingAlertHandler();
    client.setAlertHandler( collectingAlertHandler );
    HtmlPage page1 = (HtmlPage)client.getPage( url );
    HtmlTable table = (HtmlTable)page1.getHtmlElementById( "myTable" );
    // Cell text is "\u2460 \u2461": index 0 is the first symbol, index 2 the second.
    String s = table.getRow( 0 ).getCell( 0 ).getLastChild().toString();
    System.out.println( Integer.toHexString( s.codePointAt( 0 ) ).toUpperCase() );
    System.out.println( Integer.toHexString( s.codePointAt( 2 ) ).toUpperCase() );
    HtmlSelect select = (HtmlSelect)page1.getHtmlElementById( "mySelect" );
    System.out.println( Integer.toHexString( select.getOption(0).getLastChild().getNodeValue().codePointAt( 0 ) ).toUpperCase() );
}
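
The same kind of check can be made on any string without HtmlUnit. The following is a minimal sketch, not part of the original comment: the class CodePointDump and the helper toHex are hypothetical names, and the code assumes Java 8 or later. It formats every code point as hex, so the result is readable no matter how the console is encoded.

```java
public class CodePointDump {
    // Returns the code points of s as "U+XXXX" tokens separated by spaces.
    static String toHex(String s) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            // Advance by one or two chars, depending on whether cp is a surrogate pair.
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHex("\u2460 \u2461")); // prints "U+2460 U+0020 U+2461"
        System.out.println(toHex("\uff0d"));        // prints "U+FF0D"
    }
}
```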

Marc Guillemot - 2007-04-09

Logged In: YES
user_id=402164
Originator: NO

Regarding #3: Firefox seems to trim spaces and tabs from the href. Alexey, is that what you expected?

Daniel Gredler - 2007-04-20

Logged In: YES
user_id=1109422
Originator: NO

Ahmed seems to have addressed points 1 and 2; I've committed a fix for point 3 (HtmlUnit now behaves like other browsers and trims whitespace off of the href attribute). I'm closing this as fixed, but feel free to reopen or create a new bug report if there's something we've missed.
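
For readers who need the same behavior on an older HtmlUnit version, the trimming can be approximated in client code. This is only a sketch of the idea and not the committed fix: the class HrefTrim and the method trimHref are hypothetical, and the set of characters treated as whitespace (space, tab, line feed, form feed, carriage return) is an assumption based on how browsers commonly handle attribute values.

```java
public class HrefTrim {
    // Assumed whitespace set: space, tab, line feed, form feed, carriage return.
    static boolean isAsciiWhitespace(char c) {
        return c == ' ' || c == '\t' || c == '\n' || c == '\f' || c == '\r';
    }

    // Removes leading and trailing whitespace; interior characters are untouched.
    static String trimHref(String href) {
        int start = 0;
        int end = href.length();
        while (start < end && isAsciiWhitespace(href.charAt(start))) {
            start++;
        }
        while (end > start && isAsciiWhitespace(href.charAt(end - 1))) {
            end--;
        }
        return href.substring(start, end);
    }

    public static void main(String[] args) {
        // URL taken from the report, with trailing tabs added.
        String raw = "http://blablabla.com/test.html\t\t";
        System.out.println("[" + trimHref(raw) + "]"); // prints "[http://blablabla.com/test.html]"
    }
}
```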
