Hi!
I have some problems getting text from a page.
1. The symbols ① (\u2460) and ② (\u2461) are parsed as �@ and �A when read from a table cell.
The text is read via tableRow.getCell(1).getLastChild().
2. The symbol "－" (\uff0d) is parsed as "-" when read as the text of an option tag: option.getLastChild().getNodeValue()
3. If the href of an anchor ends with a tab character, the tab is converted to a space.
Ex: bla
Reading the href attribute: anchor.getHrefAttribute()
The returned value is: "http://blablabla.com/test.html " (note the trailing space)
Is it possible to fix these problems?
Logged In: YES
user_id=402164
Originator: NO
Can you provide a minimal HTML example, as well as the encoding of the page?
Logged In: YES
user_id=950730
Originator: NO
Dear Alexey,
It seems to me that HtmlUnit works fine regarding points #1 and #2. I am not sure whether an href is allowed to contain a tab character ('\t') or not.
Below are two methods, one to generate the HTML and the other to test HtmlUnit:
The output is:
2460
2461
FF0D
Which is correct.
I believe you should print the numeric value from String.codePointAt() rather than the character returned by String.charAt(), because how a printed character appears depends on the encoding of the OutputStream (e.g. System.out).
Please advise if you have any issue.
Ahmed Ashour
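For readers following along, here is a minimal sketch of the code-point check suggested in the comment above. It is not the test code from that comment, and it does not use HtmlUnit at all: the class name CodePointDump and the method hexCodePoints are hypothetical, and the input string simply stands in for text obtained from a parsed page. Printing hex values instead of the characters themselves keeps the result independent of the console encoding.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of HtmlUnit.
class CodePointDump {

    // Returns the upper-case hex value of every Unicode code point in the text.
    static List<String> hexCodePoints(String text) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            result.add(String.format("%X", codePoint));
            // Advance by 1 or 2 chars, depending on whether this is a surrogate pair.
            i += Character.charCount(codePoint);
        }
        return result;
    }

    public static void main(String[] args) {
        // \u2460 and \u2461 are the circled digits, \uff0d is the fullwidth hyphen-minus.
        // In a real check this string would come from the parsed page instead.
        for (String hex : hexCodePoints("\u2460\u2461\uff0d")) {
            System.out.println(hex);
        }
    }
}
```

If the values printed for the parsed text are 2460, 2461 and FF0D, the characters were read correctly and any garbling happens only when they are written to the console.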
Logged In: YES
user_id=402164
Originator: NO
Regarding #3: Firefox seems to trim spaces and tabs from the href. Alexey, is that what you expected?
Logged In: YES
user_id=1109422
Originator: NO
Ahmed seems to have addressed points 1 and 2; I've committed a fix for point 3 (HtmlUnit now behaves like other browsers and trims whitespace from the href attribute). I'm closing this as fixed, but feel free to reopen or create a new bug report if there's something we've missed.
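As a rough illustration of the trimming behaviour described above, the sketch below strips leading and trailing whitespace from a raw href value. It is an assumption about the general idea, not the committed HtmlUnit change: the class HrefTrim and the method normalizeHref are hypothetical names, and String.trim() is used here only because it removes tabs and spaces at both ends.

```java
// Hypothetical helper for illustration only; not HtmlUnit's implementation.
class HrefTrim {

    // Removes leading and trailing whitespace (spaces, tabs, line breaks)
    // from a raw href value, leaving the rest of the URL untouched.
    static String normalizeHref(String rawHref) {
        return rawHref.trim();
    }

    public static void main(String[] args) {
        // The URL from the report, with a trailing tab character.
        String raw = "http://blablabla.com/test.html\t";
        System.out.println("[" + normalizeHref(raw) + "]");
    }
}
```

With this normalization, the href from the report comes back without the trailing tab or space.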