[Htmlparser-developer] Incorrect encoding of … on Linux systems
From: Ian M. <ia...@ia...> - 2007-04-20 16:48:43
I have encountered an interesting issue with the encoding of the character entity &hellip;

On Windows, the entity is correctly encoded to the three-dot ellipsis character. However, on a Linux system it gets encoded to this (it might not come out right in the email):

    =E2?=A6

The W3C document referenced in the comments, http://www.w3.org/TR/REC-html40/sgml/entities.html, says:

    <!ENTITY hellip CDATA "&#8230;" -- horizontal ellipsis = three dot leader, U+2026 ISOpub -->

so both &hellip; and &#8230; should be encoded to this ellipsis character. However, this is not the case.

You can get a minimal testcase of this error by running a StringBean on the entity alone:

    Parser parser = new Parser();
    parser.setInputHTML("&hellip;");
    StringBean sb = new StringBean();
    parser.visitAllNodesWith(sb);
    System.out.println(sb.getStrings());

I've worked out that the problem is located in Translate.decode(..), where an int is cast to a char. This testcase shows that it doesn't work:

    int num = 8230;
    char c = (char)num;
    System.out.println(c);

It would seem that the cast from int to char is platform dependent, and in this case it gives an incorrect result when run on my Linux box.

I'd welcome any suggestions people have as to how to fix this.

Thanks

Ian Macfarlane
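
One way to narrow this down is to check the cast separately from the printing. The sketch below is not part of htmlparser; the class name EllipsisCheck and its helper methods are made up for illustration. It rests on two facts about Java: an int-to-char cast gives the same result on every platform, and System.out encodes characters with the platform default charset, which can differ between a Windows and a Linux setup. If the code point comes out as 2026 on both machines, the cast is probably not at fault and the output encoding is the more likely suspect.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EllipsisCheck {
    // Hypothetical helper: numeric value of the char produced by the cast.
    static int castCodePoint(int num) {
        char c = (char) num;
        return (int) c;
    }

    // Hypothetical helper: hex dump of the bytes produced when the char is
    // encoded with the given charset. Unmappable characters are replaced by
    // the charset's replacement byte ('?' for ISO-8859-1).
    static String encodedHex(char c, Charset charset) {
        byte[] bytes = String.valueOf(c).getBytes(charset);
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        char c = (char) 8230;
        // The cast itself: prints 2026 regardless of platform.
        System.out.println(Integer.toHexString(castCodePoint(8230)));
        // The charset System.out would normally use on this machine.
        System.out.println(Charset.defaultCharset());
        // Same char, different encodings.
        System.out.println(encodedHex(c, StandardCharsets.UTF_8));      // E280A6
        System.out.println(encodedHex(c, StandardCharsets.ISO_8859_1)); // 3F
    }
}
```

The UTF-8 bytes E2 80 A6 resemble the =E2?=A6 sequence quoted above, which would be consistent with correct UTF-8 output being mangled somewhere between the JVM and the terminal or mail client, though that is only a guess from this one sample.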