[Htmlparser-developer] Incorrect encoding of … on Linux systems

I have encountered an interesting issue with the encoding of the
character &#8230;

On Windows, the character is correctly encoded to the three-dot
(horizontal ellipsis) character. However, on a Linux system it comes out
as this (it might not come out right in the email): =E2?=A6

The W3C doc referenced in the comments -
http://www.w3.org/TR/REC-html40/sgml/entities.html - says:

<!ENTITY hellip   CDATA "&#8230;" -- horizontal ellipsis = three dot leader,
                                     U+2026 ISOpub  -->

So both &hellip; and &#8230; should be encoded to this ellipsis
character (U+2026). However, this is not the case.

You can reproduce this with a minimal testcase that runs a StringBean
over just the entity on its own:

Parser parser = new Parser();
parser.setInputHTML("&#8230;");
StringBean sb = new StringBean();
parser.visitAllNodesWith(sb);
System.out.println(sb.getStrings());

I've traced it to Translate.decode(..), where an int is cast to a
char. This testcase shows the same wrong output:

int num = 8230;
char c = (char)num;
System.out.println(c);

It would seem that the cast from int to char is platform dependent,
and that it gives the wrong result when run on my Linux box.
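
One way to narrow this down is to check the cast and the console output
separately. The sketch below is only an illustration (the class name
EllipsisCheck and its two helper methods are made up for this example and
are not part of htmlparser). The Java language defines the narrowing cast
from int to char as keeping the low 16 bits, so the cast itself should
give U+2026 on every platform. What can differ between machines is the
default charset that System.out uses to turn the char into bytes, and how
the terminal then interprets those bytes. The bytes quoted above look
like they could be the UTF-8 form of U+2026 (E2 80 A6) with the middle
byte lost, which would point at the output side rather than the cast,
but that is a guess.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

public class EllipsisCheck {

    // Same cast as in the testcase above: int value of the numeric
    // character reference narrowed to a char.
    static char castCodePoint(int num) {
        return (char) num;
    }

    // Print one char through a PrintStream that uses the named charset
    // and return the bytes it produced as an uppercase hex string.
    static String printedBytes(char c, String charsetName)
            throws UnsupportedEncodingException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        PrintStream out = new PrintStream(buf, true, charsetName);
        out.print(c);
        out.flush();
        StringBuilder hex = new StringBuilder();
        for (byte b : buf.toByteArray()) {
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        char c = castCodePoint(8230);

        // The char value itself, independent of any console encoding.
        System.out.println("code point: U+"
                + Integer.toHexString(c).toUpperCase());

        // The charset System.out is likely to use on this machine.
        System.out.println("default charset: " + Charset.defaultCharset());

        // Bytes produced when the output charset is forced to UTF-8
        // (E280A6 for U+2026).
        System.out.println("UTF-8 bytes: " + printedBytes(c, "UTF-8"));

        // Bytes produced for a charset that has no ellipsis character.
        System.out.println("ISO-8859-1 bytes: "
                + printedBytes(c, "ISO-8859-1"));
    }
}
```

If the first line prints U+2026 on both Windows and Linux, the cast is
behaving the same on both, and the difference is more likely in the
default charset or in the terminal. In that case, wrapping System.out in
a PrintStream with an explicit charset, as printedBytes does, may be
enough to get consistent output.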

I'd welcome any suggestions people have as to how to fix this.

Thanks

Ian Macfarlane
