Re: [Htmlparser-developer] Incorrect encoding of … on Linux systems
From: Arjohn K. <arj...@ad...> - 2007-04-21 05:16:22
Ian Macfarlane wrote:
> I have encountered an interesting issue with the encoding of the
> character …
>
> On windows, the character is correctly encoded to the three dot
> character. However, on a Linux system it gets encoded to (this might
> not come out right in the email) this: â?¦
>
> According to the W3C doc referenced in the comments -
> http://www.w3.org/TR/REC-html40/sgml/entities.html - which says:
>
> <!ENTITY hellip CDATA "…" -- horizontal ellipsis = three dot leader,
> U+2026 ISOpub -->
>
> both … and … should be encoded to this ellipsis
> character. However, this is not the case.
>
> You can get a minimal testcase of this error by doing just a
> StringBean on the entity solely:
>
> Parser parser = new Parser();
> parser.setInputHTML("…");
> StringBean sb = new StringBean();
> parser.visitAllNodesWith(sb);
> System.out.println(sb.getStrings());
>
> I've worked out that it's located in Translate.decode(..), casting int
> to char. This testcase shows that it doesn't work:
>
> int num = 8230;
> char c = (char)num;
> System.out.println(c);
>
> It would seem that it's something to do with the casting of the int to
> the char that must be platform dependent, and in this case it would
> seem incorrect when run on my Linux box.

It's highly unlikely that a simple type cast is platform dependent. A
more likely cause is a character encoding issue. Either the font that
you use on your Linux installation can't render the character, or the
character is being encoded with the wrong character set.

Note that System.out.println uses the platform's default character
encoding, which may not support this character. You could instead try
writing the character to a file using an OutputStreamWriter with UTF-8
as the encoding. Then open this file in a browser and make sure that
the browser interprets it as UTF-8.

--
Arjohn Kampman, Senior Software Engineer
Aduna - Guided Exploration
www.aduna-software.com
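
The check suggested in the reply could be sketched roughly as follows. This is only an illustration, not code from the thread or from HTMLParser: the class name `EllipsisCheck`, the helper `encodeUtf8`, and the output file name `ellipsis.html` are made up for the example, and it assumes Java 7 or later (for `StandardCharsets` and try-with-resources). A Java `char` is a 16-bit UTF-16 code unit on every platform, so the cast should always yield U+2026; only the bytes produced on output depend on the encoding.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper class, not part of HTMLParser.
class EllipsisCheck {

    // Encode text through an OutputStreamWriter with an explicit UTF-8
    // charset, so the platform default encoding plays no role.
    static byte[] encodeUtf8(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(bytes, StandardCharsets.UTF_8)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        int num = 8230;
        char c = (char) num; // same 16-bit value on every platform

        // Inspect the value itself rather than how the console renders it.
        System.out.println("code unit: U+" + Integer.toHexString(c).toUpperCase());
        System.out.println("default charset: " + Charset.defaultCharset());

        // Print the UTF-8 bytes in hex; U+2026 encodes as E2 80 A6.
        StringBuilder hex = new StringBuilder();
        for (byte b : encodeUtf8(String.valueOf(c))) {
            hex.append(String.format("%02X ", b & 0xFF));
        }
        System.out.println("UTF-8 bytes: " + hex.toString().trim());

        // Write a small page that declares its own encoding, then open it
        // in a browser. The file name is arbitrary.
        String page = "<html><head><meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=UTF-8\"></head><body>"
                + c + "</body></html>";
        try (Writer out = new OutputStreamWriter(
                Files.newOutputStream(Paths.get("ellipsis.html")),
                StandardCharsets.UTF_8)) {
            out.write(page);
        }
    }
}
```

If the reported code unit is U+2026 on both Windows and Linux and the browser shows the ellipsis, the cast is behaving the same everywhere, and the difference most likely lies in the console's default encoding or font.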