Thread: [Htmlparser-developer] Incorrect encoding of &#8230; on Linux systems
From: Ian M. <ia...@ia...> - 2007-04-20 16:48:43
|
I have encountered an interesting issue with the encoding of the character &#8230;

On windows, the character is correctly encoded to the three dot character. However, on a Linux system it gets encoded to (this might not come out right in the email) this: â?¦

According to the W3C doc referenced in the comments - http://www.w3.org/TR/REC-html40/sgml/entities.html - which says:

<!ENTITY hellip CDATA "&#8230;" -- horizontal ellipsis = three dot leader, U+2026 ISOpub -->

both &hellip; and &#8230; should be encoded to this ellipsis character. However, this is not the case.

You can get a minimal testcase of this error by doing just a StringBean on the entity solely:

Parser parser = new Parser();
parser.setInputHTML("&#8230;");
StringBean sb = new StringBean();
parser.visitAllNodesWith(sb);
System.out.println(sb.getStrings());

I've worked out that it's located in Translate.decode(..), casting int to char. This testcase shows that it doesn't work:

int num = 8230;
char c = (char)num;
System.out.println(c);

It would seem that it's something to do with the casting of the int to the char that must be platform dependent, and in this case it would seem incorrect when run on my Linux box.

I'd welcome any suggestions people have as to how to fix this.

Thanks

Ian Macfarlane
|
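One way to separate the cast from the console output is to print the numeric value of the resulting char instead of the glyph, since hexadecimal digits are plain ASCII and survive any console encoding. The following is only a sketch: the class name CastCheck and its helper methods are hypothetical and are not part of HTMLParser.

```java
class CastCheck {
    // Casts the numeric character reference to a char, as in the testcase
    // above, and returns the char's value as lowercase hexadecimal.
    static String castToHex(int num) {
        char c = (char) num;
        return Integer.toHexString(c);
    }

    // True if the cast result is U+2026 HORIZONTAL ELLIPSIS.
    static boolean isEllipsis(int num) {
        return (char) num == '\u2026';
    }

    public static void main(String[] args) {
        // ASCII-only output, so the console encoding cannot distort it.
        System.out.println("code unit: " + castToHex(8230));
        System.out.println("is U+2026: " + isEllipsis(8230));
        // The default encoding is what System.out uses when printing the glyph.
        System.out.println("file.encoding: " + System.getProperty("file.encoding"));
    }
}
```

If the first two lines report 2026 and true on both Windows and Linux, the cast itself behaves identically on both, and any difference has to come from how the character is written out or displayed.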
From: Arjohn K. <arj...@ad...> - 2007-04-21 05:16:22
|
Ian Macfarlane wrote:
> I have encountered an interesting issue with the encoding of the
> character &#8230;
>
> On windows, the character is correctly encoded to the three dot
> character. However, on a Linux system it gets encoded to (this might
> not come out right in the email) this: â?¦
>
> According to the W3C doc referenced in the comments -
> http://www.w3.org/TR/REC-html40/sgml/entities.html - which says:
>
> <!ENTITY hellip CDATA "&#8230;" -- horizontal ellipsis = three dot leader,
> U+2026 ISOpub -->
>
> both &hellip; and &#8230; should be encoded to this ellipsis
> character. However, this is not the case.
>
> You can get a minimal testcase of this error by doing just a
> StringBean on the entity solely:
>
> Parser parser = new Parser();
> parser.setInputHTML("&#8230;");
> StringBean sb = new StringBean();
> parser.visitAllNodesWith(sb);
> System.out.println(sb.getStrings());
>
> I've worked out that it's located in Translate.decode(..), casting int
> to char. This testcase shows that it doesn't work:
>
> int num = 8230;
> char c = (char)num;
> System.out.println(c);
>
> It would seem that it's something to do with the casting of the int to
> the char that must be platform dependent, and in this case it would
> seem incorrect when run on my Linux box.

It's highly unlikely that a simple type cast is platform dependent. A more likely cause is a character encoding issue. Either the font set that you use on your Linux installation can't render the character correctly, or the character is encoded using the wrong character set.

Note that System.out.println uses the platform's default character encoding, which may not support this character. You could instead try to write the character to a file using OutputStreamWriter with UTF-8 as the encoding. Then try opening this file in a browser and make sure that it renders the file as UTF-8.

--
Arjohn Kampman, Senior Software Engineer
Aduna - Guided Exploration
www.aduna-software.com
|
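The suggestion above could be tried along these lines. This is a minimal sketch, not code from HTMLParser: the class name EllipsisUtf8, its helper methods and the output file name ellipsis.html are all hypothetical, and it uses StandardCharsets and try-with-resources, which need Java 7 or later and so post-date this thread.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EllipsisUtf8 {
    // Writes text to a file as UTF-8, independent of the platform default encoding.
    static void writeUtf8(String path, String text) throws IOException {
        try (Writer out = new OutputStreamWriter(new FileOutputStream(path), StandardCharsets.UTF_8)) {
            out.write(text);
        }
    }

    // Encodes text in the given charset and returns the bytes as spaced uppercase hex.
    static String hexBytes(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String ellipsis = String.valueOf((char) 8230);
        // The meta tag asks the browser to read the file as UTF-8.
        writeUtf8("ellipsis.html", // hypothetical output file
                "<html><head><meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">"
                + "</head><body>" + ellipsis + "</body></html>");
        // U+2026 encodes to three bytes in UTF-8.
        System.out.println("UTF-8:      " + hexBytes(ellipsis, StandardCharsets.UTF_8));
        // ISO-8859-1 cannot represent U+2026, so getBytes substitutes '?'.
        System.out.println("ISO-8859-1: " + hexBytes(ellipsis, StandardCharsets.ISO_8859_1));
    }
}
```

The UTF-8 bytes are E2 80 A6. Read as a single-byte Latin-1 style encoding, E2 is "â" and A6 is "¦", which matches the two outer characters of the "â?¦" reported above. That is consistent with, though not proof of, the output being correct UTF-8 that was displayed using a different character set.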
From: Nie H. <ked...@16...> - 2007-07-18 03:37:17
|
----- Original Message -----
From: "Ian Macfarlane" <ia...@ia...>
To: <htmlparser-developer@lists.sourceforge.net>
Sent: Saturday, April 21, 2007 12:48 AM
Subject: [Htmlparser-developer] Incorrect encoding of &#8230;on Linux systems

[Ian Macfarlane's original message quoted in full, with no reply text added]
|