Wrong encoding with default + specified charset

Brought to you by: andyc2, mguillem

#146 Wrong encoding with default + specified charset

Status: open

Owner: nobody

Labels: scanner (58)

Priority: 5

Updated: 2013-01-29

Created: 2013-01-29

Creator: qqilihq

Private: No

We have set NekoHTML to use UTF-8 as the default encoding, as this works best for our use case. When parsing web pages that use a Cyrillic charset, though, some text gets misinterpreted. Example as follows:

DOMParser parser = new DOMParser(new HTMLConfiguration());
parser.setProperty("http://cyberneko.org/html/properties/default-encoding", "UTF-8");
parser.parse("http://www.kommersant.ru");

-------

Produces:

<?xml version="1.0" encoding="UTF-8" standalone="no"?><HTML xmlns="http://www.w3.org/1999/xhtml" id="nojs"><HEAD>
<TITLE>�� </TITLE>
<META content="IE=edge" http-equiv="X-UA-Compatible"/>
<SCRIPT type="text/javascript"> document.documentElement.id = "js"; </SCRIPT>
<LINK href="http://www.kommersant.ru/favicon.ico" rel="Shortcut Icon"/>
<META content="�� " name="title"/>
<META content="�� ." name="description"/>
<META content="��,��,��,��,��,��,��,��,��,��,��,��,��,��,��,��,��,�� ,�� ,�� ,�� ,��,��,Weekend,�� ,�� ,�� " name="keywords"/>
<LINK href="http://www.kommersant.ru/content/pics/kommlogo100x75.jpg" rel="image_src"/>

[…]

-------

Notice that the Cyrillic characters are scrambled *until* the META tag specifying the windows-1251 charset has been read. From my understanding, and from looking at the source, the input should be re-read after an encoding has been detected, or am I wrong?
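
To make the expected behaviour concrete, here is a minimal sketch that uses only the JDK and is not NekoHTML code. It encodes a tiny page as windows-1251, decodes it once with the UTF-8 default (which yields U+FFFD replacement characters, as in the output above), and then decodes it again after finding the charset in the raw bytes. The class name `RereadSketch`, the helpers `sniffCharset` and `decodeWithReread`, and the sample markup are all made up for illustration. How NekoHTML actually buffers and re-reads its input may differ.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RereadSketch {

    // Hypothetical helper pattern: matches charset=NAME inside the markup.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9][\\w.:+-]*)",
            Pattern.CASE_INSENSITIVE);

    // Looks for a declared charset in the raw bytes; returns the fallback
    // if none is declared or the declared one is not supported by this JVM.
    public static Charset sniffCharset(byte[] raw, Charset fallback) {
        // ISO-8859-1 maps every byte to exactly one char, so the ASCII
        // markup can be searched without knowing the real encoding yet.
        String ascii = new String(raw, StandardCharsets.ISO_8859_1);
        Matcher m = CHARSET.matcher(ascii);
        if (m.find() && Charset.isSupported(m.group(1))) {
            return Charset.forName(m.group(1));
        }
        return fallback;
    }

    // Decodes the whole input from the start with the detected charset,
    // i.e. the "re-read" that this ticket asks about.
    public static String decodeWithReread(byte[] raw, Charset defaultCharset) {
        return new String(raw, sniffCharset(raw, defaultCharset));
    }

    public static void main(String[] args) {
        // A Cyrillic sample title written with escapes, so the source file
        // encoding does not matter. The page itself is an assumption.
        String title = "\u041A\u043E\u043C\u043C\u0435\u0440\u0441\u0430\u043D\u0442\u044A";
        String page = "<html><head><title>" + title + "</title>"
                + "<meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=windows-1251\"></head></html>";
        byte[] raw = page.getBytes(Charset.forName("windows-1251"));

        // Single pass with the UTF-8 default: the title bytes are not valid UTF-8.
        String singlePass = new String(raw, StandardCharsets.UTF_8);
        System.out.println("single pass has U+FFFD: " + singlePass.contains("\uFFFD"));

        // Decode again from the first byte once the charset is known.
        String reread = decodeWithReread(raw, StandardCharsets.UTF_8);
        System.out.println("re-read keeps title:    " + reread.contains(title));
    }
}
```

The point of the sketch is only that text appearing before the META tag, such as the TITLE, can be recovered if the bytes are kept and decoded again from the beginning.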
