#1311 Wrong default content encoding for XML files

Status: closed

Owner: RBRi

Labels: None

Priority: 5

Updated: 2014-08-12

Created: 2011-08-03

Creator:

Private: No

HtmlUnit uses Text.DEFAULT_CHARSET as the default encoding if no clue can be found in the HTTP header, the BOM, or the content. For HTML pages this might be correct, but for XML pages (such as RSS feeds) it does not follow the standard, which expects an XML document with no BOM and no encoding declaration to be UTF-8. See http://www.w3.org/TR/xml/#charencoding (or, in a more readable format, http://www.opentag.com/xfaq_enc.htm#enc_default).

This would require at least changes to:

    public String getContentType() {
        final String contentTypeHeader = getResponseHeaderValue("content-type");
        if (contentTypeHeader == null) {
            // Not technically legal but some servers don't return a content-type
            return "";
        }
        final int index = contentTypeHeader.indexOf(';');
        if (index == -1) {
            return contentTypeHeader;
        }
        return contentTypeHeader.substring(0, index);
    }
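
To illustrate the kind of change meant here, the following is only a rough sketch, not HtmlUnit code. The class DefaultCharsetSketch, its two methods, and the HTML_FALLBACK constant are hypothetical names, and ISO-8859-1 is assumed as a stand-in for whatever Text.DEFAULT_CHARSET resolves to. The idea is that the value returned by getContentType() could decide which fallback charset applies once the header, the BOM, and the content have given no hint.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper, not part of HtmlUnit.
final class DefaultCharsetSketch {

    // Assumption: stand-in for the existing HTML fallback charset.
    static final Charset HTML_FALLBACK = StandardCharsets.ISO_8859_1;

    private DefaultCharsetSketch() {
    }

    // True for text/xml, application/xml and any "+xml" type such as application/rss+xml.
    static boolean isXmlContentType(final String contentType) {
        if (contentType == null) {
            return false;
        }
        final String type = contentType.trim().toLowerCase(Locale.ROOT);
        return type.equals("text/xml")
                || type.equals("application/xml")
                || type.endsWith("+xml");
    }

    // Fallback used only when header, BOM and content give no encoding hint.
    static Charset defaultCharsetFor(final String contentType) {
        return isXmlContentType(contentType) ? StandardCharsets.UTF_8 : HTML_FALLBACK;
    }
}
```

With this sketch, defaultCharsetFor("application/rss+xml") yields UTF-8, while "text/html" and the empty string returned for a missing header keep the existing fallback.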

Discussion

RBRi - 2012-04-28

Now fixed in SVN. Thanks for reporting.
