The various HTML and XHTML standards allow a document to supply its character encoding in a <meta> tag for cases where the server does not send one in the Content-Type header. Checkbot 1.80 ignores Content-Types specified in <meta> tags and, for documents that are not ISO-8859-1 encoded, generates "Parsing of undecoded UTF-8 will give garbage when decoding entities at /usr/local/lib/perl5/site_perl/5.8.8/HTTP/Message.pm line 264" warnings when the server returns no Content-Type.
I propose a patch similar to the following to look for a Content-Type meta tag to determine a default charset before attempting to retrieve the decoded document:
--- checkbot 2008-10-15 08:55:01.000000000 -0400
+++ checkbot.new 2009-07-24 13:03:47.403214000 -0400
@@ -1180,7 +1180,17 @@
# If charset information is missing then decoded_content doesn't
# work. Fall back to content in this case, even though that may lead
# to charset warnings. See bug 1665075 for reference.
- my $content = $response->decoded_content || $response->content;
+
+ # See if the document contains a default character set that
+ # should be used when attempting to decode the document.
+ my $content = $response->content;
+ if ($content =~ m/<meta\s+([^>]*http-equiv=(?:"|')Content-Type(?:"|')[^>]*)>/si
+ && $1 =~ m/charset=([^\s"']+)/si) {
+ $content = $response->decoded_content('default_charset' => $1) || $response->content;
+ }
+ else {
+ $content = $response->decoded_content || $response->content;
+ }
$p->parse($content);
$p->eof;
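To illustrate the sniffing logic in isolation, here is a minimal, self-contained Perl sketch of the same two-step match (with the quote alternations written as non-capturing `(?:...)` groups), run against a hypothetical document fragment that carries its charset only in a <meta> tag:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical document fragment: no server Content-Type, charset only
# declared in a <meta http-equiv> tag.
my $content = q{<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head><body>...</body></html>};

# Step 1: capture the attributes of the Content-Type <meta> tag into $1.
# Step 2: pull the charset parameter out of those attributes.
if ($content =~ m/<meta\s+([^>]*http-equiv=(?:"|')Content-Type(?:"|')[^>]*)>/si
    && $1 =~ m/charset=([^\s"']+)/si) {
    print "charset: $1\n";   # prints "charset: UTF-8"
}
```

The captured value can then be handed to HTTP::Message's decoded_content() as its default_charset option, which is only consulted when the response itself supplies no charset.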
It would also be handy if checkbot provided a command-line argument for specifying a default character set when no character set is provided by the server or within the document.
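Such an option might look like the following sketch. The option name --defaultcharset is an assumption for illustration, not an existing checkbot flag, and it presumes checkbot's existing Getopt-based option handling:

```perl
use strict;
use warnings;
use Getopt::Long;

# Hypothetical --defaultcharset option (name is an assumption);
# fall back to HTTP's historic default when the user gives none.
my $default_charset = 'ISO-8859-1';
GetOptions('defaultcharset=s' => \$default_charset);

# Later, when decoding a response that names no charset itself:
# my $content = $response->decoded_content(default_charset => $default_charset)
#     || $response->content;
```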