This report contains a patch against SVN Trunk fixing various bugs related to character set switching.
The problem is that HTMLParser keeps switching charsets when it parses a META tag containing http-equiv="Content-Type". For example, http://www.microsoft.com triggers this problem, as does http://www.tvix.cn/play.php?v=VKm2qLblS1k
This patch makes HTMLParser behave as follows:
1. It uses the character encoding provided by the HTTP headers. (Most web servers already base this header on the values defined in the META tags of the HTML document.)
2. If the parser sees a META declaration defining another charset, it uses that charset only if one of the following holds:
- the HTTP headers do not define a charset (the charset defaults to ISO-8859-1 in that case)
- the HTTP headers define ISO-8859-1 as the charset and the META tag defines another charset (this is recommended in the W3C specification for HTML 4.01).
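The rule above can be sketched as a small, standalone Java method. This is only an illustration of the decision logic, not the code in the patch: the class name CharsetChoice and the method choose are hypothetical and are not part of the HTMLParser API. The sketch assumes that charset names are compared case-insensitively and that an empty name means "not defined"; it does not resolve aliases such as "latin1".

```java
import java.util.Locale;

// Hypothetical helper, NOT part of HTMLParser: illustrates the rule only.
final class CharsetChoice {
    // Default assumed when the HTTP headers do not define a charset.
    static final String DEFAULT_CHARSET = "ISO-8859-1";

    // headerCharset: charset from the HTTP Content-Type header, or null.
    // metaCharset:   charset from the META http-equiv tag, or null.
    static String choose(String headerCharset, String metaCharset) {
        String header = normalize(headerCharset);
        String meta = normalize(metaCharset);
        if (header == null) {
            // No charset in the headers: the META charset wins if present.
            return meta != null ? meta : DEFAULT_CHARSET;
        }
        if (header.equals(DEFAULT_CHARSET) && meta != null) {
            // Headers say ISO-8859-1: the META charset overrides it.
            return meta;
        }
        // Any other header charset takes precedence over META.
        return header;
    }

    // Trim and upper-case a charset name; treat empty names as undefined.
    private static String normalize(String name) {
        if (name == null) {
            return null;
        }
        String trimmed = name.trim();
        return trimmed.isEmpty() ? null : trimmed.toUpperCase(Locale.ROOT);
    }
}
```

With this sketch, CharsetChoice.choose("UTF-8", "GB2312") keeps the header value, while CharsetChoice.choose("ISO-8859-1", "GB2312") and CharsetChoice.choose(null, "GB2312") both switch to the META value.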
This patch solves bug #1592517 and the problem I describe on the mailing list:
http://article.gmane.org/gmane.comp.parsers.htmlparser.user/834/match=
Patch solving double encoding error
Patch for version 1.6 of HTMLParser
Logged In: YES
user_id=510190
Originator: YES
I've also created a patch for version 1.6 of HTMLParser, because I'm using that version in a project. Other people may find this patch useful as well.
File Added: fixCharset1.6.patch
Logged In: YES
user_id=605407
Originator: NO
Applied patch to version 2.0.