This report contains a patch against SVN Trunk fixing various bugs related to character set switching.
The problem is that HTMLParser keeps switching charsets whenever it parses a META tag containing http-equiv="Content-Type". For example, http://www.microsoft.com triggers this problem, as does http://www.tvix.cn/play.php?v=VKm2qLblS1k
This patch makes HTMLParser behave as follows:
1. It uses the character encoding provided by the HTTP headers (most web servers already base this header on the values defined in META tags in the HTML document).
2. If the parser sees a META declaration defining another charset, it uses that charset only if one of the following holds:
- The HTTP headers do not define a charset (the charset defaults to ISO-8859-1 in that case).
- The HTTP headers define ISO-8859-1 as the charset and the META tag defines another charset (this is recommended in the W3C specification for HTML 4.01).
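
The rule above can be sketched in a few lines of Java. This is only an illustration of the decision logic, not the code in the patch: the class CharsetChoice and its methods charsetOf and choose are hypothetical names and are not part of the HTMLParser API. The sketch assumes both inputs are raw Content-Type values (one from the HTTP header, one from the META tag) and that a missing value is passed as null.

```java
import java.util.Locale;

/** Hypothetical helper (not part of HTMLParser) illustrating the charset rule. */
final class CharsetChoice {
    /** Charset assumed when the HTTP headers do not name one. */
    static final String DEFAULT_CHARSET = "ISO-8859-1";

    /** Returns the charset parameter of a Content-Type value, or null if absent. */
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = p.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    /** Picks the charset to parse with, given the header and META Content-Type values. */
    static String choose(String headerContentType, String metaContentType) {
        String header = charsetOf(headerContentType);
        String meta = charsetOf(metaContentType);

        // The META charset may win only when the header is missing or is the default.
        boolean headerIsDefault = header == null || header.equalsIgnoreCase(DEFAULT_CHARSET);
        if (headerIsDefault) {
            return meta != null ? meta : DEFAULT_CHARSET;
        }
        // Any other header charset is kept, whatever the META tag says.
        return header;
    }

    public static void main(String[] args) {
        System.out.println(choose("text/html", "text/html; charset=UTF-8"));
        System.out.println(choose("text/html; charset=UTF-8", "text/html; charset=GB2312"));
    }
}
```

With this rule a page can change the charset at most once, from the ISO-8859-1 default to the META value, so the parser no longer keeps switching back and forth.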
This patch solves bug #1592517.
It also solves the problem I describe on the mailing list: