#9 Patch stopping HTMLParser from infinitely switching charset

open
nobody
None
5
2007-01-25
2007-01-25
No

This report contains a patch against SVN trunk that fixes several bugs related to character set switching.
The problem is that HTMLParser keeps switching charsets when it parses a META tag containing http-equiv="Content-Type". For example, http://www.microsoft.com triggers this bug, as does http://www.tvix.cn/play.php?v=VKm2qLblS1k

With this patch, HTMLParser behaves as follows:
1. It uses the character encoding provided by the HTTP headers. (Most web servers already base this header on the values defined in META tags in the HTML document.)
2. If the parser sees a META declaration defining another charset, it uses that charset only if one of the following holds:
- The HTTP headers do not define a charset (the charset defaults to ISO-8859-1 in that case).
- The HTTP headers define ISO-8859-1 as the charset and the META tag defines another charset (this is recommended in the W3C specifications for HTML 4.01).
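
The decision rules above could be sketched roughly as follows. This is only an illustration of the logic as described in this report, not the code from the attached patch; the class name CharsetChoice and the methods charsetOf and chooseCharset are hypothetical and are not part of the HTMLParser API.

```java
import java.util.Locale;

// Hypothetical helper, NOT part of HTMLParser: illustrates the rules in this report.
final class CharsetChoice {
    // Default assumed when the HTTP headers carry no charset parameter.
    static final String DEFAULT_CHARSET = "ISO-8859-1";

    /** Returns the charset parameter of a Content-Type value, or null if there is none. */
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = p.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    /**
     * Picks the charset to decode with, given the Content-Type from the HTTP
     * headers and the content attribute of the META http-equiv tag (either may be null).
     */
    static String chooseCharset(String httpContentType, String metaContentType) {
        String http = charsetOf(httpContentType);
        String meta = charsetOf(metaContentType);
        if (http == null) {
            // Headers define no charset: use META if present, else the default.
            return meta != null ? meta : DEFAULT_CHARSET;
        }
        if (http.equalsIgnoreCase(DEFAULT_CHARSET) && meta != null) {
            // Headers say ISO-8859-1: the META declaration may override it.
            return meta;
        }
        // Any other charset in the headers wins over the META tag.
        return http;
    }
}
```

Because the result depends only on the two declared values, evaluating the rule again for the same page gives the same answer, so under these assumptions a parser applying it has no reason to keep switching charsets.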

This patch solves bug #1592517, as well as the problem I described on the mailing list:
http://article.gmane.org/gmane.comp.parsers.htmlparser.user/834/match=

Discussion

  • Martin Sturm

    Martin Sturm - 2007-01-25

    Patch solving double encoding error

     
  • Martin Sturm

    Martin Sturm - 2007-01-25
    • summary: Patch stopping HTMLParser from infinetely switching charset --> Patch stopping HTMLParser from infinitely switching charset
     
  • Martin Sturm

    Martin Sturm - 2007-01-25

    Patch for version 1.6 of HTMLParser

     
  • Martin Sturm

    Martin Sturm - 2007-01-25

    I've also created a patch for version 1.6 of HTMLParser, because I use that version in a project. Other people may find this patch useful as well.
    File Added: fixCharset1.6.patch

     
  • Derrick Oswald

    Derrick Oswald - 2007-03-04

    Applied patch to version 2.0.

     
