#39 Handle UTF-8 correctly in Simplepage::normalise

open
nobody
None
5
2009-12-11
2009-12-11
daniel hahler
No

Simplepage::normalise normalizes HTML, but because it uses "preg_replace('#\s+#')" it fails on a non-breaking space (U+00A0) encoded in UTF-8: the character is stored as the two bytes 194 and 160 (0xC2 0xA0), and preg_replace removes only the second byte.
The attached patch fixes this if mbstring is available and $text is valid UTF-8 (according to mb_check_encoding).
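
For readers without the attachment, here is a minimal sketch of the idea. It is not the attached patch. The function name normalise_whitespace is hypothetical, and the preg_match('//u') fallback used when mbstring is missing is an assumption of this sketch (the patch is described as requiring mbstring).

```php
<?php
// Hypothetical helper, not the attached patch: collapse runs of whitespace,
// treating U+00A0 as whitespace when $text is valid UTF-8.
function normalise_whitespace($text)
{
    // Validate with mbstring when present. The preg_match('//u') fallback
    // is an assumption of this sketch; it returns 1 only for valid UTF-8.
    $isUtf8 = function_exists('mb_check_encoding')
        ? mb_check_encoding($text, 'UTF-8')
        : (preg_match('//u', $text) === 1);

    if ($isUtf8) {
        // The /u modifier makes PCRE work on code points, so the byte pair
        // 0xC2 0xA0 is matched and replaced as one character.
        $result = preg_replace('/[\s\x{00A0}]+/u', ' ', $text);
        if ($result !== null) {
            return $result;
        }
    }

    // Fallback: byte-oriented matching, as in the expression quoted above.
    return preg_replace('/\s+/', ' ', $text);
}

// Print the result as hex so that any stray 0xC2 byte would be visible.
echo bin2hex(normalise_whitespace("a\xC2\xA0 b")), "\n";
```

Listing U+00A0 explicitly in the character class avoids depending on whether \s covers it in a given PCRE build.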

It also adds tests.

Discussion

  • daniel hahler
    2009-12-11

    Improved patch: this one uses the page charset, extracted via getMimeType (which also looks at the raw page).

     
  • daniel hahler
    2009-12-11

    These are really several patches. I've split them into separate commits locally and will provide a link to them later, if there's any interest after all. ;)