#39 Handle UTF-8 correctly in Simplepage::normalise

open
nobody
None
5
2009-12-11
2009-12-11
No

Simplepage::normalise normalizes HTML, but because it uses "preg_replace('#\s+#')" it fails on a non-breaking space (U+00A0) encoded in UTF-8 (the two bytes 194 and 160, i.e. 0xC2 0xA0): preg_replace removes only the second byte.
The attached patch fixes this if mbstring is available and $text is valid UTF-8 (according to mb_check_encoding).
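
For illustration only, the approach described above might look roughly like the sketch below. It is not the attached patch: the function name collapse_whitespace and the single-space replacement are assumptions made for this example, and the real Simplepage::normalise may do more. The idea is to take the UTF-8 aware path (the "u" pattern modifier) only when mbstring is present and the text validates as UTF-8, and to fall back to the original byte-oriented pattern otherwise.

```php
<?php
// Hypothetical helper (not the attached patch): collapse runs of whitespace,
// treating a UTF-8 encoded non-breaking space as whitespace where possible.
function collapse_whitespace($text)
{
    // Use the UTF-8 aware pattern only if mbstring is available and the
    // input is valid UTF-8, mirroring the conditions named in the ticket.
    if (function_exists('mb_check_encoding') && mb_check_encoding($text, 'UTF-8')) {
        // The "u" modifier makes PCRE read the subject as UTF-8, so U+00A0
        // is matched as one character instead of the two bytes 0xC2 0xA0.
        // U+00A0 is listed explicitly because \s may not cover it.
        $result = preg_replace('#[\s\x{00A0}]+#u', ' ', $text);
        if ($result !== null) {
            return $result;
        }
    }

    // Fallback: the original byte-oriented behaviour.
    return preg_replace('#\s+#', ' ', $text);
}

// "a", a UTF-8 non-breaking space, an ordinary space, then "b".
var_dump(collapse_whitespace("a\xC2\xA0 b") === 'a b'); // bool(true)
```

Listing U+00A0 explicitly keeps the sketch independent of whether the PCRE build treats it as whitespace under the "u" modifier.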

It also adds tests.

Discussion

  • daniel hahler - 2009-12-11

    These are really several patches. I've split them into separate commits locally and will provide a link to them later, if there's any interest after all. ;)

     
