Simplepage::normalise normalizes HTML, but because it uses "preg_replace('#\s+#')" it fails on a non-breaking space (U+00A0) encoded in UTF-8: the character is the two bytes 194 and 160 (0xC2 0xA0), and preg_replace removes only the second byte, leaving a stray 0xC2 behind.
The attached patch fixes this if the mbstring extension is available and $text is valid UTF-8 (according to mb_check_encoding).
It also adds tests.
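
For illustration only, the idea could be sketched roughly as below. This is not the code from the attached patch: the function name normaliseWhitespace is hypothetical, and treating U+00A0 as collapsible whitespace is an assumption. The sketch switches to a UTF-8-aware pattern only when mbstring confirms the input is valid UTF-8, and otherwise keeps the old byte-wise behaviour.

```php
<?php
// Hypothetical helper, NOT the code from the attached patch.
function normaliseWhitespace($text) {
    // Only take the UTF-8 path if mbstring is loaded and the text is valid UTF-8.
    $isUtf8 = function_exists('mb_check_encoding')
        && mb_check_encoding($text, 'UTF-8');
    if ($isUtf8) {
        // With /u the pattern works on code points, so U+00A0
        // (bytes 0xC2 0xA0) is matched as one character, never split.
        // Assumption: a non-breaking space is collapsed like any other space.
        $result = preg_replace('#[\s\x{00A0}]+#u', ' ', $text);
        if ($result !== null) {
            return trim($result);
        }
    }
    // Fallback: the original byte-wise pattern.
    return trim(preg_replace('#\s+#', ' ', $text));
}
```

With mbstring loaded, normaliseWhitespace("a\xC2\xA0 \n b") should come back as "a b" with no orphaned 0xC2 byte.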
Patch against SVN trunk
Improved patch: this uses the page charset, extracted via getMimeType (which also looks at the raw page).
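
A rough sketch of the charset extraction, assuming getMimeType yields a Content-Type style string such as "text/html; charset=utf-8" (that return format is an assumption here, and charsetFromMimeType is a hypothetical name, not part of the patch):

```php
<?php
// Hypothetical helper: pulls the charset parameter out of a
// Content-Type style string. Returns false when none is present.
function charsetFromMimeType($mimeType) {
    // Accepts optional whitespace and an optional opening quote around the value.
    if (preg_match('#;\s*charset\s*=\s*"?([^\s;"]+)#i', $mimeType, $m)) {
        return strtoupper($m[1]);
    }
    return false;
}
```

The result could then decide whether the UTF-8-aware normalisation applies, instead of guessing from the bytes alone.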
These are really several patches. I've split them into separate commits locally, and will provide a link to them later if there's any interest after all. ;)
Additional patch, to support charset with tags (and for getUrlsByLabel)