#191 &nbsp; and UTF-8 encoding

open
Marcus Baker
Web tester (52)
5
2009-12-27
2009-12-27
Anonymous
No

"assertText()" with &nbsp; between Cyrillic symbols failed on Windows (but worked on Linux).
The error message shows a broken string with malformed UTF-8 characters.
Problem code (I downloaded the project trunk, because version 1.0.1 behaves equally badly):

class SimplePage {
static function normalise($html) {
//Line 538:
$text = preg_replace('#\s+#', ' ', $text);

1) I added the u modifier ('#\s+#u') as a quick fix for my &nbsp; problem. But
class TestOfLiveBrowser extends UnitTestCase {
function testRelativeEncodedLinkFollowing() {
now fails, so the problem with whitespace in different charsets goes deeper.
There are other places in the code where string operations are neither binary-safe nor charset-aware.
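
One plausible explanation, not confirmed in this ticket, is that on the failing platform the byte-wise \s class treats the byte 0xA0 as whitespace. That byte is the second half of the UTF-8 no-break space (C2 A0) and also of some Cyrillic letters such as "Р" (D0 A0). The sketch below emulates this by listing \xA0 explicitly, which is an assumption made for illustration. isValidUtf8() is a hypothetical helper, not part of SimpleTest.

```php
<?php
// Hypothetical helper: true when $s is well-formed UTF-8.
// preg_match() with the u modifier returns false on malformed input.
function isValidUtf8($s) {
    return preg_match('//u', $s) === 1;
}

// "Р" (U+0420) is the bytes D0 A0; the no-break space (U+00A0) is C2 A0.
$text = "\xD0\xA0\xC2\xA0\xD0\xA0";

// Assumption: emulate single-byte tables that class 0xA0 as whitespace
// by adding that byte to the class. Every A0 byte becomes 0x20,
// leaving orphaned lead bytes behind.
$byteWise = preg_replace('#[\s\xA0]+#', ' ', $text);

// An ASCII-only whitespace class never touches bytes >= 0x80,
// so multi-byte sequences stay intact.
$asciiOnly = preg_replace('#[\040\n\r\t]+#', ' ', $text);

var_dump(isValidUtf8($text));      // input is well-formed
var_dump(isValidUtf8($byteWise));  // byte-wise replacement broke it
var_dump($asciiOnly === $text);    // ASCII-only class left it unchanged
```

An ASCII-only class keeps the string well-formed, but it also means a no-break space is no longer collapsed like ordinary whitespace.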

Discussion

  • $text = preg_replace('#[\040\n\r\t]+#', ' ', $text);
    is a quick fix to pass tests

     
  • truetype76
    2010-04-28

    The problem may come from parser.php: the SimpleHtmlSaxParser::decodeHtml() method runs with ISO-8859-1.
    http://de2.php.net/manual/en/function.html-entity-decode.php
    "The ISO-8859-1 character set is used as default for the optional third charset. This defines the character set used in conversion."

    A solution may be to pass a character set into parser.php. But where should it come from? From reporter.php? From php.ini? The best solution would of course be to parse it from the page under test...
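
A minimal sketch of that idea, assuming the charset is taken from the Content-Type header or a <meta> tag of the page under test and then handed to html_entity_decode(). Both detectCharset() and decodeHtmlWithCharset() are hypothetical names used for illustration. They are not existing SimpleTest methods, and the ISO-8859-1 fallback only mirrors the default quoted above.

```php
<?php
// Hypothetical helper: find a charset in the Content-Type header value,
// then in a <meta> tag, otherwise fall back to $default.
function detectCharset($contentType, $html, $default = 'ISO-8859-1') {
    if (preg_match('#charset\s*=\s*["\']?([\w-]+)#i', $contentType, $m)) {
        return strtoupper($m[1]);
    }
    if (preg_match('#<meta[^>]+charset\s*=\s*["\']?([\w-]+)#i', $html, $m)) {
        return strtoupper($m[1]);
    }
    return $default;
}

// Hypothetical variant of decodeHtml() that takes the charset explicitly
// instead of relying on the function's default.
function decodeHtmlWithCharset($html, $charset) {
    return html_entity_decode($html, ENT_QUOTES, $charset);
}

$charset = detectCharset('text/html; charset=utf-8', '');
$decoded = decodeHtmlWithCharset('a&nbsp;b', $charset);
echo bin2hex($decoded), "\n"; // 61c2a062: &nbsp; became the UTF-8 pair C2 A0
```

With ISO-8859-1 the same entity decodes to the single byte A0, which is not valid UTF-8 on its own. That would be consistent with the malformed characters in the error message.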