From: Yves K. <ph...@fi...> - 2006-12-09 12:18:05
Hi Matt

I'm not familiar enough with the fallout code to present a real fix, but I played around a bit with the conversion and the UTF-8 problem. The answers I found (maybe not complete) are the following:

The database table's charset and collation did not affect my results. Your conversion script worked almost fine. The characters were stored in the database whether the table used the latin1_german2_ci or the utf8_unicode_ci collation. But the text displayed on the web page was garbage. I found that the page displayed OK if I changed the browser to ISO-8859-1 encoding?! So I started to investigate the connection and the output (layout).

Things I tried:

I sent "SET NAMES utf8" to MySQL directly after 'connect', in convert/class/Convert.php and also in pear/DB/mysql.php. This is probably not needed (but I have not investigated that yet).

I added this line:

    $text = utf8_encode($text); //yok

at line 66 in layout/class/Layout.php, before:

    Layout::_loadBox($text, $module, $content_var);

After that, the output on the website (only webpages tested) was fine.

But PhpWebSite is still not fixed. If I now enter a text inside phpws (webpages), its output is garbage. The database content shows an HTML entity (something like &auml;) instead of the real lowercase a-umlaut (ä). This makes me believe the content is HTML-encoded and not UTF-8.

Conclusion:

The database does not seem to be the problem if MySQL is 4.1.x or newer. But the MySQL server has to know which encoding is used on the client side. It handles the tables and their collation whatever the setup. This is IMHO good news, because a lot of users out there get a preconfigured DB from their host and are not able to change the charset or the collation. Since the server knows the client encoding, it translates the DB content to it (e.g. UTF-8).

What we have to do now is make sure that all incoming and outgoing content inside phpws is also encoded as correct UTF-8. Maybe we have to use a function to check whether the content is already encoded?! For example
(not mine, from somewhere on the net):

    /**
     * Checks if String is UTF-8 Encoded
     * @param string $string string to check
     * @return boolean
     */
    function is_utf8($string)
    {
        // preg_match() returns 1, 0 or false, so compare to get a real boolean
        return preg_match('%^(?:
              [\x09\x0A\x0D\x20-\x7E]            # ASCII
            | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
            | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
            | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
            | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
            | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
            | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
            | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
        )*$%xs', $string) === 1;
    }

and convert if not:

    /**
     * Encodes String to UTF8
     * @param string $string
     * @return string
     */
    function cms_utf8_encode($string)
    {
        if (is_utf8($string)) {
            return $string;
        }
        if (function_exists('mb_convert_encoding')) {
            // Name the source charset explicitly; ISO-8859-1 is what
            // utf8_encode() assumes, so both branches behave the same.
            return mb_convert_encoding($string, 'UTF-8', 'ISO-8859-1');
        }
        return utf8_encode($string);
    }

We should also double-check the headers and meta tags of the output, e.g.:

    <meta http-equiv="content-type" content="application/xhtml+xml;charset=utf-8" />

    header('content-type: text/html; charset=utf-8');

(whichever content type we pick, the meta tag and the HTTP header should agree), and maybe also in CSS???

    @charset "utf-8";

And last but not least, to work with forms, the charset should be defined:

    <form accept-charset="utf-8" method= ...>

All this is 'only' some kind of brainstorming. But maybe it shows the direction to go in order to handle different languages, charsets and encodings...

Regards
Yves
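
PS: A minimal sketch of why the "check before converting" step matters. The helper name to_utf8_once() is made up for illustration and is not part of phpws. It assumes that the mbstring extension is loaded and that anything which is not valid UTF-8 is ISO-8859-1, which may not hold for every installation.

```php
<?php
// Hypothetical helper (not phpws code): convert to UTF-8 only when needed.
// Assumption: input that is not valid UTF-8 is ISO-8859-1.
function to_utf8_once($text)
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text; // already valid UTF-8, leave it untouched
    }
    return mb_convert_encoding($text, 'UTF-8', 'ISO-8859-1');
}

$latin1 = "M\xE4rz";     // "März" as ISO-8859-1 bytes (a-umlaut = 0xE4)
$utf8   = "M\xC3\xA4rz"; // "März" as UTF-8 bytes (a-umlaut = 0xC3 0xA4)

// Both inputs end up as the same UTF-8 byte sequence.
var_dump(to_utf8_once($latin1) === $utf8);
var_dump(to_utf8_once($utf8) === $utf8);

// Converting text that is already UTF-8 a second time encodes it twice:
// 0xC3 0xA4 becomes 0xC3 0x83 0xC2 0xA4, which a browser shows as "Ã¤".
$double = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
var_dump(bin2hex($double));
```

This is the kind of garbage I would expect if we call utf8_encode() blindly in Layout.php once the rest of phpws already delivers UTF-8, so the conversion should sit behind a check like this.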