From: Dave V. <vie...@ku...> - 2002-07-18 22:12:56
This is what I've discovered so far about encoding characters outside the "normal" 7-bit ASCII range of 0...127, such as characters with diacriticals and Asian multi-byte characters. In all cases below, the server is IIS 5.0 on Windows 2000 with PHP 4.2.1 running as CGI.

1. PHP internally uses 8-bit character encoding for strings. String operations therefore assume single-byte characters, so for strings in a multi-byte encoding, functions such as strlen() count bytes rather than characters and return incorrect results. Fortunately there is a string library called mbstring, which can be set to overload the existing PHP string functions so that they return the correct string length and correctly handle a few other critical operations, such as extracting substrings and concatenating strings. So this library, 'php_mbstring.dll', will be a requirement for running a DiGIR provider. This should not be a problem on standard win32 PHP installations, but it could cause compatibility problems on systems where PHP is typically compiled from source for each installation, since the extension then has to be enabled at build time. Not really a big deal in the grand scheme of things, just something to keep track of when installing providers.

2. PHP does correctly pass through UTF-8 (an encoding of the ISO 10646 character set). When PHP code files are stored as UTF-8, the output from test routines that include characters from a number of different character sets is transmitted correctly by PHP. For example, a string constant within PHP that contains a mix of Japanese and Latin characters is correctly transmitted as UTF-8 to an XML output stream, and hence is correctly interpreted and rendered by the XML parser (in this case IE 6).

3. In a test where the same string constants as in (2) are stored in an Access 2002 database (which supports Unicode), the strings are correctly rendered when accessed from Active Server Pages through ADO, via either OLEDB or ODBC, provided the response codepage is set to 65001 (UTF-8) and the .asp file is stored in UTF-8 format.

4. In the same test as (3), but using PHP code, the strings retrieved from the database via the PHP ADOdb library or directly through the PHP COM interface are not correctly rendered. Characters with diacriticals are correctly retrieved from Access 97 (non-Unicode) or Access 2002 (Unicode) records, but the Asian multi-byte characters are not, and are instead replaced with what appears to be a default character, "?".

I'm continuing to work on the handling of multi-byte characters by PHP. Once a solution has been found (it looks like a combination of some PHP configuration and string translation in the scripts will resolve it), I'll provide some test code that needs to be run on the different flavors of OS, web server, and database that we are using.

Regards,
Dave V.
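
P.S. A minimal sketch of the byte-versus-character issue from point 1, for anyone who wants to see it on their own installation. It assumes the mbstring extension is loaded, that function overloading is switched off (so strlen() still counts bytes), and that the file is saved as UTF-8. The helper name string_lengths() and the sample string are made up for illustration; the mb_* functions are called explicitly here rather than through overloading.

```php
<?php
// Hypothetical helper (not part of any DiGIR code): compares the byte-oriented
// result of strlen() with the character-oriented results from mbstring.
function string_lengths($s)
{
    return array(
        'bytes'  => strlen($s),                  // counts bytes
        'chars'  => mb_strlen($s, 'UTF-8'),      // counts characters
        'first3' => mb_substr($s, 0, 3, 'UTF-8') // first three characters
    );
}

// Three Japanese characters (3 bytes each in UTF-8), a space,
// and "café" with a precomposed é (2 bytes in UTF-8).
$sample = "日本語 café";
$r = string_lengths($sample);

echo $r['bytes'], "\n";   // 15 when the file is saved as UTF-8
echo $r['chars'], "\n";   // 8
echo $r['first3'], "\n";  // 日本語
```

If the two counts come out equal for a string like this, either the file was not saved as UTF-8 or overloading is already active.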