[Digir-dev] php and support of multi-byte characters

This is what I've discovered so far with respect to encoding characters
outside of the "normal" 7-bit ASCII range of 0...127, such as characters
with diacritics and Asian multi-byte characters.

In all cases below, the server is IIS 5.0 on w2k with PHP 4.2.1 running
as CGI.

1.  PHP internally treats strings as sequences of 8-bit bytes.  This
means that string operations assume single-byte characters, and
hence for strings that use a multi-byte encoding, operations such as
strlen() return byte counts rather than character counts.  Fortunately
there is an extension called mbstring which can be set to overload the
existing PHP string functions so that they do return the correct length
of strings, and which also covers a few other critical operations such
as extracting substrings and searching within strings.
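
To make the difference in (1) concrete, here is a minimal sketch.  It
assumes the mbstring extension is loaded and that function overloading
is switched off, so strlen() and substr() still work on bytes.  The
test string and the variable names are mine, for illustration only.

```php
<?php
// "日本語" written as raw UTF-8 bytes: three characters, three bytes each.
$text = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";

// The byte-oriented function counts bytes, not characters.
$byteCount = strlen($text);               // 9

// The mbstring function counts characters when told the encoding.
$charCount = mb_strlen($text, 'UTF-8');   // 3

// substr() cuts at a byte offset and can split a character in half;
// mb_substr() cuts at a character offset.
$brokenCut = substr($text, 0, 2);             // 2 bytes of a 3-byte character
$firstTwo  = mb_substr($text, 0, 2, 'UTF-8'); // the first two characters

echo "bytes: $byteCount, characters: $charCount\n";
echo "first two characters: $firstTwo\n";
```

Passing the encoding explicitly on each call avoids depending on
whatever internal encoding happens to be configured on a given server.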

So the mbstring extension ('php_mbstring.dll' on win32) will be a
requirement for running a DiGIR provider.  This should not be a problem
on standard win32 PHP installations, but could cause some compatibility
problems on systems where PHP is typically compiled from source for each
installation, since mbstring has to be enabled at build time.  Not
really a big deal in the grand scheme of things, but just something to
keep track of for installing providers.
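
A provider script could check for the extension at startup rather than
fail later in some string operation.  This is only a sketch:
missing_extensions() is a hypothetical helper of mine, not part of
DiGIR or PHP, built on the standard extension_loaded() function.

```php
<?php
// Hypothetical helper: return the names from $required that are not loaded.
function missing_extensions($required)
{
    $missing = array();
    foreach ($required as $name) {
        if (!extension_loaded($name)) {
            $missing[] = $name;
        }
    }
    return $missing;
}

// mbstring is the only requirement discussed so far.
$missing = missing_extensions(array('mbstring'));
if (count($missing) > 0) {
    echo "Missing PHP extensions: " . implode(', ', $missing) . "\n";
} else {
    echo "All required PHP extensions are loaded.\n";
}
```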

2. PHP does correctly pass through UTF-8 encoded text (UTF-8 being an
encoding of ISO 10646).  When PHP code files are stored as UTF-8, the
output from test routines that include characters from a number of
different character sets is correctly transmitted by PHP.  So for
example, a string constant within PHP that contains a mix of Japanese
and Latin characters is correctly transmitted as UTF-8 to an XML output
stream, and hence is correctly interpreted and rendered by the XML
parser (in this case IE 6).
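
A test along the lines of (2) might look like the sketch below.  The
function name build_test_xml() and the <test> element are assumptions
of mine, not the actual test routine.  The string is written as byte
escapes so the example does not depend on how the file itself is saved.

```php
<?php
// Hypothetical helper: wrap a UTF-8 string in a minimal XML document.
function build_test_xml($text)
{
    // Escape markup characters; multi-byte UTF-8 sequences pass through.
    $escaped = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
    return '<?xml version="1.0" encoding="UTF-8"?>' . "\n"
         . '<test>' . $escaped . '</test>' . "\n";
}

// "café 日本語" as raw UTF-8 bytes: Latin with a diacritic plus Japanese.
$mixed = "caf\xC3\xA9 \xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";

// Declare the encoding in the HTTP header as well as in the XML prolog.
header('Content-Type: text/xml; charset=UTF-8');
echo build_test_xml($mixed);
```

PHP does no conversion here; the bytes go out exactly as stored, and it
is the two encoding declarations that tell the client to read them as
UTF-8.') !== false);
assert(strpos(build_test_xml($mixed), '<test>' . $mixed . '</test>') !== false);
assert(strpos(build_test_xml($mixed), 'encoding="UTF-8"') !== false);
</test>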

3. In a test where the same character string constants from (2) are
stored in an Access 2002 database (which supports Unicode), the strings
are correctly rendered when accessed from Active Server Pages using ADO
via OLEDB or ODBC, provided the response codepage is set to 65001
(UTF-8) and the .asp file is stored in UTF-8 format.

4. In the same test as (3) except using PHP code, the strings retrieved
from the database via the PHP ADOdb library or directly using the PHP
COM interface are not correctly rendered.  Characters with diacritics
are correctly retrieved from Access 97 (non-Unicode) or Access 2002
(Unicode) records, but the Asian multi-byte characters are not, and are
instead replaced with what appears to be a default character "?".
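
The pattern in (4) is what one would expect if the strings were being
converted to a single-byte Western codepage somewhere between the
database and PHP.  That is only a guess at the cause, not something I
have confirmed.  The sketch below reproduces the symptom with mbstring,
using ISO-8859-1 as a stand-in for whatever codepage is really involved
and setting the substitute character to "?" explicitly.

```php
<?php
// Use "?" (0x3F) for characters the target encoding cannot represent.
mb_substitute_character(0x3F);

// "café 日本" as raw UTF-8 bytes.
$mixed = "caf\xC3\xA9 \xE6\x97\xA5\xE6\x9C\xAC";

// Convert to a single-byte encoding: the accented letter has a slot in
// ISO-8859-1, the two Japanese characters do not.
$latin1 = mb_convert_encoding($mixed, 'ISO-8859-1', 'UTF-8');

// Converting back to UTF-8 cannot recover what was already lost.
$roundTrip = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

echo $roundTrip . "\n";   // café ??
```

If this is what is happening, the fix has to be made before the
conversion takes place, since the original characters are gone by the
time the script sees the string.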

I'm continuing with trying to sort out the handling of multibyte
characters by PHP.  Once a solution has been sorted out (looks like a
combination of some PHP configuration and string translation in the
scripts will resolve it), I'll provide some test code that needs to be
run on the different flavors of OS, web server, and databases that we
are using.

Regards,
  Dave V.