Discussion about the future of an international version of Wiki: how do HTTP and PHP handle Unicode, UTF-8, and non-ASCII text?
Mac OS X HFS+ partitions and UDF disks use Unicode to store filenames, but MacCVS Pro needs to be updated to take advantage of this (and of filenames longer than 31 characters). MacCVS Pro does appear to handle ISO-8859-1 characters correctly within documents, just not in filenames yet; there are some notes on the MacCVS SourceForge site about the problem, but no patches or updates yet for the long (Unicode) filenames. The command-line CVS client for Mac OS X also seems to suffer from the same problem, though it is not the latest available version.
It would be unfortunate to have to remove extended ASCII characters from the filenames. As a transition to Unicode, the W3C recommends that URIs be encoded as UTF-8, with each non-ASCII character escaped as a sequence of %hh byte escapes (e.g. %hh%hh for a two-byte character) <http://www.w3.org/International/O-URL-and-ident>, but this is not yet widely implemented; URIs are instead encoded with a subset of ASCII according to RFC 2396. However, just because URIs are encoded with a subset of ASCII doesn't necessarily mean that filenames must be too.
I wonder whether translating filenames would add an extra level of complexity to the Wiki code, and how different web servers will serve pages whose filenames are already encoded. Browsers are supposed to translate the text in the URL field automatically, replacing % sequences with native text and only using the % form when communicating with other hosts, but I believe many do not do so consistently; I have seen some unfortunate web pages with strange double-encoded filenames (%hh encoded a second time as %hh%hh%hh).
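As a small illustration of both points (using a hypothetical page name and PHP's standard rawurlencode()/rawurldecode() functions): a two-byte UTF-8 character becomes two %hh escapes, and escaping the result a second time produces the kind of double-encoded names described above.

    <?php
    // "é" stored as UTF-8 is the two bytes 0xC3 0xA9.
    $name = "Caf\xC3\xA9";           // hypothetical page name "Café"

    $once  = rawurlencode($name);    // "Caf%C3%A9"     (one %hh escape per byte)
    $twice = rawurlencode($once);    // "Caf%25C3%25A9" (the % itself escaped again)

    echo rawurldecode($once);        // restores the original UTF-8 bytes
    ?>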
The best choice for Wiki internationalization may be to convert all text strings and database content to Unicode. I'm sure this would present its own additional problems within PHP. Maybe Unicode could be planned for version 3 or 4 of Wiki.
The filename issue, I think, is not such a big deal in itself. The code already urldecode()s filenames before interpreting them as wiki page names. So renaming the file to the urlencode()d version works just fine.
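A minimal sketch of that mapping, assuming a hypothetical pages/ directory; the helper names are illustrative, not the actual PhpWiki code:

    <?php
    // Hypothetical helpers illustrating the mapping described above.
    function page_to_filename($pagename) {
        // "Überseite" becomes "pages/%DCberseite" (ISO-8859-1 bytes)
        return "pages/" . urlencode($pagename);
    }
    function filename_to_page($filename) {
        // "pages/%DCberseite" becomes "Überseite" again
        return urldecode(basename($filename));
    }
    ?>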
In other words, limiting the PhpWiki source file names to plain ASCII does not limit wiki page names or wiki page content to ASCII.
And you're most certainly correct that moving to Unicode is the way to go. The problem is that you're also right about the problems with PHP. It's really too big a headache (for me at least) to start thinking about moving PhpWiki to Unicode until PHP's regular expression and string functions deal with Unicode well. (I don't think they do yet, but correct me if I'm wrong.)
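To make the concern concrete (using a hypothetical string and PHP's byte-oriented core functions): the standard string functions count and slice bytes, not characters, so UTF-8 data can be miscounted or even cut in the middle of a character.

    <?php
    $s = "na\xC3\xAFve";        // hypothetical string: "naïve" in UTF-8 (6 bytes, 5 characters)

    echo strlen($s);            // 6, because it counts bytes, not characters
    echo substr($s, 0, 3);      // "na" plus half of the "ï": invalid UTF-8
    ?>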
(Another issue is support for Unicode in whatever SQL servers we want to use, i.e., how well are text searches through fields containing UTF-8 data supported?)
So, for now, stock PhpWiki supports only ISO-8859-1 (though one could easily modify the code to work with any eight-bit encoding, I think).
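The modification would presumably amount to little more than making the charset configurable and declaring it to the browser; a hypothetical sketch (the constant name is made up, not PhpWiki's actual configuration):

    <?php
    // Hypothetical: a single configurable charset constant, sent with every page.
    define("CHARSET", "ISO-8859-1");    // or "ISO-8859-2", "KOI8-R", any eight-bit encoding
    header("Content-Type: text/html; charset=" . CHARSET);
    ?>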
I found some information about Unicode and UTF-8 on the php.net web site. There is a function which is supposedly able to set the client encoding to UNICODE, and another page describes an extended module for PHP which supports UTF-8 but not Unicode in general. I'm not sure how this all translates into whether Unicode or UTF-8 can be used with PhpWiki.
http://www.php.net/manual/en/ref.mbstring.php
mbstring is an experimental PHP extension module. It must be enabled via the configure script:
--enable-mbstring : Enable mbstring functions. This option is required to use mbstring functions.
--enable-mbstr-enc-trans : Enable HTTP input character encoding conversion using mbstring conversion engine. If this feature is enabled, HTTP input character encoding may be converted to mbstring.internal_encoding automatically.
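A minimal sketch of how the mbstring functions might be used once the extension is compiled in (assuming UTF-8 as the internal encoding; the sample string is hypothetical):

    <?php
    // Requires PHP built with --enable-mbstring.
    mb_internal_encoding("UTF-8");     // treat strings as UTF-8 internally

    $text = "Caf\xC3\xA9";             // hypothetical page text, UTF-8 bytes for "Café"

    echo mb_strlen($text);                                    // 4, a proper character count
    echo mb_convert_encoding($text, "ISO-8859-1", "UTF-8");   // convert to the wiki's current charset
    ?>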
http://www.php.net/manual/en/function.pg-set-client-encoding.php
pg_set_client_encoding
(PHP 3 CVS only, PHP 4 >= 4.0.3)
pg_set_client_encoding -- Set the client encoding
Description
int pg_set_client_encoding ([int connection, string encoding])
The function sets the client encoding and returns 0 on success or -1 on error.
encoding is the client encoding and can be one of: SQL_ASCII, EUC_JP, EUC_CN, EUC_KR, EUC_TW, UNICODE, MULE_INTERNAL, LATINX (X=1...9), KOI8, WIN, ALT, SJIS, BIG5, WIN1250.
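A small sketch of how this might fit into a PostgreSQL-backed wiki (the connection string, table, and column names are hypothetical):

    <?php
    // Hypothetical connection; requires PHP's pgsql extension.
    $conn = pg_connect("dbname=phpwiki");

    // Ask the server to send and accept text as UNICODE (UTF-8).
    if (pg_set_client_encoding($conn, "UNICODE") == -1) {
        die("could not set client encoding");
    }

    // A text search over UTF-8 data, as discussed above.
    $result = pg_exec($conn, "SELECT pagename FROM wiki WHERE content LIKE '%Caf%'");
    ?>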