
WikiInternationalization

2001-11-09
2012-10-11
  • Carsten Klapp

    Carsten Klapp - 2001-11-09

    Discussion about the future of an international version of Wiki: how do HTTP and PHP handle Unicode, UTF-8 and non-ASCII characters?

     
    • Carsten Klapp

      Carsten Klapp - 2001-11-09

      Mac OS X HFS+ partitions and UDF disks store filenames in Unicode, but MacCVS Pro needs to be updated to take advantage of this (as well as of filenames longer than 31 characters). MacCVS Pro does appear to handle iso-8859-1 characters correctly within documents, just not in filenames yet. There are some notes on the MacCVS SourceForge project about the problem, but no patches or updates yet for long (Unicode) filenames. The command-line CVS client for Mac OS X also seems to suffer from the same problem, but it is not the latest available version.

      It would be unfortunate to have to remove extended ASCII characters from the filenames. As a transition to Unicode, the W3C recommends that URIs be encoded as UTF-8 and that each resulting non-ASCII byte then be escaped as %hh, so a two-byte character becomes %hh%hh <http://www.w3.org/International/O-URL-and-ident>. This is not yet widely implemented; instead, URIs are encoded with a subset of ASCII according to RFC 2396. However, just because URIs are encoded with a subset of ASCII doesn't necessarily mean that filenames must be too.
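
      To make the recommendation above concrete, here is a minimal sketch in Python (used only for illustration; the thread itself is about PHP). It assumes nothing beyond the standard library, and the helper name percent_encode is made up for this example. It compares UTF-8 escaping with the single-byte iso-8859-1 escaping for the same text.

```python
from urllib.parse import quote, unquote


def percent_encode(text: str, encoding: str = "utf-8") -> str:
    """Encode text to bytes, then escape every byte that is not an
    unreserved ASCII character as %HH (hypothetical helper name)."""
    return quote(text, safe="", encoding=encoding)


if __name__ == "__main__":
    # UTF-8: the e-acute becomes two bytes, hence two %HH escapes.
    print(percent_encode("Café"))                # Caf%C3%A9
    # iso-8859-1: the same character is one byte, hence one escape.
    print(percent_encode("Café", "iso-8859-1"))  # Caf%E9
    # Decoding the UTF-8 form restores the original text.
    print(unquote("Caf%C3%A9"))                  # Café
```

      The two outputs differ for the same page name, which is why both ends have to agree on the encoding before escaping.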

      I wonder whether translating filenames would add an extra level of complexity to the Wiki code, and how different web servers will serve pages whose filenames are already encoded. Browsers are supposed to automatically replace % sequences in the URL field with native text, and use the % form only while communicating with other hosts, but I believe many do not do so consistently. I have seen some unfortunate web pages with strange double-encoded file names (%hh encoded a second time as %hh%hh%hh).
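
      The double-encoding problem can be reproduced in a few lines. This is a sketch in Python, not anything taken from the Wiki code, and decode_fully is a hypothetical helper. With a standards-following encoder, a second pass escapes the % sign itself as %25.

```python
from urllib.parse import quote, unquote

once = quote("é", safe="")    # first pass:  %C3%A9
twice = quote(once, safe="")  # second pass: %25C3%25A9 (each % became %25)


def decode_fully(value: str, max_rounds: int = 5) -> str:
    """Unquote repeatedly until the value stops changing.

    Hypothetical helper: it also mangles names that legitimately
    contain a literal %, so it is a diagnostic aid, not a general fix.
    """
    for _ in range(max_rounds):
        decoded = unquote(value)
        if decoded == value:
            break
        value = decoded
    return value


if __name__ == "__main__":
    print(once, twice, decode_fully(twice))
```

      Decoding until stable recovers the text here, but the safer design is to encode exactly once, at the boundary where the name leaves the application.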

      The best choice for Wiki Internationalization may be to convert all text strings and database content to unicode. I'm sure this would present its own additional problems within PHP. Maybe unicode could be planned for a version 3 or 4 of Wiki.

       
      • Geoffrey T. Dairiki

        The filename issue, I think, is not such a big deal in itself.  The code already urldecode()s filenames before interpreting them as wiki page names.  So renaming the file to the urlencode()d version works just fine.

        In other words, limiting the PhpWiki source file names to plain ASCII does not limit wiki page names or wiki page content to ASCII.
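
        A rough sketch of that round trip, written in Python rather than PHP and not taken from the PhpWiki source: page_to_filename and filename_to_page are hypothetical names, and iso-8859-1 is assumed because this thread says it is what stock PhpWiki supports.

```python
from urllib.parse import quote, unquote

ENCODING = "iso-8859-1"  # assumption: the encoding stock PhpWiki uses


def page_to_filename(page_name: str) -> str:
    """Escape everything except unreserved ASCII, giving an ASCII-only name."""
    return quote(page_name, safe="", encoding=ENCODING)


def filename_to_page(filename: str) -> str:
    """Reverse the escaping to recover the wiki page name."""
    return unquote(filename, encoding=ENCODING)


if __name__ == "__main__":
    stored = page_to_filename("FrançaisPage")
    print(stored)                    # Fran%E7aisPage
    print(stored.isascii())          # True
    print(filename_to_page(stored))  # FrançaisPage
```

        The stored name is plain ASCII, so tools that mishandle non-ASCII filenames never see the original characters, while the page name is recovered unchanged.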

        And you're most certainly correct that moving to Unicode is the way to go.  The problem is that you're also right about problems with PHP.  It's really too big of a headache (for me at least) to start thinking about moving PhpWiki to unicode until PHP's regular expression and string functions deal with unicode well.  (I don't think they do, yet --- but correct me if I'm wrong.)
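
        To illustrate the kind of trouble byte-oriented string and regular expression functions run into with UTF-8, here is a small sketch. It uses Python's bytes type as a stand-in for byte-oriented functions; it does not test PHP itself, so treat it as an analogy under that assumption.

```python
import re

text = "Müller"
raw = text.encode("utf-8")  # what a byte-oriented function would see

char_count = len(text)  # 6 characters
byte_count = len(raw)   # 7 bytes: the u-umlaut occupies two

# Byte-wise upper-casing only touches ASCII letters, so the two
# bytes of the u-umlaut are left as they were.
upper_bytes = raw.upper()
upper_text = text.upper()  # "MÜLLER"

# A byte pattern's \w matches ASCII word characters only, so the
# name no longer counts as a single word.
bytes_word = re.fullmatch(rb"\w+", raw)  # None
text_word = re.fullmatch(r"\w+", text)   # a match object

if __name__ == "__main__":
    print(char_count, byte_count, upper_bytes, upper_text)
    print(bytes_word, text_word)
```

        Length, case conversion and word matching all give different answers once the text is treated as raw bytes, which is the sort of breakage a Unicode move would have to guard against.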

        (Another issue is support for unicode in whatever SQL servers we want to use.  I.e.: How well are text searches through fields with UTF-8 data supported?)

        So, for now, stock PhpWiki supports only iso-8859-1 (though one could easily modify the code to work with any eight-bit encoding, I think).

         
    • Carsten Klapp

      Carsten Klapp - 2001-11-11

      I found some information about Unicode & UTF-8 at the php.net web site. There is a function which supposedly is able to set the encoding to UNICODE. Another page talks about an extended module for PHP which supports UTF-8 but not unicode. I'm not sure how this all translates into whether or not unicode or UTF-8 can be used with phpwiki.

      http://www.php.net/manual/en/ref.mbstring.php

      mbstring is an experimental extended PHP module. It must be enabled with the configure script:

        --enable-mbstring : Enable mbstring functions. This option is required to use mbstring functions.

        --enable-mbstr-enc-trans : Enable HTTP input character encoding conversion using mbstring conversion engine. If this feature is enabled, HTTP input character encoding may be converted to mbstring.internal_encoding automatically.

      http://www.php.net/manual/en/function.pg-set-client-encoding.php

      pg_set_client_encoding
      (PHP 3 CVS only, PHP 4 >= 4.0.3)

      pg_set_client_encoding -- Set the client encoding

      Description
      int pg_set_client_encoding ([int connection, string encoding])

      The function sets the client encoding and returns 0 on success or -1 on error.

      encoding is the client encoding and can be either : SQL_ASCII, EUC_JP, EUC_CN, EUC_KR, EUC_TW, UNICODE, MULE_INTERNAL, LATINX (X=1...9), KOI8, WIN, ALT, SJIS, BIG5, WIN1250.

       
