#257 utf-8 support



we're using phpwiki (http://www.wlug.org.nz) (version
"1.3.3-jeffs-hacks", including lots of our minor
hacks), and thought you might want to know about some of
the stuff we did to get utf-8 mostly working everywhere.

Some of this might be done/different in cvs head - I
have no idea :)

-define("CHARSET", "iso-8859-1");
+define("CHARSET", "utf-8");

we have a really ugly WikiNameRegexp - I couldn't get
pcre to use non-ascii [:upper:] and [:lower:] POSIX RE
classes, even with the right locale set. It catches
(most) Western accented chars encoded in utf-8:

\xc3\x80 - \xc3\x9e are the Latin upper-case accented chars
\xc3\x9f - \xc3\xbf are the Latin lower-case accented chars

$WikiNameRegexp =

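The byte ranges quoted above can be sanity-checked. Here is a sketch (in Python, since the wiki itself is PHP) of an equivalent WikiName pattern written over Unicode code points rather than raw utf-8 bytes; note these ranges also sweep in the multiplication sign (U+00D7) and division sign (U+00F7), which this sketch ignores:

```python
import re

# Equivalent of the ugly byte-range regexp, over code points:
# U+00C0-U+00DE are the Latin upper-case accented chars
# (\xc3\x80-\xc3\x9e in utf-8), U+00DF-U+00FF the lower-case ones
# (\xc3\x9f-\xc3\xbf).
WIKI_NAME = re.compile("(?:[A-Z\u00C0-\u00DE][a-z\u00DF-\u00FF]+){2,}")

# Verify the utf-8 encodings match the byte ranges quoted above.
assert "\u00C0".encode("utf-8") == b"\xc3\x80"
assert "\u00DE".encode("utf-8") == b"\xc3\x9e"
assert "\u00FF".encode("utf-8") == b"\xc3\xbf"
```

With this pattern, accented WikiWords like "ÉcoleWiki" match, while a single capitalized word does not (the `{2,}` requires at least two chunks).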
-define('NBSP', "\xA0");       // iso-8859-x non-breaking space
+define('NBSP', "\xC2\xA0");   // utf-8 non-breaking space

-$FieldSeparator = "\x81";
+$FieldSeparator = "\xFF";     // this byte should never appear in utf-8
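Both byte choices can be checked mechanically: \xC2\xA0 is exactly the utf-8 encoding of U+00A0, and 0xFF can never occur in well-formed utf-8, so it is safe as an internal field separator. A quick check, sketched in Python:

```python
# U+00A0 (no-break space) encodes to the two bytes C2 A0 in utf-8;
# the single byte A0 used under iso-8859-x is not valid utf-8 on its own.
assert "\u00a0".encode("utf-8") == b"\xc2\xa0"

# 0xFF is not a legal byte anywhere in a utf-8 stream, so it cannot
# collide with page text.
try:
    b"\xff".decode("utf-8")
    ff_is_invalid = False
except UnicodeDecodeError:
    ff_is_invalid = True
assert ff_is_invalid
```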

In phpwiki/lib/diff.php and lib/display.php:

+header("Content-Type: text/html; charset=" . CHARSET);

printed out before doing each GeneratePage().
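The header has to go out before GeneratePage() because HTTP headers must precede the first body byte; once output starts, PHP's header() can no longer take effect. A minimal sketch of that ordering (in Python for illustration; the function name is hypothetical):

```python
CHARSET = "utf-8"  # mirrors phpwiki's CHARSET constant

def render_page(body_html: str) -> bytes:
    # Headers first, then a blank line, then the body -- the charset
    # declaration must arrive before any page content is flushed.
    headers = f"Content-Type: text/html; charset={CHARSET}\r\n"
    return (headers + "\r\n" + body_html).encode(CHARSET)

resp = render_page("<p>caf\u00e9</p>")
head, _, body = resp.partition(b"\r\n\r\n")
```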


  • Reini Urban - 2004-04-10

    Japanese seems to work fine with utf-8 now.
    Can you check?

  • Anonymous - 2004-04-13

    Hi, I forgot a few things.

    1) lib/editpage.php needs
    +header("Content-Type: text/html; charset=" . CHARSET);
    before the GeneratePage() call as well, and we also put it
    in lib/main.php, at the top of the main() function.

    2) We converted login.tmpl to use utf-8 encoding for the
    example characters.

    3) We put the WikiNameRegexp back to
    "(?:[[:upper:]][[:lower:]]+){2,}"; to keep it nice and clean,
    and we modified lib/config.php's pcre_fix_posix_classes()
    function to turn [:upper:] and [:lower:] into the ugly regexp:

    static $classes = array(
                            'alnum' =>
                            'alpha' =>
                            # 'upper' =>
                            # 'lower' => "a-z\xdf-\xf6\xf8-\xff"

    # until posix class names/pcre work with utf-8
    # utf-8 non-ascii chars: most common (eg western) latin
    # chars are 0xc380-0xc3bf
    # we currently ignore other less common non-ascii characters:
    # (eg central/east european) latin chars are 0xc480-0xc4bf
    # and 0xc580-0xc5be,
    # and indian/cyrillic/asian languages

    # this replaces [[:lower:]] with a utf-8 match (Latin only)
    $regexp = preg_replace('/\[\[\:lower\:\]\]/',

    # this replaces [[:upper:]] with a utf-8 match (Latin only)
    $regexp = preg_replace('/\[\[\:upper\:\]\]/',

    $keys = join('|', array_keys($classes));
    return preg_replace("/\[:($keys):]/e", '$classes["\1"]',

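The substitution described in 3) can be sketched as a standalone function (in Python here; the real fix lives in lib/config.php's pcre_fix_posix_classes()). The exact replacement ranges are assumptions in the spirit of the truncated values above:

```python
import re

# Hypothetical stand-in for phpwiki's pcre_fix_posix_classes():
# rewrite POSIX character classes into explicit ranges, extended with
# the common Western Latin accented characters (U+00C0-U+00FF).
CLASSES = {
    "upper": "A-Z\u00c0-\u00de",  # plus Latin upper-case accented chars
    "lower": "a-z\u00df-\u00ff",  # plus Latin lower-case accented chars
}

def fix_posix_classes(pattern: str) -> str:
    # Replace every [:name:] occurrence with its expanded range; the
    # surrounding [ ] of the character class is left in place, mirroring
    # the preg_replace("/\[:($keys):]/e", ...) trick above.
    return re.sub(
        r"\[:(\w+):\]",
        lambda m: CLASSES.get(m.group(1), m.group(0)),
        pattern,
    )

wiki_name = fix_posix_classes("(?:[[:upper:]][[:lower:]]+){2,}")
```

With this, the clean POSIX-style $WikiNameRegexp expands into a pattern that also matches accented WikiWords such as "ÉcoleWiki".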

    I pasted some Japanese into a page, you can check/edit our
    page: http://www.wlug.org.nz/TestUtf8

    I have no idea how you decide what is a WikiWord in kanji
    though :)

  • Reini Urban - 2004-04-15

    The automatic pcre_fix_posix_classes fix for utf-8 is now
    included in the latest CVS.

    1.3.9 has all the other fixes included; just the
    $WikiNameRegexp has to be fixed there manually.

    I'll now include full dynamic language changes for utf-8
    languages also, which currently have to use
    define('CHARSET', 'utf-8');

