#257 utf-8 support

Rendering
closed
None
5
2012-10-11
2004-03-15
No

Hi,

we're using phpwiki (http://www.wlug.org.nz), version
"1.3.3-jeffs-hacks" including lots of our own minor
hacks, and thought you might want to know about some of
the stuff we did to get utf-8 mostly working everywhere.

Some of this might be done/different in cvs head - I
have no idea :)

phpwiki/index.php:
-define("CHARSET", "iso-8859-1");
+define("CHARSET","UTF-8");

we have a really ugly WikiNameRegexp - I couldn't get
pcre to use non-ascii [:upper:] and [:lower:] POSIX RE
classes, even with the right locale set:

catches (most) Western accented chars encoded in utf-8

\xc3\x80 - \xc3\x9e are Latin upper-case accented chars

\xc3\x9f - \xc3\xbf are Latin lower-case accented chars

$WikiNameRegexp =
"(?<![[:alnum:]])(?:(?:(?:[A-Z]|[\xc3][\x80-\x9e])(?:[a-z]|[\xc3][\x9f-\xbf])+){2,})(?![[:alnum:]]+)";

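A quick way to see what the regexp above accepts is to run it against a few UTF-8 encoded strings. This is only a sketch: the helper name is_wiki_name and the sample words are made up for illustration, the pattern is copied from the post, and the non-ASCII samples are written as byte escapes so the result does not depend on the encoding of the script file.

```php
<?php
// Pattern copied from the post; it matches raw bytes (no /u modifier).
$WikiNameRegexp =
    "(?<![[:alnum:]])(?:(?:(?:[A-Z]|[\xc3][\x80-\x9e])(?:[a-z]|[\xc3][\x9f-\xbf])+){2,})(?![[:alnum:]]+)";

// Hypothetical helper: true if the whole string is one WikiName.
function is_wiki_name($regexp, $text) {
    return preg_match('/^' . $regexp . '$/', $text) === 1;
}

// Sample words (assumptions, not taken from the wiki itself).
$samples = array(
    "WikiWord",                  // plain ASCII
    "\xc3\x89cole\xc3\x89tude",  // "EcoleEtude" with accented capital E (C3 89)
    "Caf\xc3\xa9Bar",            // accented lower-case e (C3 A9)
    "wikiword",                  // no upper-case letters
);

foreach ($samples as $word) {
    echo $word, ': ', is_wiki_name($WikiNameRegexp, $word) ? 'yes' : 'no', "\n";
}
```

Note that the ranges also include the multiplication and division signs (C3 97 and C3 B7), since the pattern works on byte ranges rather than on letter classes.
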
phpwiki/lib/HtmlElement.php:
-define('NBSP', "\xA0"); // iso-8859-x non-breaking space
+define('NBSP',"\xC2\xA0"); // utf-8 non-breaking space

-$FieldSeparator = "\x81";
+$FieldSeparator = "\xFF"; // this byte should never appear in utf-8
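
The reasoning behind these two changes can be checked with a few lines of PHP. This is a rough sketch, not code from phpwiki: is_valid_utf8 is a hypothetical helper, and it assumes a PCRE build with UTF-8 support (so that preg_match() with the /u modifier rejects malformed input).

```php
<?php
// U+00A0 (non-breaking space) is the two bytes C2 A0 in UTF-8.
$nbsp_utf8 = html_entity_decode('&nbsp;', ENT_QUOTES, 'UTF-8');

// Hypothetical helper: true if $bytes is well-formed UTF-8.
// preg_match() with /u returns false (not 0) on malformed subjects.
function is_valid_utf8($bytes) {
    return preg_match('//u', $bytes) === 1;
}

var_dump($nbsp_utf8 === "\xC2\xA0");    // is the entity decoded to C2 A0?
var_dump(is_valid_utf8("\xA0"));        // a lone iso-8859-x NBSP byte
var_dump(is_valid_utf8("a\xFFb"));      // text containing the new separator byte

// 0x81 is a legal continuation byte: U+00C1 is encoded as C3 81,
// so the old separator can occur inside ordinary UTF-8 text.
var_dump(strpos("\xC3\x81", "\x81") !== false);
```

Because 0xFF is not a valid byte anywhere in a UTF-8 sequence, splitting on it cannot cut a character in half, which is not true of 0x81.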

phpwiki/lib/diff.php and display.php:
both need
+header("Content-Type: text/html; charset=" . CHARSET);
sent before each GeneratePage() call
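
As a minimal sketch of that header change (the helper charset_header is made up for illustration, and the guard with headers_sent() is an addition, not something the patch does):

```php
<?php
define('CHARSET', 'UTF-8');

// Hypothetical helper that builds the header line used in the patch.
function charset_header($charset) {
    return "Content-Type: text/html; charset=" . $charset;
}

// header() only works before any page output has been sent,
// so this has to run ahead of the page generation.
if (!headers_sent()) {
    header(charset_header(CHARSET));
}
```

Sending the charset in the HTTP header matters because browsers otherwise guess the encoding, and a wrong guess garbles the multi-byte characters.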

Discussion

  • Reini Urban

    Reini Urban - 2004-04-10

    Logged In: YES
    user_id=13755

    Japanese seems to work fine with utf-8 now.
    Can you check?

     
  • Anonymous - 2004-04-13

    Logged In: YES
    user_id=88277

    Hi, I forgot a few things.

    1) lib/editpage.php needs
    +header("Content-Type: text/html; charset=" . CHARSET);
    before the GeneratePage() call as well, and we also put it
    in lib/main.php, at the top of the main() function.

    2) We converted login.tmpl to use utf-8 encoding for the
    example characters

    3) We put the WikiNameRegexp back to
    "(?:[[:upper:]][[:lower:]]+){2,}"; to keep it nice and clean,
    and we modified lib/config.php's pcre_fix_posix_classes()
    function to turn [:upper:] and [:lower:] into the ugly regexp:

    static $classes = array(
        'alnum' => "0-9A-Za-z\xc0-\xd6\xd8-\xf6\xf8-\xff",
        'alpha' => "A-Za-z\xc0-\xd6\xd8-\xf6\xf8-\xff",
        # 'upper' => "A-Z\xc0-\xd6\xd8-\xde",
        # 'lower' => "a-z\xdf-\xf6\xf8-\xff"
    );

    # until posix class names/pcre work with utf-8:
    # utf-8 non-ascii chars: most common (eg western) latin
    # chars are 0xc380-0xc3bf
    # we currently ignore other less common non-ascii characters:
    # (eg central/east european) latin chars are 0xc480-0xc4bf
    # and 0xc580-0xc5be,
    # and indian/cyrillic/asian languages

    # this replaces [[:lower:]] with utf-8 match (Latin only)
    $regexp = preg_replace('/\[\[\:lower\:\]\]/',
        '(?:[a-z]|\xc3[\x9f-\xbf]|\xc4[\x81\x83\x85\x87])',
        $regexp);
    # this replaces [[:upper:]] with utf-8 match (Latin only)
    $regexp = preg_replace('/\[\[\:upper\:\]\]/',
        '(?:[A-Z]|\xc3[\x80-\x9e]|\xc4[\x80\x82\x84\x86])',
        $regexp);

    $keys = join('|', array_keys($classes));

    return preg_replace("/\[:($keys):]/e", '$classes["\1"]',
        $regexp);
    }
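
The /e modifier used in that last preg_replace() call was removed in PHP 7, so for anyone trying this on a current PHP, here is a rough standalone sketch of the same rewrite. The function name pcre_fix_posix_classes_utf8 is made up, str_replace() and preg_replace_callback() are swapped in for the original calls, and the byte ranges are the ones from the snippet above.

```php
<?php
// Sketch: same substitutions as above, without the removed /e modifier.
function pcre_fix_posix_classes_utf8($regexp) {
    // Single-byte fallbacks kept from the post (upper/lower are handled below).
    $classes = array(
        'alnum' => "0-9A-Za-z\xc0-\xd6\xd8-\xf6\xf8-\xff",
        'alpha' => "A-Za-z\xc0-\xd6\xd8-\xf6\xf8-\xff",
    );

    // [[:lower:]] and [[:upper:]] become byte-level UTF-8 alternations (Latin only).
    $regexp = str_replace('[[:lower:]]',
        '(?:[a-z]|\xc3[\x9f-\xbf]|\xc4[\x81\x83\x85\x87])', $regexp);
    $regexp = str_replace('[[:upper:]]',
        '(?:[A-Z]|\xc3[\x80-\x9e]|\xc4[\x80\x82\x84\x86])', $regexp);

    // Remaining [:alnum:] / [:alpha:] names are expanded from the table.
    $keys = join('|', array_keys($classes));
    return preg_replace_callback(
        "/\[:($keys):]/",
        function ($m) use ($classes) { return $classes[$m[1]]; },
        $regexp
    );
}

$WikiNameRegexp = pcre_fix_posix_classes_utf8("(?:[[:upper:]][[:lower:]]+){2,}");
echo $WikiNameRegexp, "\n";
```

The expanded pattern is meant to be used without the /u modifier, since it matches UTF-8 text byte by byte.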

    I pasted some Japanese into a page, you can check/edit our
    page: http://www.wlug.org.nz/TestUtf8

    I have no idea how you decide what is a WikiWord in kanji
    though :)

     
  • Reini Urban

    Reini Urban - 2004-04-15

    Logged In: YES
    user_id=13755

    The previously missing automatic pcre_fix_posix_classes for
    utf-8 is now also included in the latest CVS.

    1.3.9 has all the other fixes included, just the
    $WikiNameRegexp has to be fixed there manually.

    I'll now include full dynamic language changes for utf-8
    languages as well, which currently have to use
    define('CHARSET', 'utf-8');

     
