utf-8 support

#257 utf-8 support

Milestone: Rendering

Status: closed

Owner: Reini Urban

Labels: None

Priority: 5

Updated: 2012-10-11

Created: 2004-03-15

Creator:

Private: No

Hi,

we're using phpwiki (http://www.wlug.org.nz), version
"1.3.3-jeffs-hacks" plus lots of our own minor
hacks, and thought you might want to know about some of
the stuff we did to get utf-8 mostly working everywhere.

Some of this might be done/different in cvs head - I
have no idea :)

phpwiki/index.php:
-define("CHARSET", "iso-8859-1");
+define("CHARSET","UTF-8");

we have a really ugly WikiNameRegexp - I couldn't get
pcre to use non-ascii [:upper:] and [:lower:] POSIX RE
classes, even with the right locale set:

# catches (most) Western accented chars encoded in utf-8
# \xc3\x80 - \xc3\x9e are Latin upper-case accented chars
# \xc3\x9f - \xc3\xbf are Latin lower-case accented chars

$WikiNameRegexp =
"(?<![[:alnum:]])(?:(?:(?:[A-Z]|[\xc3][\x80-\x9e])(?:[a-z]|[\xc3][\x9f-\xbf])+){2,})(?![[:alnum:]]+)";

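A quick way to sanity-check a byte-level pattern like this is to run it against a few UTF-8 encoded names. The sketch below is only an illustration: the helper `containsWikiName()` and the sample names are made up, and it assumes the pattern is used without the `/u` modifier so that PCRE matches raw bytes.

```php
<?php
// Pattern copied from the post. The double-quoted string makes PHP emit the
// raw bytes (0xC3, 0x80, ...) into the character classes.
$WikiNameRegexp = "(?<![[:alnum:]])(?:(?:(?:[A-Z]|[\xc3][\x80-\x9e])(?:[a-z]|[\xc3][\x9f-\xbf])+){2,})(?![[:alnum:]]+)";

// Hypothetical helper: does $text contain a WikiName according to $regexp?
function containsWikiName(string $text, string $regexp): bool
{
    // No /u modifier: the pattern works on bytes, not on code points.
    return preg_match('/' . $regexp . '/', $text) === 1;
}

// Sample names are made up; accented letters are spelled as UTF-8 bytes.
$samples = [
    'WikiWord',                  // plain ASCII
    "\xc3\x89cole\xc3\x89tude",  // "ÉcoleÉtude"
    "Caf\xc3\xa9Page",           // "CaféPage"
    'wikiword',                  // no upper-case humps
];
foreach ($samples as $name) {
    $result = containsWikiName($name, $WikiNameRegexp) ? 'match' : 'no match';
    echo $name, ' => ', $result, "\n";
}
```

The first three samples should match and the last should not, since it has no upper-case letter to start a hump.
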
phpwiki/lib/HtmlElement.php:
-define('NBSP', "\xA0"); // iso-8859-x non-breaking space
+define('NBSP', "\xC2\xA0"); // utf-8 non-breaking space

-$FieldSeparator = "\x81";
+$FieldSeparator = "\xFF"; // this byte should never appear in utf-8
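
To illustrate why both changes matter: 0x81 is a legal continuation byte in utf-8 (for example the second byte of "Á"), so it can split a character in half, while 0xFF never occurs in well-formed utf-8. A minimal sketch, with a hypothetical helper `isValidUtf8()` and made-up field values:

```php
<?php
// Hypothetical helper: true if $s is well-formed UTF-8.
function isValidUtf8(string $s): bool
{
    // With the /u modifier, preg_match() returns false on malformed UTF-8.
    return preg_match('//u', $s) === 1;
}

// Made-up field values; the first starts with "Á" (U+00C1), bytes C3 81.
$fields = ["\xC3\x81lvaro", 'HomePage'];

// Old separator: 0x81 is also a UTF-8 continuation byte, so "Á" gets cut in half.
$oldSplit = explode("\x81", implode("\x81", $fields));

// New separator: 0xFF is never part of well-formed UTF-8, so the round trip is clean.
$newSplit = explode("\xFF", implode("\xFF", $fields));

var_dump(count($oldSplit));         // one more piece than went in
var_dump($newSplit === $fields);    // fields survive unchanged
var_dump(isValidUtf8("\xC2\xA0"));  // utf-8 NBSP
var_dump(isValidUtf8("\xA0"));      // iso-8859-x NBSP alone is not valid utf-8
```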

phpwiki/lib/diff.php and display.php:
needs
+header("Content-Type: text/html; charset=" . CHARSET);
printed out before doing each GeneratePage
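
As a rough sketch of the ordering involved (the helper `contentTypeHeader()` is hypothetical and not part of phpwiki): header() only takes effect if it is called before any page output has been sent, which is why it has to come before the page is generated.

```php
<?php
define('CHARSET', 'UTF-8');

// Hypothetical helper: builds the header line from the configured charset.
function contentTypeHeader(string $charset): string
{
    return 'Content-Type: text/html; charset=' . $charset;
}

// header() has no effect once body output has started, so guard the call.
if (!headers_sent()) {
    header(contentTypeHeader(CHARSET));
}

// ... the page would be generated and printed after this point ...
```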

Discussion

Reini Urban - 2004-04-10

Logged In: YES
user_id=13755

Japanese seems to work fine with utf-8 now.
Can you check?


Anonymous - 2004-04-13

Logged In: YES
user_id=88277

Hi, I forgot a few things.

1) lib/editpage.php needs
+header("Content-Type: text/html; charset=" . CHARSET);
before the GeneratePage() call as well, and we also put it
in lib/main.php, at the top of the main() function.

2) We converted login.tmpl to use utf-8 encoding for the
example characters

3) We put the WikiNameRegexp back to
"(?:[[:upper:]][[:lower:]]+){2,}"; to keep it nice and clean,
and we modified lib/config.php's pcre_fix_posix_classes()
function to turn [:upper:] and [:lower:] into the ugly regexp:

static $classes = array(
    'alnum' => "0-9A-Za-z\xc0-\xd6\xd8-\xf6\xf8-\xff",
    'alpha' => "A-Za-z\xc0-\xd6\xd8-\xf6\xf8-\xff",
#   'upper' => "A-Z\xc0-\xd6\xd8-\xde",
#   'lower' => "a-z\xdf-\xf6\xf8-\xff"
);

# until posix class names/pcre work with utf-8
# utf-8 non-ascii chars: most common (eg western) latin
# chars are 0xc380-0xc3bf
# we currently ignore other less common non-ascii characters
# (eg central/east european) latin chars are 0xc480-0xc4bf
# and 0xc580-0xc5be
# and indian/cyrillic/asian languages

# this replaces [[:lower:]] with utf-8 match (Latin only)
$regexp = preg_replace('/\[\[\:lower\:\]\]/',
    '(?:[a-z]|\xc3[\x9f-\xbf]|\xc4[\x81\x83\x85\x87])',
    $regexp);
# this replaces [[:upper:]] with utf-8 match (Latin only)
$regexp = preg_replace('/\[\[\:upper\:\]\]/',
    '(?:[A-Z]|\xc3[\x80-\x9e]|\xc4[\x80\x82\x84\x86])',
    $regexp);

$keys = join('|', array_keys($classes));
return preg_replace("/\[:($keys):]/e", '$classes["\1"]',
    $regexp);
}
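
For anyone trying this on a current PHP: the /e modifier used above was removed in PHP 7.0, so the last step needs preg_replace_callback() instead. The following is only a sketch of the same rewrite under that assumption. The function name `fixPosixClassesUtf8()` is made up, the class tables are copied from the snippet above, and str_replace() stands in for the two literal preg_replace() calls.

```php
<?php
// Hypothetical stand-in for the modified pcre_fix_posix_classes() above.
function fixPosixClassesUtf8(string $regexp): string
{
    // Single-byte tables from the post, used for [:alnum:] and [:alpha:].
    $classes = [
        'alnum' => "0-9A-Za-z\xc0-\xd6\xd8-\xf6\xf8-\xff",
        'alpha' => "A-Za-z\xc0-\xd6\xd8-\xf6\xf8-\xff",
    ];

    // Swap [[:lower:]] and [[:upper:]] for byte-level utf-8 alternations
    // (Latin only). Single quotes keep \xNN as text for PCRE to interpret.
    $regexp = str_replace(
        ['[[:lower:]]', '[[:upper:]]'],
        [
            '(?:[a-z]|\xc3[\x9f-\xbf]|\xc4[\x81\x83\x85\x87])',
            '(?:[A-Z]|\xc3[\x80-\x9e]|\xc4[\x80\x82\x84\x86])',
        ],
        $regexp
    );

    // Replace the remaining [:name:] classes via a callback instead of /e.
    $keys = implode('|', array_keys($classes));
    return preg_replace_callback(
        "/\[:($keys):]/",
        function (array $m) use ($classes): string {
            return $classes[$m[1]];
        },
        $regexp
    );
}

$pattern = fixPosixClassesUtf8('(?:[[:upper:]][[:lower:]]+){2,}');
var_dump(preg_match('/^' . $pattern . '$/', "\xc3\x89cole\xc3\x89tude")); // "ÉcoleÉtude"
var_dump(preg_match('/^' . $pattern . '$/', 'wikiword'));
```

Where the PCRE build has Unicode property support, the /u modifier with \p{Lu} and \p{Ll} is another route, but that is a different approach from the one described here.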

I pasted some Japanese into a page, you can check/edit our
page: http://www.wlug.org.nz/TestUtf8

I have no idea how you decide what is a WikiWord in kanji
though :)

Reini Urban - 2004-04-15

Logged In: YES
user_id=13755

The previously missing automatic pcre_fix_posix_classes for utf-8
is now also included in the latest CVS.

1.3.9 has all the other fixes included, just the
$WikiNameRegexp has to be fixed there manually.

I'll now include full dynamic language changes for utf-8
languages as well, which currently have to use
define('CHARSET', 'utf-8');
