> 1) I'm currently encoding strings to UTF-8 before sending them to the
> database and converting strings from UTF-8 into the local character
> set when reading from the database. Is there a better way?
Using UTF-8 is usually the best way to store Unicode strings in the
database, unless the database has direct Unicode support. But UTF-8 has
a drawback: the number of bytes needed is often greater than the number
of Unicode characters in the string, since only ASCII characters fit in
a single byte. In most databases a string field is declared as, for
example, VARCHAR(n), where n is the maximum number of characters that
can be stored. If you store UTF-8 strings in a database without Unicode
support, n is effectively the number of bytes, not the number of
characters. A Unicode character occupies 1, 2, 3 or 4 bytes in UTF-8,
so a field meant to hold n arbitrary characters may need up to 4n bytes.
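
The difference between character count and byte count can be checked
directly. The following is only a small Python sketch; the helper names
utf8_len and fits_in_column are made up for illustration, and it assumes
the database counts VARCHAR(n) in bytes.

```python
def utf8_len(text: str) -> int:
    """Return the number of bytes the string occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def fits_in_column(text: str, max_bytes: int) -> bool:
    """Check whether the UTF-8 form of text fits a byte-limited field.

    Assumption: the database treats n in VARCHAR(n) as a byte limit.
    """
    return utf8_len(text) <= max_bytes


# Escapes are used so the sample does not depend on the file's own encoding.
samples = [
    "abc",                 # ASCII only: 1 byte per character
    "Gr\u00fc\u00dfe",     # two characters that need 2 bytes each
    "\u20acuro",           # the euro sign needs 3 bytes
    "\U0001F600",          # a character outside the BMP needs 4 bytes
]

for s in samples:
    print(f"{s!r}: {len(s)} characters, {utf8_len(s)} bytes")
```

With a check like this, a string of 5 characters can turn out to need 7
bytes, so it would not fit a field declared as VARCHAR(5) in such a
database.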
> 2) I'd like to add unit tests to make sure that the original Unicode
> text is preserved. For this I was planning on reading in text files
> of different character set encodings, placing them in the database as
> strings (VARCHAR), reading the values from the database again, and
> comparing the text file to the retrieved value. Does anyone know a
> good place to find sample Unicode documents for this purpose?
Maybe the following pages are a good starting point to find examples:
http://www.cl.cam.ac.uk/~mgk25/unicode.html
http://www.i18nguy.com/unicode/unicode-example-intro.html
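
A round-trip test along the lines you describe could look roughly like
the Python sketch below. It is not tied to your setup: an in-memory
SQLite database from the standard sqlite3 module stands in for the real
database, the byte strings stand in for the contents of the sample
files, and the names round_trip, check_encodings and CASES are
hypothetical.

```python
import sqlite3


def round_trip(conn: sqlite3.Connection, text: str) -> str:
    """Store text in the table, read it back, and return the stored value."""
    conn.execute("DELETE FROM samples")
    conn.execute("INSERT INTO samples (body) VALUES (?)", (text,))
    (stored,) = conn.execute("SELECT body FROM samples").fetchone()
    return stored


def check_encodings(cases):
    """Return the encodings whose text did not survive the round trip."""
    # Assumption: SQLite is only a stand-in here; it does not enforce the
    # declared VARCHAR length, unlike many other databases.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE samples (body VARCHAR(100))")
    failures = []
    for encoding, text in cases:
        raw = text.encode(encoding)      # stands in for the bytes of a file
        original = raw.decode(encoding)  # what the application reads in
        if round_trip(conn, original) != original:
            failures.append(encoding)
    conn.close()
    return failures


# Sample texts in several source encodings, written with escapes.
CASES = [
    ("latin-1", "Gr\u00fc\u00dfe"),
    ("utf-8", "\u65e5\u672c\u8a9e"),
    ("utf-16", "\u20acuro"),
    ("koi8-r", "\u041f\u0440\u0438\u0432\u0435\u0442"),
]

print(check_encodings(CASES))
```

In a real test you would read the bytes from the sample files instead
and compare against the decoded file contents in the same way.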
Regards,
Ulrich
--
E-Mail privat: Ulr...@gm...
E-Mail Studium: Ulr...@Fe...
World Wide Web: http://www.stud.fernuni-hagen.de/q1471341