From: Gerhard <ger...@gm...> - 2001-11-22 05:07:57
I'll try to summarize my findings on the world beyond US-ASCII in PostgreSQL
and Python:

Python
======

Python has two string types: StringType and UnicodeType.

Unicode is the simpler case (really), because in Unicode every character has
a defined meaning. Its meaning doesn't depend on the charset used.

StringType is ok as long as you only use US-ASCII (chars <= 127). But if you
use 8-bit characters, the meaning of the characters depends on the current
charset. This is important if you want to convert between StringType and
UnicodeType, for example. For conversion, you must know in which charset the
StringType is encoded.

There's only one way to set the default charset in Python and it's awkward
(and the designers wanted it like this): you must set it in a
sitecustomize.py that must be somewhere in the PYTHONPATH.

If you don't set your default encoding explicitly, it defaults to 'ascii':

    >>> import sys
    >>> sys.getdefaultencoding()
    'ascii'

Ok, now what happens if I put the following sitecustomize.py in my
PYTHONPATH:

    import sys
    sys.setdefaultencoding('iso-8859-1')

    >>> import sys
    >>> sys.getdefaultencoding()
    'iso-8859-1'

The default encoding (the value returned by sys.getdefaultencoding()) is the
encoding that is used when you don't supply an encoding explicitly when
converting between UnicodeType and StringType.

PostgreSQL
==========

(My PostgreSQL is built with all i18n features on:
--enable-unicode-conversion --enable-recode --enable-multibyte
--enable-locale)

Here's what little info there is from the PostgreSQL docs:
http://www.postgresql.org/idocs/index.php?multibyte.html

PostgreSQL can set an encoding for the database, and it can have a
client-encoding for the client library. Some combinations of encodings can be
transparently converted by PostgreSQL.

It *looks* like (when client-encoding is UNICODE) PostgreSQL sends UTF-8 to
the backend, but I haven't found any description of this implementation.
Well, UTF-8 seems to be the normal way of sending Unicode around nowadays.
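To make the charset dependence described in the Python section concrete, here is a small sketch. It is only an illustration, not anything from the PostgreSQL or Python docs: the byte 0xE4 and the two charsets are arbitrary picks of mine, and I use the b"..." / u"..." literal prefixes (available on current interpreters) to stand in for StringType and UnicodeType.

```python
# Sketch only: b"..." plays the role of StringType, u"..." of UnicodeType.
# The sample byte and the charsets are arbitrary choices for illustration.

raw = b"\xe4"  # a single 8-bit byte; its meaning depends on the charset

as_latin1 = raw.decode("iso-8859-1")  # LATIN SMALL LETTER A WITH DIAERESIS
as_greek = raw.decode("iso-8859-7")   # a Greek letter: same byte, other meaning

# With the 'ascii' codec, a byte above 127 cannot be decoded at all.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Once we have Unicode, the character is unambiguous, and its UTF-8 form
# (what we would presumably send to PostgreSQL) takes two bytes here.
utf8_bytes = as_latin1.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")

print(repr(as_latin1), repr(as_greek), ascii_ok, repr(utf8_bytes))
```

The point is that the conversion only works reliably when the charset is named explicitly, rather than left to whatever the default encoding happens to be.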
We can create UTF-8 relatively easily from a Python Unicode string:

    u"whatever".encode("utf-8")

So much for my analysis, I'll shortly write something about implementation.

Gerhard
--
mail: gerhard <at> bigfoot <dot> de
registered Linux user #64239
web: http://www.cs.fhm.edu/~ifw00065/
OpenPGP public key id 86AB43C0
public key fingerprint: DEC1 1D02 5743 1159 CD20 A4B6 7B22 6575 86AB 43C0
reduce(lambda x,y:x+y,map(lambda x:chr(ord(x)^42),tuple('zS^BED\nX_FOY\x0b')))