|
From: Gerhard <ger...@gm...> - 2001-11-22 05:46:13
|
Ok, maybe I'll just describe what I've done so far (locally).
- Changed the PgSQL module to also accept UnicodeType where it accepts
StringType
- Before sending the query string to the libpq module, check if the query
string is of type Unicode; if so, encode it via UTF-8 to a StringType and
send that instead
- in pgconnection.c, added a read-write attribute clientencoding to the
PgConnection_Type
All of this works pretty well so far; for example, the following works as
expected (never mind the weird chars: it's 'Internet' in Russian KOI8-R
encoding):
#!/usr/bin/env python
from pyPgSQL import PgSQL
con = PgSQL.connect(database="testu")
cursor = con.cursor()
name = unicode("éÎÔÅÒÎÅÔ", "koi8-r") # 'Internet' in Russian
cursor.execute("insert into gh (name) values ('%s')" % name)
print con.conn.clientencoding # 'UNICODE'
con.conn.clientencoding = 'KOI8'
print con.conn.clientencoding # 'KOI-8'
cursor.execute("select * from gh")
print cursor.fetchone()[0] # works, is automatically converted
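Roughly, the conversion step amounts to this (a sketch only; prepare_query
is an illustrative name, not the actual pyPgSQL internals):

import types

def prepare_query(query):
    # Sketch: Unicode query strings are encoded to UTF-8 byte strings
    # before they are handed to libpq; plain strings pass through as-is.
    if type(query) is types.UnicodeType:
        return query.encode('utf-8')
    return query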
For languages that cannot be encoded in 8 bits, I fear it will get more
complicated. So I propose the following:
- Strings sent to the backend: Unicode is encoded as UTF-8. StringType is sent
as-is like before (with escaping as needed). If people set the
clientencoding, PostgreSQL will even do the charset conversion (to Unicode or
whatever) for them.
- Strings retrieved from the backend: If the client encoding is UNICODE,
strings are always retrieved as UnicodeType (see the sketch after this list).
This is a major change, but IMO it's necessary to make using East Asian
languages possible at all. If people want to receive StringType but the data
could be Unicode, they have to set the client encoding accordingly. For
German, I'd have to set clientencoding to 'LATIN1', for example.
- If the PostgreSQL client encoding is any of the special non-Unicode ones
like SJIS, BIG5 or whatever, major reality failure happens ;-) I have no idea
about these encodings, and neither does Python.
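To illustrate the retrieval rule from the second point, a minimal sketch
(typecast_string is a hypothetical helper, not actual pyPgSQL API):

def typecast_string(value, clientencoding):
    # Sketch: with client encoding UNICODE, text columns come back
    # UTF-8-encoded and are decoded to UnicodeType; any other client
    # encoding leaves plain strings untouched.
    if clientencoding == 'UNICODE':
        return unicode(value, 'utf-8')
    return value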
Gerhard
--
mail: gerhard <at> bigfoot <dot> de registered Linux user #64239
web: http://www.cs.fhm.edu/~ifw00065/ OpenPGP public key id 86AB43C0
public key fingerprint: DEC1 1D02 5743 1159 CD20 A4B6 7B22 6575 86AB 43C0
reduce(lambda x,y:x+y,map(lambda x:chr(ord(x)^42),tuple('zS^BED\nX_FOY\x0b')))
|
|
From: Billy G. A. <Bil...@mu...> - 2001-11-28 23:21:00
|
Adam Buraczewski wrote:
> On Mon, Nov 26, 2001 at 02:24:13PM +0100, Gerhard Häring wrote:
>>> Then the PgSQL module should create a new Connection object, make a
>>> connection to the database, and send:
>>>
>>> SET CLIENT_ENCODING TO 'LATIN2';
>>
>> I think it's a better idea that the connect method gets an optional
>> parameter client_encoding (used if and only if conversions to/from
>> Unicode are done), but the user has to issue a
>> "SET CLIENT_ENCODING TO 'whatever'" manually, too.
>
> OK, it looks good to me. As a programmer I don't like it when a library
> wrapper works behind the scenes and sends commands to a database
> backend on its own. However, it would be nice if PgSQL could
> automatically send some commands to the PostgreSQL backend on every
> session or transaction start. Lately I have convinced Billy (at least I
> hope so ;) ) to introduce transaction isolation level support, which is
> still absent from other Python interfaces for PostgreSQL (pyPgSQL is
> the first, as far as I know).
You have. I have it implemented on my machine, but I've been swamped
with work-related issues, leaving me little time for the fun stuff
(pyPgSQL, etc.). I will put the patch for the transaction level changes
up on Friday. I am also going to propose that transaction level support
be added to the next DB-API specification.
> I thought a bit about all that and a general
> solution came to my mind: two lists (or, even better: dictionaries) of
> strings. One of them should be sent to the PostgreSQL backend on session
> start, the other just after every "BEGIN" command. It would then be
> possible to write something like this (an example, of course):
>
> conn = PgSQL.connect(database = "dbname",
>                      client_encoding = 'iso8859-2',
>                      on_session_start = ["SET CLIENT_ENCODING TO 'LATIN2';"],
>                      on_transaction_start = ["SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;"])
> .
> .
> .
> conn.on_transaction_start.append("some SQL commands");
>
> I agree that this idea might not be very bright ;) Especially
> isolation levels should probably be treated separately, due to their
> special meaning. However, such functionality should make it easier to
> provide future enhancements, which will likely be demanded by a
> growing community of pyPgSQL users :))
>
> What do you think about this?
I am leery of straying too far from the DB-API specification in the PgSQL
module (now the libpq module is a horse of a different color - it makes no
claim of DB-API compatibility).
> I'd like to add here that, for me, DBI 2.0's cursors should be used
> only for typical data statements like SELECT, INSERT, UPDATE and
> DELETE. Other SQL commands (especially those which CREATE or DROP
> something, ALTER a database structure, or SET some parameters)
> shouldn't be used this way (since they usually cannot be issued during
> a transaction, for instance), but the DBI specification does not provide
> any good solution for this. I think that this is all because programs
> which make use of DBI-compatible libraries should be portable (to
> other DBMSes), and that a good, transaction-safe method of sending
> these commands to PostgreSQL should be proposed here.
Actually, with the newer versions of PostgreSQL, things that could not be
done inside a transaction are now transaction-safe. For example, in version
7.1 you can drop tables and indices within a transaction, which you couldn't
do in previous versions. Also, you can use another connection with
autocommit turned on to do the CREATEs, DROPs and ALTERs, along the lines of
the sketch below.
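For example, something along these lines (a sketch only; whether the
Connection object exposes an autocommit switch exactly like this is an
assumption, not confirmed pyPgSQL API):

from pyPgSQL import PgSQL

work = PgSQL.connect(database='mydb')   # transactional work stays here
admin = PgSQL.connect(database='mydb')  # second connection just for DDL
admin.autocommit = 1                    # assumed attribute: DDL commits immediately

cur = admin.cursor()
cur.execute("CREATE INDEX gh_name_idx ON gh (name)")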
___________________________________________________________________________
____ | Billy G. Allie | Domain....: Bil...@mu...
| /| | 7436 Hartwell | MSN.......: B_G...@em...
|-/-|----- | Dearborn, MI 48126|
|/ |LLIE | (313) 582-1540 |
|
|
From: Adam B. <ad...@po...> - 2001-11-23 01:34:35
Attachments:
mbc-test.py
|
Hello,
I'm also interested in making pyPgSQL work well with various string
encodings. I mainly use ISO 8859-2 at the server side and Win CP 1250 or
UTF-8 at the client side.
On Thu, Nov 22, 2001 at 06:46:03AM +0100, Gerhard Häring wrote:
> - Changed the PgSQL module to also accept UnicodeType where it accepts
> StringType
It sounds great to me :)
> - Before sending the query string to the libpq module, check if the query
> string is of type Unicode; if so, encode it via UTF-8 to a StringType and
> send that instead
Well, IMHO it should rather be converted into the current client encoding
of the database. You shouldn't assume that when someone uses Python
Unicode strings, he/she also wants to use UNICODE at the server side. The
reason is that PostgreSQL still does not handle Unicode/UTF-8 completely
(for example, there are problems with Polish diacritical characters that
are absent when an 8-bit encoding is used at the server side).
> - in pgconnection.c, added a read-write attribute clientencoding to the
> PgConnection_Type
I cannot agree with changing anything in pyPgSQL.libpq. It is a
low-level module which has the same functionality as PostgreSQL's
native libpq library. It should only send data to the server and
allow reading the results, nothing more. In particular, it shouldn't
change character encodings implicitly.
At the very least, changing the way libpq deals with strings would
break some of my programs. ;((
However, such functionality should obviously be added to the pyPgSQL.PgSQL
module. It would be nice to write something like this (an example):
conn = PgSQL.connect(database = 'dbname',
                     client_encoding = 'iso8859-2',
                     unicode_results = 0)
Then the PgSQL module should create a new Connection object, make a
connection to the database, and send:
SET CLIENT_ENCODING TO 'LATIN2';
to the PostgreSQL backend. Later, instructions like:
c = conn.cursor()
c.execute(u'select sth from tab where field = %s;', u'aaaa')
should change both Unicode strings to ISO 8859-2, perform argument
substitution, and send the query to the backend. Results should be left
unchanged (encoded in client_encoding), unless "unicode_results == 1",
when all strings should be converted back to Unicode strings.
Please remember also that it is possible that someone uses PostgreSQL
without the Unicode and conversion-on-the-fly facilities. In such
circumstances the "client_encoding" and "unicode_results" variables should
not be set to anything, and PgSQL should not recode any strings (using
Unicode strings should be illegal) nor send "SET CLIENT_ENCODING"
commands to the backend.
I attached a small Python program which checks how PgSQL works with
various client-backend encodings. I wrote it for Billy G. Allie some
time ago. Feel free to use and modify it according to your needs.
Regards,
--
Adam Buraczewski <ad...@po...> * Linux registered user #165585
GCS/TW d- s-:+>+:- a- C+++(++++) UL++++$ P++ L++++ E++ W+ N++ o? K? w--
O M- V- PS+ !PE Y PGP+ t+ 5 X+ R tv- b+ DI? D G++ e+++>++++ h r+>++ y?
|
|
From: Gerhard <ger...@gm...> - 2001-11-26 13:24:25
|
Adam, thanks for letting me know your thoughts about this. I hope it's okay
to CC the list.

On Fri, Nov 23, 2001 at 02:12:11AM +0100, Adam Buraczewski wrote:
> Hello,
>
> I'm also interested in making pyPgSQL work well with various string
> encodings. I mainly use ISO 8859-2 at the server side and Win CP 1250 or
> UTF-8 at the client side.
>
> On Thu, Nov 22, 2001 at 06:46:03AM +0100, Gerhard Häring wrote:
> > - Changed the PgSQL module to also accept UnicodeType where it accepts
> > StringType
>
> It sounds great to me :)
>
> > - Before sending the query string to the libpq module, check if the query
> > string is of type Unicode; if so, encode it via UTF-8 to a StringType and
> > send that instead
>
> Well, IMHO it should rather be converted into the current client encoding
> of the database. You shouldn't assume that when someone uses Python
> Unicode strings, he/she also wants to use UNICODE at the server side. The
> reason is that PostgreSQL still does not handle Unicode/UTF-8 completely
> (for example, there are problems with Polish diacritical characters that
> are absent when an 8-bit encoding is used at the server side).

My implementation now converts from/to Unicode using the currently selected
client_encoding (see below).

> > - in pgconnection.c, added a read-write attribute clientencoding to the
> > PgConnection_Type
>
> I cannot agree with changing anything in pyPgSQL.libpq. [...]

All I did was expose the functions from PostgreSQL's libpq for changing and
querying the current client_encoding. Now I've dropped all this because it's
not necessary and was causing problems (see below).

> However, such functionality should obviously be added to the pyPgSQL.PgSQL
> module. It would be nice to write something like this (an example):
>
> conn = PgSQL.connect(database = 'dbname',
>                      client_encoding = 'iso8859-2',
>                      unicode_results = 0)

I like this proposal very much. Partly because it's almost what I had in
mind anyway :)

> Then the PgSQL module should create a new Connection object, make a
> connection to the database, and send:
>
> SET CLIENT_ENCODING TO 'LATIN2';

I've started implementing this, but when I had almost finished it, I threw
it all away. The reason is that this would become a maintenance nightmare
later on. I'd have to know about all possible names of an encoding at the
Python side, normalize them (using an ugly try/except and
encodings.aliases), and keep a dictionary mapping the Python encoding name
to the PostgreSQL encoding name. This dictionary would have to be updated
whenever new PostgreSQL encodings become available.

I think it's a better idea that the connect method gets an optional
parameter client_encoding (used if and only if conversions to/from Unicode
are done), but the user has to issue a "SET CLIENT_ENCODING TO 'whatever'"
manually, too.

I've changed the connect method (and the Connection constructor) like this:

- add a new parameter client_encoding. If client_encoding is None, it
defaults to sys.getdefaultencoding(); if it is a string,
self.client_encoding is set to (client_encoding,); otherwise it's left
unchanged. The tuple self.client_encoding is expanded into the parameters
of the string encode method and into the second and third parameters of the
unicode() function when doing charset conversion.
- add a unicode_results parameter. If true, the typecast() method in
TypeCache converts strings to Unicode strings using the client_encoding of
the connection object.
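In other words, roughly (a simplified sketch, not the actual patch):

import sys

def normalize_client_encoding(client_encoding):
    # None falls back to the Python default encoding; a bare string
    # becomes a one-element tuple; a tuple such as ('latin-1', 'replace')
    # is kept as-is.
    if client_encoding is None:
        return (sys.getdefaultencoding(),)
    if isinstance(client_encoding, str):
        return (client_encoding,)
    return client_encoding

def to_backend(u, client_encoding):
    # the tuple expands into the parameters of the encode method ...
    return u.encode(*client_encoding)

def from_backend(s, client_encoding):
    # ... and into the second and third parameters of unicode()
    return unicode(s, *client_encoding)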
> to the PostgreSQL backend. Later, instructions like:
>
> c = conn.cursor()
> c.execute(u'select sth from tab where field = %s;', u'aaaa')
>
> should change both Unicode strings to ISO 8859-2, perform argument
> substitution, and send the query to the backend. Results should be left
> unchanged (encoded in client_encoding), unless "unicode_results == 1",
> when all strings should be converted back to Unicode strings.
>
> Please remember also that it is possible that someone uses PostgreSQL
> without the Unicode and conversion-on-the-fly facilities. In such
> circumstances the "client_encoding" and "unicode_results" variables
> should not be set to anything, and PgSQL should not recode any strings
> (using Unicode strings should be illegal) nor send "SET CLIENT_ENCODING"
> commands to the backend.

Hmm. As I said, I'd rather not let pyPgSQL send SET CLIENT_ENCODING
commands, but for finding out whether libpq and/or the backend support
Unicode or charset conversion, I think I'll need additional functions in
libpq (if only for checking whether the MULTIBYTE macro is defined).

> I attached a small Python program which checks how PgSQL works with
> various client-backend encodings. I wrote it for Billy G. Allie some
> time ago. Feel free to use and modify it according to your needs.

Thanks, that will surely be useful for testing.

Everything is far from finished, but I'd like to hear what others (esp.
Billy) think about the interface and whether my approach is right.

Gerhard
--
mail: gerhard <at> bigfoot <dot> de registered Linux user #64239
web: http://www.cs.fhm.edu/~ifw00065/ OpenPGP public key id 86AB43C0
public key fingerprint: DEC1 1D02 5743 1159 CD20 A4B6 7B22 6575 86AB 43C0
reduce(lambda x,y:x+y,map(lambda x:chr(ord(x)^42),tuple('zS^BED\nX_FOY\x0b')))
|
From: Adam B. <ad...@po...> - 2001-11-28 20:24:50
|
On Mon, Nov 26, 2001 at 02:24:13PM +0100, Gerhard Häring wrote:
> > Then the PgSQL module should create a new Connection object, make a
> > connection to the database, and send:
> >
> > SET CLIENT_ENCODING TO 'LATIN2';
>
> I think it's a better idea that the connect method gets an optional
> parameter client_encoding (used if and only if conversions to/from
> Unicode are done), but the user has to issue a "SET CLIENT_ENCODING TO
> 'whatever'" manually, too.
OK, it looks good to me. As a programmer I don't like it when a library
wrapper works behind the scenes and sends commands to a database
backend on its own. However, it would be nice if PgSQL could
automatically send some commands to the PostgreSQL backend on every
session or transaction start. Lately I have convinced Billy (at least I
hope so ;) ) to introduce transaction isolation level support, which is
still absent from other Python interfaces for PostgreSQL (pyPgSQL is
the first, as far as I know). I thought a bit about all that and a
general solution came to my mind: two lists (or, even better:
dictionaries) of strings. One of them should be sent to the PostgreSQL
backend on session start, the other just after every "BEGIN" command.
It would then be possible to write something like this (an example, of
course):
conn = PgSQL.connect(database = "dbname",
                     client_encoding = 'iso8859-2',
                     on_session_start = ["SET CLIENT_ENCODING TO 'LATIN2';"],
                     on_transaction_start = ["SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;"])
.
.
.
conn.on_transaction_start.append("some SQL commands");
I agree that this idea might not be very bright ;) Especially
isolation levels should probably be treated separately, due to their
special meaning. However, such functionality should make it easier to
provide future enhancements, which will likely be demanded by a
growing community of pyPgSQL users :))
What do you think about this?
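For what it's worth, such hooks might be replayed roughly like this (a
sketch; the class and method names are illustrative, not proposed API):

class HookedConnection:
    def __init__(self, conn, on_session_start=None, on_transaction_start=None):
        self.conn = conn
        self.on_session_start = on_session_start or []
        self.on_transaction_start = on_transaction_start or []
        self._run(self.on_session_start)   # replayed once per session

    def _run(self, commands):
        cur = self.conn.cursor()
        for cmd in commands:
            cur.execute(cmd)

    def begin(self):
        # to be called by the wrapper right after it issues BEGIN
        self._run(self.on_transaction_start)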
I'd like to add here that, for me, DBI 2.0's cursors should be used
only for typical data statements like SELECT, INSERT, UPDATE and
DELETE. Other SQL commands (especially those which CREATE or DROP
something, ALTER a database structure, or SET some parameters)
shouldn't be used this way (since they usually cannot be issued during
a transaction, for instance), but the DBI specification does not provide
any good solution for this. I think that this is all because programs
which make use of DBI-compatible libraries should be portable (to
other DBMSes), and that a good, transaction-safe method of sending
these commands to PostgreSQL should be proposed here.
Regards,
--=20
Adam Buraczewski <ad...@po...> * Linux registered user #165585
GCS/TW d- s-:+>+:- a- C+++(++++) UL++++$ P++ L++++ E++ W+ N++ o? K? w--
O M- V- PS+ !PE Y PGP+ t+ 5 X+ R tv- b+ DI? D G++ e+++>++++ h r+>++ y?
|