From: Gerhard <ger...@gm...> - 2001-11-26 13:24:25
Adam, thanks for letting me know your thoughts about this. I hope it's
okay to CC the list.

On Fri, Nov 23, 2001 at 02:12:11AM +0100, Adam Buraczewski wrote:
> Hallo,
>
> I'm also interested in good working of pyPgSQL with various string
> encodings. I mainly use ISO 8859-2 at server side and Win CP 1250 or
> UTF-8 at client side.
>
> On Thu, Nov 22, 2001 at 06:46:03AM +0100, Gerhard Häring wrote:
> > - Changed the PgSQL module to accept also UnicodeType where it
> >   accepts StringType
>
> It sounds great for me :)
>
> > - Before sending the query string to the libpq module, check if the
> >   query string is of type Unicode, if so, encode it via UTF-8 to a
> >   StringType and send this one instead
>
> Well, it should be rather converted into current database client
> encoding IMHO. You shouldn't assume that when someone uses Python
> unicode strings, he/she wants also to use UNICODE at server side. The
> reason is that PostgreSQL still does not handle Unicode/UTF-8
> completely (for example, there are problems with Polish diacritical
> characters which are absent when only 8-bit encoding is used at
> server side).

My implementation now converts from/to Unicode using the currently
selected client_encoding (see below).

> > - in pgconnection.c, added a read-write attribute clientencoding to
> >   the PgConnection_Type
>
> I cannot agree with changing anything in pyPgSQL.libpq. [...]

All I did was expose the functions from PostgreSQL's libpq for changing
and querying the current client_encoding. I've now dropped all this
because it's not necessary and was causing problems (see below).

> However, such functionality should obviously be added to the
> pyPgSQL.PgSQL module. It would be nice to write something like this
> (an example):
>
>     conn = PgSQL.connect(database = 'dbname',
>                          client_encoding = 'iso8859-2',
>                          unicode_results = 0)

I like this proposal very much.
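Adam's point about converting to the client encoding rather than always
UTF-8 can be illustrated in a few lines (a minimal sketch in modern
Python; the sample string and variable names are my own, not pyPgSQL
code):

```python
# A Polish string with diacritics, as in Adam's example of characters
# that only survive when the right 8-bit server-side encoding is used.
city = u'\u0141\xf3d\u017a'  # "Łódź"

# Encoding via UTF-8 and via the session's actual client encoding give
# different byte strings; the backend only understands the one matching
# its client_encoding setting (here assumed to be ISO 8859-2 / LATIN2).
as_utf8 = city.encode('utf-8')
as_latin2 = city.encode('iso8859-2')

# Round trip back to Unicode, as unicode_results = 1 would do.
restored = as_latin2.decode('iso8859-2')
```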
Partly because it's almost what I had in mind anyway :)

> Then the PgSQL module should create a new Connection object, make a
> connection to the database, and send:
>
>     SET CLIENT_ENCODING TO 'LATIN2';

I've started implementing this, but when I had almost finished it I
threw it all away. The reason is that this would have become a
maintenance nightmare later on. I'd have to know about all possible
names of an encoding at the Python side, normalize them (using an ugly
try-except and encodings.aliases) and keep a dictionary to map the
Python encoding name to the PostgreSQL encoding name. This dictionary
would have to be updated whenever new PostgreSQL encodings become
available.

I think it's a better idea that the connect method gets an optional
parameter client_encoding (used if and only if conversions to/from
Unicode are done), but the user has to issue a "SET CLIENT_ENCODING TO
'whatever'" manually, too.

I've changed the connect method (and the Connection constructor) like
this:

- Add a new parameter client_encoding. If client_encoding is None, it
  defaults to sys.getdefaultencoding(); if it is a string,
  self.client_encoding is set to (client_encoding,); else it's left
  unchanged. The tuple self.client_encoding is expanded to the
  parameters of the string encode() method and to the second and third
  parameters of the unicode() function when doing charset conversion.

- Add a unicode_results parameter. If true, the typecast() method in
  TypeCache converts strings to Unicode strings using the
  client_encoding of the connection object.

> to the PostgreSQL backend. Later, instructions like:
>
>     c = conn.cursor()
>     c.execute(u'select sth from tab where field = %s;', u'aaaa')
>
> should change both Unicode strings to ISO 8859-2, perform argument
> substitution, and send a query to the backend. Results should be left
> without change (encoded in client_encoding), unless "unicode_results
> == 1", when all strings should be converted back to Unicode strings.
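The parameter handling I described can be sketched roughly like this
(function and variable names are mine, written for modern Python purely
for illustration; the real pyPgSQL code differs):

```python
import sys

def make_client_encoding(client_encoding=None):
    # None -> fall back to the interpreter's default encoding;
    # a plain string -> wrap it in a 1-tuple; anything else, e.g. an
    # (encoding, errors) tuple, is kept unchanged.
    if client_encoding is None:
        return (sys.getdefaultencoding(),)
    if isinstance(client_encoding, str):
        return (client_encoding,)
    return client_encoding

# The stored tuple is expanded into the arguments of encode()/decode(),
# so ('iso8859-2', 'replace') would work as well.
enc = make_client_encoding('iso8859-2')
query_bytes = u'select 1'.encode(*enc)   # Unicode query -> client encoding
result_text = query_bytes.decode(*enc)   # the unicode_results direction
```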
> Please remember also that it is possible that someone uses PostgreSQL
> without unicode and conversion-on-the-fly facilities. In such
> circumstances "client_encoding" and "unicode_results" variables should
> not be set to anything, and PgSQL should not recode any strings (using
> Unicode strings should be illegal) nor send "SET CLIENT_ENCODING"
> commands to the backend.

Hmm. As I said, I'd rather not let pyPgSQL send SET CLIENT_ENCODING
commands, but for finding out whether libpq and/or the backend support
Unicode or charset conversion, I think I'll need additional functions
in libpq (if only for checking whether the MULTIBYTE macro is defined).

> I attached a small Python program which checks how PgSQL works with
> various client-backend encodings. I wrote it for Billy G. Allie some
> time ago. Feel free to use and modify it, according to your needs.

Thanks, that will surely be useful for testing.

Everything is far from finished, but I'd like to hear what others
(esp. Billy) think about the interface and whether my approach is
right.

Gerhard

-- 
mail:   gerhard <at> bigfoot <dot> de       registered Linux user #64239
web:    http://www.cs.fhm.edu/~ifw00065/    OpenPGP public key id 86AB43C0
public key fingerprint: DEC1 1D02 5743 1159  CD20 A4B6 7B22 6575 86AB 43C0
reduce(lambda x,y:x+y,map(lambda x:chr(ord(x)^42),tuple('zS^BED\nX_FOY\x0b')))