|
From: Gerhard <ger...@gm...> - 2001-11-22 05:46:13
|
Ok, maybe I'll just describe what I've done so far (locally).
- Changed the PgSQL module to also accept UnicodeType where it accepts
StringType
- Before sending the query string to the libpq module, check if the query
string is of type Unicode; if so, encode it via UTF-8 to a StringType and
send that instead
- in pgconnection.c, added a read-write attribute clientencoding to the
PgConnection_Type
All of this works pretty well so far; for example, the following works as
expected (never mind the weird chars: it's 'Internet' in Russian KOI8-R
encoding):
#!/usr/bin/env python
from pyPgSQL import PgSQL
con = PgSQL.connect(database="testu")
cursor = con.cursor()
name = unicode("éÎÔÅÒÎÅÔ", "koi8-r") # 'Internet' in Russian
cursor.execute("insert into gh (name) values ('%s')" % name)
print con.conn.clientencoding # 'UNICODE'
con.conn.clientencoding = 'KOI8'
print con.conn.clientencoding # 'KOI-8'
cursor.execute("select * from gh")
print cursor.fetchone()[0] # works, is automatically converted
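Roughly, the conversion step amounts to this (a sketch only; prepare_query
is an illustrative name, not the actual pyPgSQL internals):

import types

def prepare_query(query):
    # Sketch: Unicode query strings are encoded to UTF-8 byte strings
    # before they are handed to libpq; plain strings pass through as-is.
    if type(query) is types.UnicodeType:
        return query.encode('utf-8')
    return query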
For languages that cannot be encoded in 8 bits, I fear it will get more
complicated. So I propose the following:
- Strings sent to the backend: Unicode is encoded as UTF-8. StringType is sent
as-is like before (with escaping as needed). If people set the
clientencoding, PostgreSQL will even do the charset conversion (to Unicode or
whatever) for them.
- Strings retrieved from the backend: If the client encoding is UNICODE,
strings are always retrieved as UnicodeType (see the sketch after this list).
This is a major change, but IMO it's necessary to make using East Asian
languages possible at all. If people want to receive StringType but the data
could be Unicode, they have to set the client encoding accordingly. For
German, I'd have to set clientencoding to 'LATIN1', for example.
- If the PostgreSQL client encoding is any of the special non-Unicode ones
like SJIS, BIG5 or whatever, major reality failure happens ;-) I have no idea
about these encodings, and neither does Python.
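To illustrate the retrieval rule from the second point, a minimal sketch
(typecast_string is a hypothetical helper, not actual pyPgSQL API):

def typecast_string(value, clientencoding):
    # Sketch: with client encoding UNICODE, text columns come back
    # UTF-8-encoded and are decoded to UnicodeType; any other client
    # encoding leaves plain strings untouched.
    if clientencoding == 'UNICODE':
        return unicode(value, 'utf-8')
    return value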
Gerhard
--
mail: gerhard <at> bigfoot <dot> de registered Linux user #64239
web: http://www.cs.fhm.edu/~ifw00065/ OpenPGP public key id 86AB43C0
public key fingerprint: DEC1 1D02 5743 1159 CD20 A4B6 7B22 6575 86AB 43C0
reduce(lambda x,y:x+y,map(lambda x:chr(ord(x)^42),tuple('zS^BED\nX_FOY\x0b')))
|
|
From: Billy G. A. <Bil...@mu...> - 2001-11-28 23:21:00
|
Adam Buraczewski wrote:
> On Mon, Nov 26, 2001 at 02:24:13PM +0100, Gerhard Häring wrote:
>>> Then the PgSQL module should create a new Connection object, make a
>>> connection to the database, and send:
>>>
>>> SET CLIENT_ENCODING TO 'LATIN2';
>>
>> I think it's a better idea that the connect method gets an optional
>> parameter client_encoding (used if and only if conversions to/from
>> Unicode are done), but the user has to issue a
>> "SET CLIENT_ENCODING TO 'whatever'" manually, too.
>
> OK, it looks good to me. As a programmer I don't like it when a library
> wrapper works behind the scenes and sends commands to a database
> backend on its own. However, it would be nice if PgSQL could
> automatically send some commands to the PostgreSQL backend on every
> session or transaction start. Lately I have convinced Billy (at least I
> hope so ;) ) to introduce transaction isolation level support, which is
> still absent from other Python interfaces for PostgreSQL (pyPgSQL is
> the first, as far as I know).
You have. I have it implemented on my machine, but I've been swamped
with work-related issues, leaving me little time for the fun stuff
(pyPgSQL, etc.). I will put the patch for the transaction level changes
up on Friday. I am also going to propose that transaction level support
be added to the next DB-API specification.
> I thought a bit about all that and a general
> solution came to my mind: two lists (or, even better: dictionaries) of
> strings. One of them should be sent to the PostgreSQL backend on session
> start, the other just after every "BEGIN" command. It would then be
> possible to write something like this (an example, of course):
>
> conn = PgSQL.connect(database = "dbname",
>                      client_encoding = 'iso8859-2',
>                      on_session_start = ["SET CLIENT_ENCODING TO 'LATIN2';"],
>                      on_transaction_start = ["SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;"])
> .
> .
> .
> conn.on_transaction_start.append("some SQL commands");
>
> I agree that this idea might not be very bright ;) Especially
> isolation levels should probably be treated separately, due to their
> special meaning. However, such functionality should make it easier to
> provide future enhancements, which will likely be demanded by a
> growing community of pyPgSQL users :))
>
> What do you think about this?
I am leery of straying too far from the DB-API specification in the PgSQL
module (now the libpq module is a horse of a different color - it makes no
claim of DB-API compatibility).
> I'd like to add here that, for me, DBI 2.0's cursors should be used
> only for typical data statements like SELECT, INSERT, UPDATE and
> DELETE. Other SQL commands (especially those which CREATE or DROP
> something, ALTER a database structure, or SET some parameters)
> shouldn't be used this way (since they usually cannot be issued during
> a transaction, for instance), but the DBI specification does not provide
> any good solution for this. I think that this is all because programs
> which make use of DBI-compatible libraries should be portable (to
> other DBMSes), and that a good, transaction-safe method of sending
> these commands to PostgreSQL should be proposed here.
Actually, with the newer versions of PostgreSQL, things that could not be
done inside a transaction are now transaction-safe. For example, in version
7.1 you can drop tables and indices within a transaction, which you couldn't
do in previous versions. Also, you can use another connection with
autocommit turned on to do the CREATEs, DROPs and ALTERs, along the lines of
the sketch below.
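For example, something along these lines (a sketch only; whether the
Connection object exposes an autocommit switch exactly like this is an
assumption, not confirmed pyPgSQL API):

from pyPgSQL import PgSQL

work = PgSQL.connect(database='mydb')   # transactional work stays here
admin = PgSQL.connect(database='mydb')  # second connection just for DDL
admin.autocommit = 1                    # assumed attribute: DDL commits immediately

cur = admin.cursor()
cur.execute("CREATE INDEX gh_name_idx ON gh (name)")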
___________________________________________________________________________
____ | Billy G. Allie | Domain....: Bil...@mu...
| /| | 7436 Hartwell | MSN.......: B_G...@em...
|-/-|----- | Dearborn, MI 48126|
|/ |LLIE | (313) 582-1540 |
|
|
From: Adam B. <ad...@po...> - 2001-11-23 01:34:35
Attachments:
mbc-test.py
|
Hello,
I'm also interested in making pyPgSQL work well with various string
encodings. I mainly use ISO 8859-2 at the server side and Win CP 1250 or
UTF-8 at the client side.
On Thu, Nov 22, 2001 at 06:46:03AM +0100, Gerhard Häring wrote:
> - Changed the PgSQL module to also accept UnicodeType where it accepts
> StringType
It sounds great to me :)
> - Before sending the query string to the libpq module, check if the query
> string is of type Unicode; if so, encode it via UTF-8 to a StringType and
> send that instead
Well, IMHO it should rather be converted into the current client encoding
of the database. You shouldn't assume that when someone uses Python
Unicode strings, he/she also wants to use UNICODE at the server side. The
reason is that PostgreSQL still does not handle Unicode/UTF-8 completely
(for example, there are problems with Polish diacritical characters that
are absent when an 8-bit encoding is used at the server side).
> - in pgconnection.c, added a read-write attribute clientencoding to the
> PgConnection_Type
I cannot agree with changing anything in pyPgSQL.libpq. It is a
low-level module which has the same functionality as PostgreSQL's
native libpq library. It should only send data to the server and
allow reading the results, nothing more. In particular, it shouldn't
change character encodings implicitly.
At the very least, changing the way libpq deals with strings would
break some of my programs. ;((
However, such functionality should obviously be added to the pyPgSQL.PgSQL
module. It would be nice to write something like this (an example):
conn = PgSQL.connect(database = 'dbname',
                     client_encoding = 'iso8859-2',
                     unicode_results = 0)
Then the PgSQL module should create a new Connection object, make a
connection to the database, and send:
SET CLIENT_ENCODING TO 'LATIN2';
to the PostgreSQL backend. Later, instructions like:
c = conn.cursor()
c.execute(u'select sth from tab where field = %s;', u'aaaa')
should change both Unicode strings to ISO 8859-2, perform argument
substitution, and send the query to the backend. Results should be left
unchanged (encoded in client_encoding), unless "unicode_results == 1",
when all strings should be converted back to Unicode strings.
Please remember also that it is possible that someone uses PostgreSQL
without the Unicode and conversion-on-the-fly facilities. In such
circumstances the "client_encoding" and "unicode_results" variables should
not be set to anything, and PgSQL should not recode any strings (using
Unicode strings should be illegal) nor send "SET CLIENT_ENCODING"
commands to the backend.
I attached a small Python program which checks how PgSQL works with
various client-backend encodings. I wrote it for Billy G. Allie some
time ago. Feel free to use and modify it according to your needs.
Regards,
--
Adam Buraczewski <ad...@po...> * Linux registered user #165585
GCS/TW d- s-:+>+:- a- C+++(++++) UL++++$ P++ L++++ E++ W+ N++ o? K? w--
O M- V- PS+ !PE Y PGP+ t+ 5 X+ R tv- b+ DI? D G++ e+++>++++ h r+>++ y?
|
|
From: Gerhard <ger...@gm...> - 2001-11-26 13:24:25
|
Adam, thanks for letting me know your thoughts about this. I hope it's okay
to CC the list.

On Fri, Nov 23, 2001 at 02:12:11AM +0100, Adam Buraczewski wrote:
> Hello,
>
> I'm also interested in making pyPgSQL work well with various string
> encodings. I mainly use ISO 8859-2 at the server side and Win CP 1250 or
> UTF-8 at the client side.
>
> On Thu, Nov 22, 2001 at 06:46:03AM +0100, Gerhard Häring wrote:
> > - Changed the PgSQL module to also accept UnicodeType where it accepts
> > StringType
>
> It sounds great to me :)
>
> > - Before sending the query string to the libpq module, check if the query
> > string is of type Unicode; if so, encode it via UTF-8 to a StringType and
> > send that instead
>
> Well, IMHO it should rather be converted into the current client encoding
> of the database. You shouldn't assume that when someone uses Python
> Unicode strings, he/she also wants to use UNICODE at the server side. The
> reason is that PostgreSQL still does not handle Unicode/UTF-8 completely
> (for example, there are problems with Polish diacritical characters that
> are absent when an 8-bit encoding is used at the server side).

My implementation now converts from/to Unicode using the currently selected
client_encoding (see below).

> > - in pgconnection.c, added a read-write attribute clientencoding to the
> > PgConnection_Type
>
> I cannot agree with changing anything in pyPgSQL.libpq. [...]

All I did was expose the functions from PostgreSQL's libpq for changing and
querying the current client_encoding. Now I've dropped all this because it's
not necessary and was causing problems (see below).

> However, such functionality should obviously be added to the pyPgSQL.PgSQL
> module. It would be nice to write something like this (an example):
>
> conn = PgSQL.connect(database = 'dbname',
>                      client_encoding = 'iso8859-2',
>                      unicode_results = 0)

I like this proposal very much. Partly because it's almost what I had in
mind anyway :)

> Then the PgSQL module should create a new Connection object, make a
> connection to the database, and send:
>
> SET CLIENT_ENCODING TO 'LATIN2';

I've started implementing this, but when I had almost finished it, I threw
it all away. The reason is that this would become a maintenance nightmare
later on. I'd have to know about all possible names of an encoding at the
Python side, normalize them (using an ugly try/except and
encodings.aliases), and keep a dictionary mapping the Python encoding name
to the PostgreSQL encoding name. This dictionary would have to be updated
whenever new PostgreSQL encodings become available.

I think it's a better idea that the connect method gets an optional
parameter client_encoding (used if and only if conversions to/from Unicode
are done), but the user has to issue a "SET CLIENT_ENCODING TO 'whatever'"
manually, too.

I've changed the connect method (and the Connection constructor) like this:

- add a new parameter client_encoding. If client_encoding is None, it
defaults to sys.getdefaultencoding(); if it is a string,
self.client_encoding is set to (client_encoding,); otherwise it's left
unchanged. The tuple self.client_encoding is expanded into the parameters
of the string encode method and into the second and third parameters of the
unicode() function when doing charset conversion.
- add a unicode_results parameter. If true, the typecast() method in
TypeCache converts strings to Unicode strings using the client_encoding of
the connection object.
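In other words, roughly (a simplified sketch, not the actual patch):

import sys

def normalize_client_encoding(client_encoding):
    # None falls back to the Python default encoding; a bare string
    # becomes a one-element tuple; a tuple such as ('latin-1', 'replace')
    # is kept as-is.
    if client_encoding is None:
        return (sys.getdefaultencoding(),)
    if isinstance(client_encoding, str):
        return (client_encoding,)
    return client_encoding

def to_backend(u, client_encoding):
    # the tuple expands into the parameters of the encode method ...
    return u.encode(*client_encoding)

def from_backend(s, client_encoding):
    # ... and into the second and third parameters of unicode()
    return unicode(s, *client_encoding)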
> to the PostgreSQL backend. Later, instructions like:
>
> c = conn.cursor()
> c.execute(u'select sth from tab where field = %s;', u'aaaa')
>
> should change both Unicode strings to ISO 8859-2, perform argument
> substitution, and send the query to the backend. Results should be left
> unchanged (encoded in client_encoding), unless "unicode_results == 1",
> when all strings should be converted back to Unicode strings.
>
> Please remember also that it is possible that someone uses PostgreSQL
> without the Unicode and conversion-on-the-fly facilities. In such
> circumstances the "client_encoding" and "unicode_results" variables
> should not be set to anything, and PgSQL should not recode any strings
> (using Unicode strings should be illegal) nor send "SET CLIENT_ENCODING"
> commands to the backend.

Hmm. As I said, I'd rather not let pyPgSQL send SET CLIENT_ENCODING
commands, but for finding out whether libpq and/or the backend support
Unicode or charset conversion, I think I'll need additional functions in
libpq (if only for checking whether the MULTIBYTE macro is defined).

> I attached a small Python program which checks how PgSQL works with
> various client-backend encodings. I wrote it for Billy G. Allie some
> time ago. Feel free to use and modify it according to your needs.

Thanks, that will surely be useful for testing.

Everything is far from finished, but I'd like to hear what others (esp.
Billy) think about the interface and whether my approach is right.

Gerhard
--
mail: gerhard <at> bigfoot <dot> de registered Linux user #64239
web: http://www.cs.fhm.edu/~ifw00065/ OpenPGP public key id 86AB43C0
public key fingerprint: DEC1 1D02 5743 1159 CD20 A4B6 7B22 6575 86AB 43C0
reduce(lambda x,y:x+y,map(lambda x:chr(ord(x)^42),tuple('zS^BED\nX_FOY\x0b')))
|
From: Adam B. <ad...@po...> - 2001-11-28 20:24:50
|
On Mon, Nov 26, 2001 at 02:24:13PM +0100, Gerhard Häring wrote:
> > Then the PgSQL module should create a new Connection object, make a
> > connection to the database, and send:
> >
> > SET CLIENT_ENCODING TO 'LATIN2';
>
> I think it's a better idea that the connect method gets an optional
> parameter client_encoding (used if and only if conversions to/from
> Unicode are done), but the user has to issue a "SET CLIENT_ENCODING TO
> 'whatever'" manually, too.
OK, it looks good to me. As a programmer I don't like it when a library
wrapper works behind the scenes and sends commands to a database
backend on its own. However, it would be nice if PgSQL could
automatically send some commands to the PostgreSQL backend on every
session or transaction start. Lately I have convinced Billy (at least I
hope so ;) ) to introduce transaction isolation level support, which is
still absent from other Python interfaces for PostgreSQL (pyPgSQL is
the first, as far as I know). I thought a bit about all that and a
general solution came to my mind: two lists (or, even better:
dictionaries) of strings. One of them should be sent to the PostgreSQL
backend on session start, the other just after every "BEGIN" command.
It would then be possible to write something like this (an example, of
course):
conn = PgSQL.connect(database = "dbname",
                     client_encoding = 'iso8859-2',
                     on_session_start = ["SET CLIENT_ENCODING TO 'LATIN2';"],
                     on_transaction_start = ["SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;"])
.
.
.
conn.on_transaction_start.append("some SQL commands");
I agree that this idea might not be very bright ;) Especially
isolation levels should probably be treated separately, due to their
special meaning. However, such functionality should make it easier to
provide future enhancements, which will likely be demanded by a
growing community of pyPgSQL users :))
What do you think about this?
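For what it's worth, such hooks might be replayed roughly like this (a
sketch; the class and method names are illustrative, not proposed API):

class HookedConnection:
    def __init__(self, conn, on_session_start=None, on_transaction_start=None):
        self.conn = conn
        self.on_session_start = on_session_start or []
        self.on_transaction_start = on_transaction_start or []
        self._run(self.on_session_start)   # replayed once per session

    def _run(self, commands):
        cur = self.conn.cursor()
        for cmd in commands:
            cur.execute(cmd)

    def begin(self):
        # to be called by the wrapper right after it issues BEGIN
        self._run(self.on_transaction_start)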
I'd like to add here that, for me, DBI 2.0's cursors should be used
only for typical data statements like SELECT, INSERT, UPDATE and
DELETE. Other SQL commands (especially those which CREATE or DROP
something, ALTER a database structure, or SET some parameters)
shouldn't be used this way (since they usually cannot be issued during
a transaction, for instance), but the DBI specification does not provide
any good solution for this. I think that this is all because programs
which make use of DBI-compatible libraries should be portable (to
other DBMSes), and that a good, transaction-safe method of sending
these commands to PostgreSQL should be proposed here.
Regards,
--=20
Adam Buraczewski <ad...@po...> * Linux registered user #165585
GCS/TW d- s-:+>+:- a- C+++(++++) UL++++$ P++ L++++ E++ W+ N++ o? K? w--
O M- V- PS+ !PE Y PGP+ t+ 5 X+ R tv- b+ DI? D G++ e+++>++++ h r+>++ y?
|