Thread: [SQLObject] Unicode handling with sqlobject's UnicodeCol?
SQLObject is a Python ORM.
Brought to you by:
ianbicking,
phd
From: <ber...@zk...> - 2007-02-21 12:20:34
|
Hello all

I've tried to correct our string handling by using sqlobject's UnicodeCol and stumbled a bit.

The following script produces errors when assigning the string or the unicode column:

    # -- test.py
    # -!- encoding: iso8859-1 -!-
    import sqlobject
    dbURI = 'mysql://user1:pass1@db-server/db1'
    sqlobject.sqlhub.processConnection = sqlobject.connectionForURI(dbURI)

    class A(sqlobject.SQLObject):
        aString = sqlobject.StringCol(default = None)
        aUnicodeString = sqlobject.UnicodeCol(default = None)

    try: A.createTable()
    except: pass

    a = A()
    a.aString = 'gaääää'
    a.aUnicodeString = u'ga\u00ef\u00ef\u00ef'
    # -- test.py

(The string I'm trying to assign contains German umlauts, just in case it won't display correctly...)

I would expect that at least the second assignment should be valid. But, as the first assignment's value is also a valid Python string, shouldn't this be valid, too?

Do I need to change the sqlobject.sqlhub.processConnection's encoding property?

How is this done correctly?

Thanks a lot!
Bernhard

(And sorry for the messy signature...)
___________________________________________________________________

Disclaimer:

This message is intended only for the recipient. Should it be received by unauthorised persons, we kindly ask them to return the message to ZKB and then to destroy or delete the message together with all attachments and any copies. Use of the information is prohibited.

This message is intended only for the named recipient and may contain confidential or privileged information. If you have received it in error, please advise the sender by return e-mail and delete this message and any attachments. Any unauthorised use or dissemination of this information is strictly prohibited.
|
From: Oleg B. <ph...@ph...> - 2007-02-21 12:34:11
|
On Wed, Feb 21, 2007 at 01:20:31PM +0100, ber...@zk... wrote:
> The following script produces errors when assigning the string or the
> unicode column:

   And what are those errors?

Oleg.
--
     Oleg Broytmann            http://phd.pp.ru/            ph...@ph...
           Programmers don't die, they just GOSUB without RETURN.
|
From: <ber...@zk...> - 2007-02-22 08:03:29
|
> On Wed, Feb 21, 2007 at 01:20:31PM +0100, ber...@zk... wrote:
> > The following script produces errors when assigning the string or the
> > unicode column:
>
> And what are those errors?

I'm sorry, I forgot to paste this:

    Traceback (most recent call last):
      File "test.py", line 17, in ?
        a.aString = 'gaäää'
      File "<string>", line 1, in <lambda>
      File "/home/bernhard/sqlobject/sqlobject/main.py", line 1081, in _SO_setValue
        self._connection._SO_update(
      File "/home/bernhard/sqlobject/sqlobject/dbconnection.py", line 614, in _SO_update
        self.query("UPDATE %s SET %s WHERE %s = (%s)" %
      File "/home/bernhard/sqlobject/sqlobject/dbconnection.py", line 316, in query
        return self._runWithConnection(self._query, s)
      File "/home/bernhard/sqlobject/sqlobject/dbconnection.py", line 230, in _runWithConnection
        val = meth(conn, *args)
      File "/home/bernhard/sqlobject/sqlobject/dbconnection.py", line 313, in _query
        self._executeRetry(conn, conn.cursor(), s)
      File "/home/bernhard/sqlobject/sqlobject/mysql/mysqlconnection.py", line 98, in _executeRetry
        myquery = unicode(query, self.encoding)
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 33: ordinal not in range(128)

I now found that adding

    sqlobject.sqlhub.processConnection.encoding = 'iso8859-1'

to the code resolves the errors. What exactly am I doing with this? Is it even legal? :-)

Thanks a lot
Bernhard
|
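The failure in the traceback above can be reproduced without a database. The following is only a minimal sketch, assuming Python 3, where the thread's Python 2 `str` corresponds to `bytes` and `unicode` corresponds to `str`. The helper `decode_query` is a hypothetical stand-in for the `unicode(query, self.encoding)` call shown in the traceback; it is not SQLObject code.

```python
# UTF-8 bytes for 'gaäää'; 0xc3 is the first byte of each encoded 'ä'.
query = 'ga\u00e4\u00e4\u00e4'.encode('utf-8')


def decode_query(raw, encoding):
    """Hypothetical stand-in for unicode(query, self.encoding)."""
    return raw.decode(encoding)


# With the 'ascii' codec, any byte >= 0x80 raises UnicodeDecodeError.
try:
    decode_query(query, 'ascii')
    ascii_failed = False
    failing_byte = None
except UnicodeDecodeError as exc:
    ascii_failed = True
    failing_byte = query[exc.start]  # the byte the codec rejected

# The codec that matches the bytes recovers the original text.
as_utf8 = decode_query(query, 'utf-8')

# iso8859-1 maps every byte to a character, so it never raises,
# but for UTF-8 input it silently yields the wrong characters.
as_latin1 = decode_query(query, 'iso8859-1')
```

Because iso8859-1 accepts every byte value, switching to it makes the exception go away whether or not the bytes really are Latin-1, so the absence of an error alone does not show that the text was interpreted correctly.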
From: Oleg B. <ph...@ph...> - 2007-02-22 08:14:49
|
On Thu, Feb 22, 2007 at 09:01:44AM +0100, ber...@zk... wrote:
> myquery = unicode(query, self.encoding)
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 33:
> ordinal not in range(128)

   This is what I suspected - MySQLdb 1.2.1, which insists on using unicode.

> I now found that adding
>
> sqlobject.sqlhub.processConnection.encoding = 'iso8859-1'
>
> to the code resolves the errors. What exactly am I doing with this? Is it
> even legal? :-)

   It is legal, but it is only the second half of the solution. You'd better pass the encoding to the connection constructor so that it passes it further down to the MySQLdb connection. You can do it using the connection string:

   URI = "mysql://host/db?charset=iso8859-1&sqlobject_encoding=iso8859-1"

   (It seems "charset" and "sqlobject_encoding" are always equal, so the patch
http://sourceforge.net/tracker/index.php?func=detail&aid=1653898&group_id=74338&atid=540674
unifies them.)

Oleg.
|
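As a rough illustration of the connection string described above, the sketch below assembles such a URI with the standard library. `build_db_uri` is a hypothetical helper, not part of SQLObject; only the parameter names `charset` and `sqlobject_encoding` and the example URI come from the message. The code assumes Python 3.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_db_uri(base, charset, sqlobject_encoding=None):
    """Append charset/sqlobject_encoding parameters to a database URI.

    Hypothetical helper for illustration; if sqlobject_encoding is
    omitted, it defaults to the same value as charset.
    """
    if sqlobject_encoding is None:
        sqlobject_encoding = charset
    query = urlencode({
        'charset': charset,
        'sqlobject_encoding': sqlobject_encoding,
    })
    return base + '?' + query


uri = build_db_uri('mysql://host/db', 'iso8859-1')

# Parse the URI back to check which parameters it carries.
params = parse_qs(urlsplit(uri).query)
```

The resulting string would then be handed to `sqlobject.connectionForURI(uri)`, as in the script from the first message.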
From: Lutz S. <l.s...@4c...> - 2007-02-21 12:35:17
|
Hello Bernhard,

I don't have a real solution for you, but here is a howto (in German) about unicode with Python:
http://wiki.python.de/Von_Umlauten%2C_Unicode_und_Encodings

After reading this I had the big AHA about unicode and encodings.

Maybe it helps.

Kind regards
Lutz Steinborn
4c AG
|
From: <ber...@zk...> - 2007-02-22 17:43:05
|
> I have not a real solution for you but a howto (in german) about
> unicode with Python:
> http://wiki.python.de/Von_Umlauten%2C_Unicode_und_Encodings
>
> After reading this I had the big AHA about unicode and
> encodings.

Same here, too :-). I had indeed got the meanings of encode() and decode() completely wrong... Thanks for this very informative link!

@Oleg:
I now have charset=latin1 and sqlobject_encoding=utf-8 in my db URI string. Just to get this right: sqlobject_encoding tells sqlobject how to treat strings coming from and going to the user, whereas charset defines the encoding of strings to/from the DB, a conversion that should happen under the hood, right?

The thing is, our mysql db has a latin-1 encoding while I'd rather want our program files (and ultimately the web frontend) to be in UTF-8, so I'd even expect the two encodings to be different.

Does this seem reasonable?

Thanks all!
Bernhard
|
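The encode()/decode() distinction and the idea of two different encodings on the two sides can be sketched as follows. This assumes Python 3 (`bytes` for the thread's `str`, `str` for `unicode`); `recode` is a hypothetical helper and does not show what SQLObject or MySQLdb do internally.

```python
def recode(raw, source_encoding, target_encoding):
    """Re-encode a byte string by going through unicode text."""
    text = raw.decode(source_encoding)   # decode: bytes -> text
    return text.encode(target_encoding)  # encode: text -> bytes


# 'äöü' as the application would hold it in UTF-8: two bytes per character.
app_bytes = '\u00e4\u00f6\u00fc'.encode('utf-8')

# The same text as a latin-1 database would store it: one byte per character.
db_bytes = recode(app_bytes, 'utf-8', 'latin1')

# Reading back requires the reverse conversion.
round_trip = recode(db_bytes, 'latin1', 'utf-8')
```

The conversion is lossless here only because every character involved exists in latin-1; text containing characters outside latin-1 would make the `encode` step raise `UnicodeEncodeError`.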
From: Oleg B. <ph...@ph...> - 2007-02-22 17:45:43
|
On Thu, Feb 22, 2007 at 06:42:56PM +0100, ber...@zk... wrote:
> The thing is, our mysql db has a latin-1 encoding while I'd rather want our
> program files (and ultimately the web frontend) to be in UTF-8, so I'd even
> expect the two encodings to be different.
>
> Does this seem reasonable?

   It does, indeed.

Oleg.
|
From: <ber...@zk...> - 2007-02-23 10:43:55
|
sql...@li... wrote on 22.02.2007 18:45:35:

> On Thu, Feb 22, 2007 at 06:42:56PM +0100, ber...@zk... wrote:
> > The thing is, our mysql db has a latin-1 encoding while I'd rather want our
> > program files (and ultimately the web frontend) to be in UTF-8, so I'd even
> > expect the two encodings to be different.
> >
> > Does this seem reasonable?
>
> It does, indeed.
>

Now I'm still confused. This seems to work for utf-8 data written TO the DB: i.e. my ä, ö and ü are correctly visible in the DB, as latin-1 encoded strings. When reading FROM the DB, I'd expect someone (mysql or sqlobject) to convert those strings back to utf-8, which isn't happening, though.

It seems like e.g. some changes in the string validator would do the job:

    class StringValidator(validators.Validator):

        def to_python(self, value, state):
            if value is None:
                return None
            if isinstance(value, str):
                # Convert from connection.dbEncoding to connection.encoding via unicode
                connection = state.soObject._connection
                soEncoding = getattr(connection, "encoding", None) or sys.getdefaultencoding()
                dbEncoding = getattr(connection, "dbEncoding", None) or 'ascii'
                if soEncoding == dbEncoding:
                    return value
                else:
                    return value.decode(dbEncoding).encode(soEncoding)
            if isinstance(value, unicode):
                connection = state.soObject._connection
                encoding = getattr(connection, "encoding", None) or sys.getdefaultencoding()
                return value.encode(encoding)
            return value

        def from_python(self, value, state):
            if value is None:
                return None
            if isinstance(value, str):
                return value
            if isinstance(value, unicode):
                # Look up the connection the same way as in to_python
                connection = state.soObject._connection
                dbEncoding = getattr(connection, "dbEncoding", None) or 'ascii'
                return value.encode(dbEncoding)
            return value

This works (at least for me), but doesn't seem the right place to do it. What do you think?

Thanks,
Bernhard
|
From: Oleg B. <ph...@ph...> - 2007-02-23 10:56:53
|
On Fri, Feb 23, 2007 at 11:43:51AM +0100, ber...@zk... wrote:
> Now I'm still confused. This seems to work for utf-8 data written TO the
> DB: i.e. my ä, ö and ü are correctly visible in the DB, as latin-1 encoded
> strings. When reading FROM the DB, I'd expect someone (mysql or sqlobject)
> to convert those strings back to utf-8, which isn't happening, though.
>
> It seems like e.g. some changes in the string validator would do the job:

   StringValidator is intended to pass strings back and forth as is. For recoding there is UnicodeCol and UnicodeValidator, but it only needs one encoding - dbEncoding.

Oleg.
|
From: <ber...@zk...> - 2007-02-23 13:03:45
|
> StringValidator is intended to pass strings back and forth as is. For
> recoding there is UnicodeCol and UnicodeValidator, but it only needs one
> encoding - dbEncoding.
>

Ok, I see that one.

On the other hand, I would be quite happy to go with StringCols instead of UnicodeCols. Strings are easier to handle with our C++ Python extensions, and then there are some restrictions on the UnicodeCols in sqlobject, too.

Looking at the UnicodeStringValidator, I guess that the validators' task is not only to validate the incoming/outgoing values, but also to do the needed conversions, right? In that case, my code in StringValidator wouldn't be completely out of place, would it?

Bernhard
|
From: Oleg B. <ph...@ph...> - 2007-02-23 14:38:35
|
On Fri, Feb 23, 2007 at 02:03:44PM +0100, ber...@zk... wrote:
> On the other hand, I would be quite happy to go with StringCols instead of
> UnicodeCols. Strings are easier to handle with our c++ python extensions
> and then there are some restrictions to the UnicodeCols in sqlobject, too.
>
> Looking at the UnicodeStringValidator, I guess that the validators' task is
> not only to validate the incoming/outgoing values, but also to do the
> needed conversions, right? In that case, my code in StringValidator
> wouldn't be completely off place, would it?

   It would be. StringValidator is intended to pass strings back and forth as is.

   But you are not restricted to the set of columns provided by SQLObject. You can write your own column type and your own validator/converter. To simplify things you can use StringCol and SOStringCol as base classes. After writing such things you can publish them here so people can decide whether they are worth including in SQLObject.

   There is one limitation though - the "fromDatabase" machinery wouldn't know about your column.

Oleg.
|
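A converter of the kind suggested above might look roughly like this. It is a standalone sketch, assuming Python 3 (`bytes` for the thread's `str`, `str` for `unicode`). `RecodingValidator` and its constructor arguments are hypothetical; the class deliberately does not subclass anything from SQLObject, and wiring it into a column derived from StringCol/SOStringCol is not shown. Only the method names `to_python` and `from_python` are taken from the code posted earlier in the thread. Unlike that code, this sketch also recodes byte strings on the way to the database.

```python
class RecodingValidator:
    """Hypothetical converter between a database and an application encoding."""

    def __init__(self, db_encoding, app_encoding):
        self.db_encoding = db_encoding
        self.app_encoding = app_encoding

    def to_python(self, value, state=None):
        """Database -> application: return bytes in the application encoding."""
        if value is None:
            return None
        if isinstance(value, bytes):
            # Recode via text: decode with the DB encoding, encode for the app.
            return value.decode(self.db_encoding).encode(self.app_encoding)
        if isinstance(value, str):
            return value.encode(self.app_encoding)
        return value

    def from_python(self, value, state=None):
        """Application -> database: return bytes in the database encoding."""
        if value is None:
            return None
        if isinstance(value, bytes):
            return value.decode(self.app_encoding).encode(self.db_encoding)
        if isinstance(value, str):
            return value.encode(self.db_encoding)
        return value


validator = RecodingValidator(db_encoding='latin1', app_encoding='utf-8')
from_db = validator.to_python(b'\xe4')        # latin-1 'ä' from the database
to_db = validator.from_python(b'\xc3\xa4')    # UTF-8 'ä' from the application
```

Keeping both encodings as explicit constructor arguments avoids depending on connection attributes, which makes the conversion easy to test in isolation.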
From: Oleg B. <ph...@ph...> - 2007-02-21 12:47:22
|
On Wed, Feb 21, 2007 at 01:35:09PM +0100, Lutz Steinborn wrote:
> http://wiki.python.de/Von_Umlauten%2C_Unicode_und_Encodings
>
> After reading this I had the big AHA about unicode and
> encodings.

   I don't think the problem with the program is in unicode itself. I suspect the problem is in the "unicode-insisting" version of MySQLdb. There has to be a proper charset/encoding set, and there has to be proper handling of them in SQLObject's MySQLConnection (and I am not sure how good that is, though there is a constant stream of patches from MySQL users...)

> On Wed, 21 Feb 2007 13:20:31 +0100
> ber...@zk... wrote:
>
> > a = A()
> > a.aString = 'gaääää'
> > a.aUnicodeString = u'ga\u00ef\u00ef\u00ef'

Oleg.
|
From: Simon C. <hod...@gm...> - 2007-02-21 14:25:08
|
On 2/21/07, Oleg Broytmann <ph...@ph...> wrote:
> I don't think a problem with the program is in unicode itself. I suspect
> the problem is in "unicode-insisting" version of MySQLdb. There have to be
> a proper charset/encoding set, and there have to be a proper handling of
> them in SQLObject's MySQLConnection (and I am not sure how good that is,
> though there is a constant stream of patches from MySQL users...)

Agreed, with MySQL you'll definitely need to use the sqlobject_encoding parameter in the database connection URI. I'm currently using non-ASCII characters and UnicodeString columns without problems on MySQL 5 with a connection string like:

mysql://user:pass@host/dbname/?sqlobject_encoding=latin1

I've gotten the impression that with MySQL 4 one is unlikely to have much luck (haven't tested it myself).

Schiavo
Simon
|
From: Oleg B. <ph...@ph...> - 2007-02-21 14:31:21
|
On Wed, Feb 21, 2007 at 04:24:58PM +0200, Simon Cross wrote:
> I've gotten the impression that with MySQL 4 one is unlikely to have
> much luck

   Why? There is a patch at
http://sourceforge.net/tracker/index.php?func=detail&aid=1653898&group_id=74338&atid=540674
and I would very much like to hear from a few people who would want to test it...

Oleg.
|