Thread: [SQLObject] bug with UnicodeCol
SQLObject is a Python ORM.
From: Max I. <ma...@uc...> - 2004-12-29 08:21:32
|
Hi, I believe I found a bug with either the new UnicodeCol or the selectBy method, but I'm not sure, hence posting it here instead of the SF bug tracker.

I have a table with a column declared like this:

    class User(SQLObject):
        ...
        email = UnicodeCol(length=96, default=None, unique=True)

Getting and setting the value with a unicode string works fine, i.e. both

    user.email = u'something'

and

    print user.email  # -> u'something'

work fine. The problem is with the following code:

    dbusers = User.selectBy(isGuest=False, email=email)
    n = dbusers.count()

It yields an error:

    File "D:\Python23\Lib\site-packages\sqlobject\main.py", line 1322, in count
        count = self.accumulate('COUNT(*)')
    File "D:\Python23\Lib\site-packages\sqlobject\main.py", line 1316, in accumulate
        return conn.accumulateSelect(self,expression)
    File "D:\Python23\Lib\site-packages\sqlobject\dbconnection.py", line 256, in accumulateSelect
        val = int(self.queryOne(q)[0])
    File "D:\Python23\Lib\site-packages\sqlobject\dbconnection.py", line 230, in queryOne
        return self._runWithConnection(self._queryOne, s)
    File "D:\Python23\Lib\site-packages\sqlobject\dbconnection.py", line 125, in _runWithConnection
        val = meth(conn, *args)
    File "D:\Python23\Lib\site-packages\sqlobject\dbconnection.py", line 223, in _queryOne
        self._executeRetry(conn, c, s)
    File "D:\Python23\Lib\site-packages\sqlobject\dbconnection.py", line 196, in _executeRetry
        return cursor.execute(query)
    TypeError: argument 1 must be str, not unicode

I guess this is because email has not been coerced from a unicode string to the db encoding, as is done by the normal getters and setters.
|
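The failure above can be worked around outside SQLObject by pre-encoding text query values to the database encoding before they reach selectBy, mirroring what the column setters already do. Below is a minimal standalone sketch in modern Python 3 (where `str` is the unicode type); the `encode_params` helper and the `DB_ENCODING` value are hypothetical illustrations, not part of SQLObject:

```python
DB_ENCODING = "utf-8"  # assumption: whatever encoding the backend expects

def encode_params(params, encoding=DB_ENCODING):
    """Return a copy of params with all text values encoded to bytes,
    so a legacy DB driver that only accepts byte strings never sees
    raw unicode values."""
    out = {}
    for key, value in params.items():
        if isinstance(value, str):
            value = value.encode(encoding)
        out[key] = value
    return out

params = encode_params({"isGuest": False, "email": "m\u00e4x@example.com"})
# User.selectBy(**params) would then receive already-encoded values
```

This is exactly the per-call-site workaround; the rest of the thread discusses how to push the same coercion down into SQLObject itself.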
From: Ian B. <ia...@co...> - 2004-12-29 17:11:23
|
Max Ischenko wrote:
> getting and setting the value with unicode string works fine. I.e. both
> user.email = u'something'
> and
> print user.email # -> u'something'
> works fine.
>
> The problem is with the following code:
>
> dbusers = User.selectBy(isGuest=False, email=email)
> n = dbusers.count()
>
> It yields an error:
...
> File "D:\Python23\Lib\site-packages\sqlobject\dbconnection.py", line
> 196, in _executeRetry
> return cursor.execute(query)
> TypeError: argument 1 must be str, not unicode
>
> Guess this is because email has not been coerced from unicode string to
> db encoding as being done by normal getters and setters.

That is true. I'm not sure how best to resolve this. There were a couple of ideas when unicode columns first came up. One was adding an encoding to converters.UnicodeConverter; really there should be *some* UnicodeConverter even in converters, and perhaps a default encoding (which might default to ASCII, which is implicitly the case now).

Ideally, each column would do its own quoting, so that a UnicodeCol would know its own encoding. But while that would allow for a database with multiple encodings (or maybe multiple databases with multiple encodings), that might not be a common-enough use case. I want to get rid of .q entirely and make columns descriptors with a __sqlrepr__ method; at that point it would be much easier to make this addition.

--
Ian Bicking / ia...@co... / http://blog.ianbicking.org
|
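Ian's first idea, a unicode converter with a default encoding, could look roughly like the following. This is a sketch in modern Python; the function name, the quoting rule, and the ASCII default are assumptions drawn from the thread, not SQLObject's actual converter API:

```python
DEFAULT_ENCODING = "ascii"  # the thread notes ASCII is implicitly the case now

def sqlrepr_unicode(value, encoding=None):
    """Hypothetical unicode converter: encode the value to the target
    encoding, then quote it as a SQL string literal (doubling any
    embedded single quotes)."""
    encoded = value.encode(encoding or DEFAULT_ENCODING)
    return b"'" + encoded.replace(b"'", b"''") + b"'"

# With an explicit encoding, non-ASCII text converts cleanly:
sqlrepr_unicode(u"O'Brien", encoding="utf-8")
# With the ASCII default, non-ASCII input raises UnicodeEncodeError,
# matching the status-quo behaviour the thread describes.
```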
From: Stuart B. <st...@st...> - 2004-12-29 22:56:58
|
Ian Bicking wrote:
> That is true. I'm not sure how to best resolve this. There were a
> couple ideas when unicode columns first came up. One was adding an
> encoding to converters.UnicodeConverter; really there should be *some*
> UnicodeConverter even in converters, and perhaps a default encoding
> (which might default to ASCII, which is implicitly the case now).
>
> Ideally, each column would do its own quoting, so that a UnicodeCol
> would know its own encoding. But while that would allow for a database
> with multiple encodings (or maybe multiple databases with multiple
> encodings), that might not be a common-enough use case. I want to get
> rid of .q entirely, and make columns descriptors with a __sqlrepr__
> method; at that point it would be much easier to make this addition.

Is there actually a use case for allowing each column to have a different encoding? I know for PostgreSQL it is simply a matter of setting the database encoding to Unicode and sending everything as UTF-8 by encoding the entire query (which takes care of other issues like Unicode column names as well). The only use cases I can come up with for your scenario should be using BINARY columns instead of VARCHAR - in particular, since the database doesn't know the encoding you are using, all your basic string operations, sorting etc. are now broken.

Hmm... perhaps if you need to store text in some encoding that doesn't contain the ASCII character set it might be necessary, but I don't know what character sets these are or if any databases actually support them. I've gone through the list of encodings PostgreSQL supports, and they all contain the basic latin letters and can be used to encode SQL statements, so I suspect this is not a requirement.

As I previously mentioned on this list, we are using an SQLObject patched to do just this - no need for UnicodeCol at all. Just encode the entire query before sending it to the backend, and decode all strings to Unicode on the way back out. Best practice, and no risk of accidentally polluting your database with badly encoded data, or booby traps set off when code that assumed ASCII gets who-knows-what encoded data.

--
Stuart Bishop <st...@st...>
http://www.stuartbishop.net/
|
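Stuart's "encode the entire query" approach can be sketched as a thin wrapper around a DB-API cursor. Everything below is hypothetical illustration (including the toy `FakeCursor` standing in for a legacy driver), not the actual patch:

```python
class FakeCursor:
    """Toy stand-in for a legacy driver cursor that only accepts bytes."""
    def execute(self, query):
        assert isinstance(query, bytes), "legacy driver rejects unicode"
        self.last_query = query
    def fetchall(self):
        # pretend the backend returned UTF-8 encoded bytes
        return [(b"caf\xc3\xa9", 1)]

class EncodingCursor:
    """Wrap a DB-API cursor so callers work in unicode throughout:
    the whole query is encoded on the way in, and text results are
    decoded on the way back out."""
    def __init__(self, cursor, encoding="utf-8"):
        self.cursor = cursor
        self.encoding = encoding

    def execute(self, query):
        # encode the entire statement, so unicode table names, column
        # names, and WHERE clauses are all handled uniformly
        return self.cursor.execute(query.encode(self.encoding))

    def fetchall(self):
        return [
            tuple(
                col.decode(self.encoding) if isinstance(col, bytes) else col
                for col in row
            )
            for row in self.cursor.fetchall()
        ]
```

Because the encoding happens at a single choke point, no other code needs to care whether the driver accepts unicode.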
From: Ian B. <ia...@co...> - 2004-12-30 17:04:09
|
Stuart Bishop wrote:
> Is there actually a use case for allowing each column to have a
> different encoding?

Probably not. Though there's probably a use case for every database connection to have a different encoding.

> I know for PostgreSQL it is simply a matter of
> setting the database encoding to Unicode and sending everything as UTF-8
> by simply encoding the entire query (which takes care of other issues
> like Unicode column names as well). The only use cases I can come up
> with for your scenario should be using BINARY columns instead of VARCHAR
> - in particular, since the database doesn't know the encoding you are
> using then all your basic string operations, sorting etc. are now broken.

This suggests we should do it in a way that allows Unicode-aware databases to get Unicode data directly, while other databases use transparent encoding.

> Hmm... perhaps if you need to store text in some encoding that doesn't
> contain the ASCII character set it might be necessary, but I don't know
> what character sets these are or if any databases actually support them.
> I've gone through the list of encodings PostgreSQL supports and they all
> contain the basic latin letters and can be used to encode SQL
> statements, so I suspect this is not a requirement.

That seems overly aggressive. It just feels very wrong to encode the entire query.

--
Ian Bicking / ia...@co... / http://blog.ianbicking.org
|
From: Stuart B. <st...@st...> - 2005-01-02 04:04:12
|
Ian Bicking wrote:
>> Is there actually a use case for allowing each column to have a
>> different encoding?
>
> Probably not. Though there's probably a use case for every database
> connection to have a different encoding.

The use case is for non-Unicode-aware databases that want to store text in a particular character set. I imagine the encoding would be specified as part of the connection string. There would be an efficiency advantage to specifying the encoding for Unicode-aware databases too, if your Unicode-aware database supports multiple client encodings (such as PostgreSQL), although I suspect it wouldn't be noticeable unless SQLObject is modified to use bound parameters.

>> I know for PostgreSQL it is simply a matter of
>> setting the database encoding to Unicode and sending everything as UTF-8
>> by simply encoding the entire query (which takes care of other issues
>> like Unicode column names as well). The only use cases I can come up
>> with for your scenario should be using BINARY columns instead of VARCHAR
>> - in particular, since the database doesn't know the encoding you are
>> using then all your basic string operations, sorting etc. are now broken.
>
> This suggests we should do it in a way that we allow Unicode-aware
> databases to get Unicode data directly, and other databases use
> transparent encoding.
>
>> Hmm... perhaps if you need to store text in some encoding that doesn't
>> contain the ASCII character set it might be necessary, but I don't know
>> what character sets these are or if any databases actually support them.
>> I've gone through the list of encodings PostgreSQL supports and they all
>> contain the basic latin letters and can be used to encode SQL
>> statements, so I suspect this is not a requirement.
>
> That seems overly aggressive. It just feels very wrong to encode the
> entire query.
Ideally, we just throw a Unicode SQL command at the database driver (which for PostgreSQL is possible with psycopg2). For psycopg 1, you have to take care of the encoding yourself, which is simply a matter of issuing a 'SET client_encoding TO UNICODE' and then encoding all Unicode strings as UTF-8 <rant>(because PostgreSQL, like Java, seems to have decided Unicode == UTF8)</rant>.

Encoding the entire query has the advantage that Unicode column names, Unicode table names, Unicode in WHERE clauses etc. are all handled correctly. E.g.:

    Foo.select(u"WHERE name >= '\N{LATIN CAPITAL LETTER A WITH GRAVE}'")

If we don't encode the entire query, developers have to worry about which parts of SQLObject require ASCII-only strings and which parts accept Unicode strings, which is really frustrating to those of us following the recommended 'Unicode everywhere' practice. It will also cause trouble with modern DB drivers that happily accept Unicode strings and do the right thing, because you have no idea what encoding the connection is set to use. It could be argued that the correct thing for them to do if they receive a non-ASCII traditional string is to raise an exception (since the encoding is not known, it can't tell the backend what it is).

I don't see any advantage to only encoding portions of the query. Forcing parts of the SQL statement to remain ASCII would be needlessly restrictive and a source of bugs (since Unicode strings are viral in Python, you often find them cropping up in places you didn't expect them).

Internally, we have patched SQLObject to *always* return Unicode strings and transparently encode/decode. I'd say *this* might be overly aggressive because it is not backwards compatible (and is the reason we never pushed this patch back upstream), but there needs to be an option to do it this way, because 'Unicode everywhere' has been recommended practice since Unicode support was first bolted onto Python.
It also means that when I'm wearing my DBA hat I don't have to worry about other developers pissing in the pool and polluting my nice clean database with meaningless bytestreams.

I think the following may be a good design, which maintains backwards compatibility for people working with legacy systems or who are ideologically opposed to working with Unicode strings. It isn't best practice, but might be common ground. It also doesn't involve much work ;)

1) The bulk of SQLObject doesn't care what sort of strings it sees. It just works with strings as Python intended.

2) At the point of issuing the cursor.execute(), the query will be encoded into the encoding the database backend expects to see (or just passed through as a Unicode string if the driver supports that). For non-Unicode-aware databases, the developer will need to specify the encoding when opening the connection (defaulting to ASCII).

3) When results are retrieved, they are returned as-is (traditional string, encoded) if the column type is StringCol, or decoded into Unicode if the column type is UnicodeCol. I don't think adding a 'unicode=True' parameter to StringCol would be good, as developers will forget to add it and we end up with hidden bugs again.

4) Docs are updated to use UnicodeCol rather than StringCol.

Point 4 is actually important, as otherwise people will continue to use StringCol. This will cause them trouble when they throw Unicode at it (which stores correctly, but they get encoded strings back).

--
Stuart Bishop <st...@st...>
http://www.stuartbishop.net/
|
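The four-point design above can be sketched like this. The class names, the toy `LegacyDriver`, and the column-type tags are hypothetical illustrations of the proposal, not SQLObject's real API:

```python
class LegacyDriver:
    """Toy non-Unicode-aware driver: accepts only byte strings."""
    def execute(self, query):
        assert isinstance(query, bytes)
        self.last_query = query

class Connection:
    """Sketch of points 1-3: the encoding is fixed when the connection
    is opened, queries are encoded only at the execute() boundary, and
    results are decoded per declared column type."""
    def __init__(self, driver, encoding="ascii"):
        self.driver = driver
        self.encoding = encoding  # point 2: per-connection, defaults to ASCII

    def execute(self, query):
        # point 1: the rest of the code passes strings around freely;
        # point 2: encoding happens only here, at cursor.execute() time
        self.driver.execute(query.encode(self.encoding))

    def decode_row(self, row, column_types):
        # point 3: UnicodeCol values are decoded; StringCol values
        # pass through as-is (still encoded)
        return tuple(
            value.decode(self.encoding) if ctype == "UnicodeCol" else value
            for value, ctype in zip(row, column_types)
        )
```

One consequence of point 3 is visible here: the same connection can hand back decoded text and raw bytes side by side, depending only on how each column was declared.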
From: Oleg B. <ph...@ph...> - 2004-12-30 08:27:58
|
On Wed, Dec 29, 2004 at 10:19:17AM +0200, Max Ischenko wrote:
> dbusers = User.selectBy(isGuest=False, email=email)
> n = dbusers.count()

Try

    dbusers = User.selectBy(isGuest=False, email=email.encode(dbEncoding))

Oleg.
--
Oleg Broytmann  http://phd.pp.ru/  ph...@ph...
Programmers don't die, they just GOSUB without RETURN.
|
From: Max I. <ma...@uc...> - 2004-12-30 15:08:56
|
Oleg Broytmann wrote:
> On Wed, Dec 29, 2004 at 10:19:17AM +0200, Max Ischenko wrote:
>> dbusers = User.selectBy(isGuest=False, email=email)
>> n = dbusers.count()
>
> Try dbusers = User.selectBy(isGuest=False, email=email.encode(dbEncoding))

Hey, surely I could figure that out myself; that was not the point. ;-)
|
From: Oleg B. <ph...@ph...> - 2004-12-30 15:17:12
|
On Thu, Dec 30, 2004 at 05:06:24PM +0200, Max Ischenko wrote:
>> Try dbusers = User.selectBy(isGuest=False,
>> email=email.encode(dbEncoding))
>
> Hey, surely I could figure that out myself; that was not the point. ;-)

But currently it seems to be the only way.

Oleg.
--
Oleg Broytmann  http://phd.pp.ru/  ph...@ph...
Programmers don't die, they just GOSUB without RETURN.
|