On Thu, 2007-10-04 at 14:47 +0200, Markus Gritsch wrote:
> On 10/4/07, Oleg Broytmann <phd@...> wrote:
> > On Thu, Oct 04, 2007 at 01:50:52PM +0200, Markus Gritsch wrote:
> > > In MySQLdb queries are *allowed* to be unicode.
> > I meant - there were a period when SQLObject forces queries to be
> > unicode for MySQLdb 1.2.1+. Is that over now? Do we allow unicode but not
> > enforce it regardless of MySQLdb version?
> Ah, I understand. Well, I have not really an idea :( Maybe it would
> be a good idea to decode the string using self.encoding it was the
> case before removing it... David?
I spent nearly all day working on this.
I have to admit that I don't really understand all of the layers fully.
Here's what I thought this morning, not because I understand what
anything does, but because that's what makes the tests pass:
All queries should be plain strings encoded in various arbitrary and
sometimes conflicting methods.
But this is totally not true. The truth is much more complicated.
Imagine you have a table with one latin-1 column and one utf-8 column,
and you want to update both columns. How are you going to send that to
mysql? You might think to send it like this (assuming you want both
to be character 0xf1 ):
"update morx set fleem = '\xf1', baz = '\xc3\xb1';"
Where fleem is the latin-1 column and baz is the utf-8 column. This
doesn't actually work, because MySQL expects you to talk to it in some
fixed charset -- which is set per-connection. And this leads into the
bug that I found:
When a column is encoded in UTF-16, everything goes horribly wrong.
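
To make the mixed-charset query above concrete, here is a minimal sketch in plain Python 3 (no database involved). It only shows what the bytes of that one statement look like when the whole thing is read in a single charset, which is roughly what a per-connection charset amounts to. The variable names are mine, not anything from SQLObject or MySQLdb.

```python
# Python 3 sketch; table and column names are taken from the example query.
ch = "\u00f1"  # the character 0xf1 (n with tilde)

latin1_bytes = ch.encode("latin-1")  # one byte:  f1
utf8_bytes = ch.encode("utf-8")      # two bytes: c3 b1

# The mixed statement, built as raw bytes.
query = (b"update morx set fleem = '" + latin1_bytes +
         b"', baz = '" + utf8_bytes + b"';")

# Read the whole statement as latin-1: this always succeeds, but the
# utf-8 value is seen as two separate latin-1 characters.
as_latin1 = query.decode("latin-1")

# Read the whole statement as utf-8: the lone f1 byte is not valid utf-8.
try:
    query.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(repr(as_latin1))
print("decodes as utf-8:", utf8_ok)
```

Either way one of the two values is misread, so no single connection charset makes this statement mean what we wanted.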
I was thinking this would be pretty easy to fix, but it's not, since
there are so many places that encoding and decoding happen: into
sqlobject, on the column, on the query (formerly) in sqlobject, on the
query in mysqldb, maybe inside the mysql client library, on the mysql
There's this additional connection charset parameter which can differ
from the column charset. The connection charset is what determines how
data is sent into mysql. In the test, UTF-8 happens to work because
treating UTF-8 as Latin-1 (the connection charset) doesn't break
anything on the round trip. UTF-16 doesn't work, because while we can insert
UTF-16 into a latin-1 string, it doesn't make the whole string UTF-16.
So when we try to convert back, we get latin-1 treated as UTF-16. Not
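
Here is a rough sketch of the difference, again in plain Python 3 with no database. The function through_latin1_connection is a hypothetical stand-in for a connection that treats every byte as latin-1 and hands the same bytes back; it is my simplification of what I think happens in the test, not a description of what MySQL really does internally.

```python
text = "a\u00f1os"  # four characters, one of them non-ASCII


def through_latin1_connection(raw):
    # Hypothetical stand-in: every byte is a valid latin-1 character,
    # so decoding and re-encoding as latin-1 returns the same bytes.
    return raw.decode("latin-1").encode("latin-1")


# UTF-8 bytes survive being treated as latin-1 and come back intact.
utf8_back = through_latin1_connection(text.encode("utf-8")).decode("utf-8")

# The ASCII part of a query has the same bytes in utf-8, but not in utf-16,
# so embedding a utf-16 value does not make the rest of the string utf-16.
sql = "update morx set baz = "
ascii_same_as_utf8 = sql.encode("ascii") == sql.encode("utf-8")
ascii_same_as_utf16 = sql.encode("ascii") == sql.encode("utf-16-le")

# If what comes back is latin-1 and we decode it as utf-16, we get
# unrelated characters instead of the original text.
latin1_reply = text.encode("latin-1")       # 4 bytes
misread = latin1_reply.decode("utf-16-le")  # 2 unrelated characters

print(utf8_back == text, ascii_same_as_utf8, ascii_same_as_utf16)
print(repr(misread))
```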
So you might think that we should just make our connection charset
always utf-8, and then have our columns convert their utf-16 to utf-8.
But the way that MySQLdb sets the charset is via "SET NAMES", which also
sets the collation to the default collation of that charset -- whether
or not that collation works for the charset of your columns! Using SET
CHARACTER SET instead fixes that, but breaks further on -- non-ascii
characters (which are in latin-1) are being inserted into latin-1
columns as '?'. This is probably because when we send them across the
connection, we don't encode them in any way, so they end up treated as
invalid utf-8. You might ask why we don't go back to encoding the whole
query -- that's because then we would be double-encoding the stuff that
really is utf-8 -- and we still don't know what we're converting the
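
The two failure modes in that paragraph can be imitated in plain Python 3. This is only an analogy under my assumption about the cause: Python marks undecodable bytes with U+FFFD, where MySQL stores '?', and the double-encoding case assumes the already-utf-8 bytes get read as latin-1 before the whole query is encoded again.

```python
ch = "\u00f1"  # the character 0xf1 (n with tilde)

# Case 1: a latin-1 byte sent unencoded over a utf-8 connection.
# The byte f1 on its own is not valid utf-8, so it gets replaced.
latin1_bytes = ch.encode("latin-1")
read_as_utf8 = latin1_bytes.decode("utf-8", errors="replace")

# Case 2: encoding the whole query again when part of it is already utf-8.
# The two utf-8 bytes are read as two latin-1 characters and each one is
# encoded to utf-8 a second time.
utf8_bytes = ch.encode("utf-8")
double_encoded = utf8_bytes.decode("latin-1").encode("utf-8")

print(repr(read_as_utf8))
print(double_encoded, repr(double_encoded.decode("utf-8")))
```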
I believe I am going slightly crazy here.
I will think about this some more, but possibly not immediately. In the
meantime, if anyone has any thoughts for how to fix this encoding
madness, they would be appreciated.