This is using MySQLdb 1.2.2 with Python 2.5.
When I have a column defined as
foo varchar(20) character set latin1 collate latin1_bin
the value is returned as a str (not unicode) even when charset='utf8' is specified. However, if no collation is specified:
foo varchar(20) character set latin1
unicode is returned.
Attached is a simple script that shows how if collation is set, a str object is returned, but if not set, a unicode object is (properly) returned. The attached script (reproduced below) returns:
(('collate', u'nocollate'),)
import MySQLdb
con = MySQLdb.connect(host='mysql', db='XXX', user='XXX', passwd='XXX', charset='utf8')
cur = con.cursor()
cur.execute("DROP TABLE IF EXISTS collatebug")
cur.execute("CREATE TABLE collatebug (col VARCHAR(20) CHARACTER SET latin1 COLLATE latin1_bin, nocol VARCHAR(20) CHARACTER SET latin1)")
cur.execute("INSERT INTO collatebug (col, nocol) VALUES ('collate', 'nocollate')")
cur.execute("SELECT * FROM collatebug")
res = cur.fetchall()
print res
Test case
It's no coincidence that you are using a binary collation and getting binary (I.e. non-unicode) data back.
But that's just a collation, why would it have an impact on the actual data that's being returned. The character set is latin1 in both cases.
using a binary collation makes it a binary column. see what SHOW CREATE TABLE says about your table schema.
Here's the SHOW CREATE TABLE result:
CREATE TABLE
collatebug
(col
varchar(20) character set latin1 collate latin1_bin default NULL,nocol
varchar(20) default NULL) ENGINE=MyISAM DEFAULT CHARSET=latin1
Here's what MySQL docs say:
"A non-binary string has a character set and is converted to another character set in many cases, even when the string has a _bin collation."
http://dev.mysql.com/doc/refman/5.1/en/charset-binary-collations.html
Here's a little test I did:
mysql> create table col_bin ( col_latin1 varchar(20) character set latin1, col_latin1_bin varchar(20) character set latin1 collate latin1_bin, col_binary varbinary(20));
Query OK, 0 rows affected (0.01 sec)
mysql> show create table col_bin;
...
CREATE TABLE
col_bin
(col_latin1
varchar(20) default NULL,col_latin1_bin
varchar(20) character set latin1 collate latin1_bin default NULL,col_binary
varbinary(20) default NULL) ENGINE=MyISAM DEFAULT CHARSET=latin1
...
Now, using MySQLdb, first let's test with a connection character set of utf8 as per your example.
Note that in description, the field size is 60 for the col_latin1 and col_latin1_bin columns. description_flags tells us that col_latin1_bin and col_binary both have the BINARY flag set. The last field has a length of 20, but we didn't set a character set, which you can't do on VARBINARY. Accordingly, the first value is returned as unicode and the two flagged binary are returned as normal strings. Also note that all three columns have the same column type (253, which is VAR_STRING).
Why 60 when we said 20? Because we are using utf8, and I believe each latin1 character could potentially take as much as three bytes in a utf8 encoding. Surprise!
In this next example, we do the same thing but use latin1 for the connection character set.
The field sizes are different from the first example. Maybe not a big surprise because 20 is what we expected in the first place. As with the first test, the last two fields are flagged binary and so only the first field is returned as unicode.
Summary: Using a binary collation is almost the same as using VARBINARY. You get the BINARY flag set, but you do get some variations with the field size because MySQL doesn't do any sort of character set conversion on VARBINARY.
Can we tell the difference between those last two fields in MySQLdb? Yes and no. See:
http://dev.mysql.com/doc/refman/5.0/en/c-api-datatypes.html
MYSQL_FIELD does have a member charsetnr. 63 means it is a binary character set. So we can test this to see if the field should be returned as unicode or str.
Currently there is no access to this field in MySQLdb-1.2. And even if there were, there is no sane way to fix the type conversion code to handle it. So this is a Won't Fix for MySQLdb-1.2.
However, the MYSQL_FIELD object is exposed in MySQLdb-1.3 (SVN trunk and eventually MySQLdb-2.0) and the type conversion code has already been rewritten to be much more flexible, so fixing it there is nearly trivial.
Wasn't just nearly trivial, it was trivial. Fixed in changeset 608. Leaving open since it does affect MySQLdb-1.2. Workaround: You'll have to convert that field to unicode manually.
Thanks, Andy - I got myself to the point of it setting a binary flag on a column with a _bin collation, I should have updated the ticket to save you some work. But once I got to that point I was wondering whether it is in fact not a MySQLdb but a MySQL bug - what is it doing setting BINARY flag.
Hm, wouldn't an equivalent of this fix in 1.2 be just replacing
if (mask & fields[i].flags) {
with
if (fields[i].charsetnr == 63) {
in _mysql_ResultObject_Initialize() in _mysql.c
Short answer: No.
Long answer: That checks for generic use of field flags. Currently the only use in conversions is BINARY, but it would potential break any custom converter that uses that mechanism. See MySQLdb.converters.conversions to see where that is used. There are also changes made to it in MySQLdb.connections.Connection. It's too big of a change to make in MySQLdb-1.2.
Sorry to jump in... but this was working with 1.2.1. I've just been hitting this same problem yesterday because we decided to go to Python2.6 and we installed 1.2.2 and some of my code broke.
Python 2.5 (r25:51908, Nov 27 2006, 19:14:46)
[GCC 4.1.2 20061115 (prerelease) (SUSE Linux)] on linux2
Type "help", "copyright", "credits" or "license" for more information.