#276 charset='utf8' is ignored in case of bin collation

MySQLdb-1.3
open
Andy Dustman
MySQLdb (285)
5
2012-09-19
2009-03-04
Gregory Golberg
No

This is using MySQLdb 1.2.2 with Python 2.5.

When I have a column defined as

foo varchar(20) character set latin1 collate latin1_bin

the value is returned as a str (not unicode) even when charset='utf8' is specified. However, if no collation is specified:

foo varchar(20) character set latin1

unicode is returned.

Attached is a simple script that shows how if collation is set, a str object is returned, but if not set, a unicode object is (properly) returned. The attached script (reproduced below) returns:

(('collate', u'nocollate'),)

import MySQLdb
con = MySQLdb.connect(host='mysql', db='XXX', user='XXX', passwd='XXX', charset='utf8')
cur = con.cursor()
cur.execute("DROP TABLE IF EXISTS collatebug")
cur.execute("CREATE TABLE collatebug (col VARCHAR(20) CHARACTER SET latin1 COLLATE latin1_bin, nocol VARCHAR(20) CHARACTER SET latin1)")
cur.execute("INSERT INTO collatebug (col, nocol) VALUES ('collate', 'nocollate')")
cur.execute("SELECT * FROM collatebug")
res = cur.fetchall()
print res

Discussion

1 2 > >> (Page 1 of 2)
  • Andy Dustman
    Andy Dustman
    2009-03-04

    It's no coincidence that you are using a binary collation and getting binary (I.e. non-unicode) data back.

     
  • But that's just a collation, why would it have an impact on the actual data that's being returned. The character set is latin1 in both cases.

     
  • Andy Dustman
    Andy Dustman
    2009-03-05

    using a binary collation makes it a binary column. see what SHOW CREATE TABLE says about your table schema.

     
  • Here's the SHOW CREATE TABLE result:

    CREATE TABLE collatebug (
    col varchar(20) character set latin1 collate latin1_bin default NULL,
    nocol varchar(20) default NULL
    ) ENGINE=MyISAM DEFAULT CHARSET=latin1

    Here's what MySQL docs say:

    "A non-binary string has a character set and is converted to another character set in many cases, even when the string has a _bin collation."

    http://dev.mysql.com/doc/refman/5.1/en/charset-binary-collations.html

     
  • Andy Dustman
    Andy Dustman
    2009-03-05

    Here's a little test I did:

    mysql> create table col_bin ( col_latin1 varchar(20) character set latin1, col_latin1_bin varchar(20) character set latin1 collate latin1_bin, col_binary varbinary(20));

    Query OK, 0 rows affected (0.01 sec)

    mysql> show create table col_bin;
    ...
    CREATE TABLE col_bin (
    col_latin1 varchar(20) default NULL,
    col_latin1_bin varchar(20) character set latin1 collate latin1_bin default NULL,
    col_binary varbinary(20) default NULL
    ) ENGINE=MyISAM DEFAULT CHARSET=latin1
    ...

    Now, using MySQLdb, first let's test with a connection character set of utf8 as per your example.

    db=MySQLdb.connect(db="test",read_default_file="~/.my.cnf",charset="utf8")
    c=db.cursor()
    c.execute("select * from col_bin")
    1L
    c.description
    (('col_latin1', 253, 5, 60, 60, 0, 1), ('col_latin1_bin', 253, 5, 60, 60, 0, 1), ('col_binary', 253, 5, 20, 20, 0, 1))
    c.description_flags
    (0, 128, 128)
    c.fetchone()
    (u'test1', 'test2', 'test3')

    Note that in description, the field size is 60 for the col_latin1 and col_latin1_bin columns. description_flags tells us that col_latin1_bin and col_binary both have the BINARY flag set. The last field has a length of 20, but we didn't set a character set, which you can't do on VARBINARY. Accordingly, the first value is returned as unicode and the two flagged binary are returned as normal strings. Also note that all three columns have the same column type (253, which is VAR_STRING).

    Why 60 when we said 20? Because we are using utf8, and I believe each latin1 character could potentially take as much as three bytes in a utf8 encoding. Surprise!

    In this next example, we do the same thing but use latin1 for the connection character set.

    db=MySQLdb.connect(db="test",read_default_file="~/.my.cnf",charset="latin1")
    c=db.cursor()
    c.execute("select * from col_bin")
    1L
    c.description
    (('col_latin1', 253, 5, 20, 20, 0, 1), ('col_latin1_bin', 253, 5, 20, 20, 0, 1), ('col_binary', 253, 5, 20, 20, 0, 1))
    c.description_flags
    (0, 128, 128)
    c.fetchone()
    (u'test1', 'test2', 'test3')

    The field sizes are different from the first example. Maybe not a big surprise because 20 is what we expected in the first place. As with the first test, the last two fields are flagged binary and so only the first field is returned as unicode.

    Summary: Using a binary collation is almost the same as using VARBINARY. You get the BINARY flag set, but you do get some variations with the field size because MySQL doesn't do any sort of character set conversion on VARBINARY.

    Can we tell the difference between those last two fields in MySQLdb? Yes and no. See:

    http://dev.mysql.com/doc/refman/5.0/en/c-api-datatypes.html

    MYSQL_FIELD does have a member charsetnr. 63 means it is a binary character set. So we can test this to see if the field should be returned as unicode or str.

    Currently there is no access to this field in MySQLdb-1.2. And even if there were, there is no sane way to fix the type conversion code to handle it. So this is a Won't Fix for MySQLdb-1.2.

    However, the MYSQL_FIELD object is exposed in MySQLdb-1.3 (SVN trunk and eventually MySQLdb-2.0) and the type conversion code has already been rewritten to be much more flexible, so fixing it there is nearly trivial.

     
  • Andy Dustman
    Andy Dustman
    2009-03-05

    Wasn't just nearly trivial, it was trivial. Fixed in changeset 608. Leaving open since it does affect MySQLdb-1.2. Workaround: You'll have to convert that field to unicode manually.

     
  • Thanks, Andy - I got myself to the point of it setting a binary flag on a column with a _bin collation, I should have updated the ticket to save you some work. But once I got to that point I was wondering whether it is in fact not a MySQLdb but a MySQL bug - what is it doing setting BINARY flag.

     
  • Currently there is no access to this field in MySQLdb-1.2. And even if
    there were, there is no sane way to fix the type conversion code to handle
    it. So this is a Won't Fix for MySQLdb-1.2.

    Hm, wouldn't an equivalent of this fix in 1.2 be just replacing

    if (mask & fields[i].flags) {

    with

    if (fields[i].charsetnr == 63) {

    in _mysql_ResultObject_Initialize() in _mysql.c

     
  • Andy Dustman
    Andy Dustman
    2009-03-05

    Short answer: No.

    Long answer: That checks for generic use of field flags. Currently the only use in conversions is BINARY, but it would potential break any custom converter that uses that mechanism. See MySQLdb.converters.conversions to see where that is used. There are also changes made to it in MySQLdb.connections.Connection. It's too big of a change to make in MySQLdb-1.2.

     
1 2 > >> (Page 1 of 2)