Menu

#218 data in database is double encoded if no charset

MySQLdb-1.2
closed
MySQLdb (285)
5
2014-07-29
2006-11-08
No

I do not want to specify a charset. I'm using
SQLAlchemy, and I would prefer if SQLAlchemy would take
care of encoding and decoding.

In the following example,

'\xc3\xa7'.decode('UTF-8') == u'\xe7'

The following results in correct data going into and
coming out of the database, but the data in the
database itself looks double encoded:

import MySQLdb

connection = MySQLdb.connect(
host="fmapp03", user="user",
passwd='foobar', db="users") # No charset.
cursor = connection.cursor()
cursor.execute("""
INSERT INTO users
VALUES (12345678, 'jjtest1234', 'foo@foo.com',
'pass', %s,
'asdf', 'N/A', 'N/A', 0, NOW(), NOW())
""", ('\xc3\xa7',)) # Pass UTF-8.
cursor.execute("SELECT * FROM users WHERE id = 12345678")
row = cursor.fetchone()
print row # I get the correct response.
connection.commit()

The following results in correct data going into and
out of the database, but does not result in the data in
the database itself being double encoded:

import MySQLdb

connection = MySQLdb.connect(host="fmapp03",
user="user", passwd='foobar', db="users",
charset='utf8') # Charset.
cursor = connection.cursor()
cursor.execute("""
INSERT INTO users
VALUES (12345678, 'jjtest1234', 'foo@foo.com',
'pass', %s,
'asdf', 'N/A', 'N/A', 0, NOW(), NOW())
""", (u'\xe7',)) # Pass unicode object.
cursor.execute("SELECT * FROM users WHERE id = 12345678")
row = cursor.fetchone()
print row # I get the correct response.
connection.commit()

It looks like for the version of MySQLdb I'm using,
1.2.1p2, a lot of this stuff has changed. If you don't
let MySQLdb take care of encoding and decoding, it ends
up double encoding things in the database.

By the way, I have not modified my my.cnf. I'm running

mysql Ver 14.12 Distrib 5.0.22, for pc-linux-gnu
(i486) using readline 5.1

on Ubuntu 6.06.

Discussion

  • Andy Dustman

    Andy Dustman - 2007-02-10

    Logged In: YES
    user_id=71372
    Originator: NO

    Can you replicate this with the SVN trunk or 1.2.2b3?

     
  • Shannon -jj Behrens

    Logged In: YES
    user_id=30164
    Originator: YES

    The behavior is unchanged in 1.2.2b3.

     
  • Shannon -jj Behrens

    Logged In: YES
    user_id=30164
    Originator: YES

    I'm confused. If I cut and paste a cedilla (UTF-8 '\xc3\xa7') into the MySQL shell, I see that the bit length of the field is 32 bits. If I use a normal "f" instead, it's only 8 bits. The value '\xc3\xa7' should be 16 bits. Is the double encoding happening deep in the C code?

     
  • Andy Dustman

    Andy Dustman - 2007-02-15

    Logged In: YES
    user_id=71372
    Originator: NO

    For the following, I am using Ubuntu Edgy:

    mysql> select length('ç');
    +--------------+
    | length('ç') |
    +--------------+
    | 2 |
    +--------------+
    1 row in set (0.05 sec)

    c='ç'
    c
    '\xc3\xa7'
    cu=u'ç'
    cu
    u'\xe7'
    cu.encode('utf8')
    '\xc3\xa7'

    c=db.cursor()
    c.execute("create table unitest (sometext varchar(16))")
    0L
    c.execute("insert into unitest (sometext) values (%s)", (cu,))
    1L
    c.execute("select * from unitest")
    1L
    c.fetchall()
    (('\xc3\xa7',),)

    OK, so I'm getting the right value back as a UTF-8-encoded string, not a unicode object, even though I passed in unicode, and the character set for the column is utf8.

    I think your assumptions from your examples are basically wrong. '\xc3\xa7' is not double-encoded. Both your examples say you get the correct repsonse. You need an example (with schema) that shows the wrong thing happening.

     
  • Andy Dustman

    Andy Dustman - 2007-02-15

    Logged In: YES
    user_id=71372
    Originator: NO

    As an additional clarifcation, the previous example used use_unicode=False for connect() (the default). With use_unicode=True:

    db=MySQLdb.connect(read_default_file="~/.my.cnf", db="test", use_unicode=True)
    c=db.cursor()
    c.execute("select * from unitest")
    1L
    c.fetchall()
    ((u'\xe7',),)

    And now we have unicode as the result, which is what it should do.

     
  • Shannon -jj Behrens

    Logged In: YES
    user_id=30164
    Originator: YES

    Thanks for looking into this. What you are saying does not contradict what I am saying. Round trip, the data works out fine. However, using the mysql client, look at what's actually stored in the database. It's 32 bits, when it should be 16 bits. I'll be happy if you can show me where I'm wrong.

     
  • Andy Dustman

    Andy Dustman - 2007-02-16

    Logged In: YES
    user_id=71372
    Originator: NO

    I can't replicate your results. If you look two of my comments back, you'll see my mysql CLI results, which only shows the cedilla as two bytes, which is what you (and I) expect. The only thing I can think of at this point is that your database/table/column is set to use ucs4 or something like that and not utf8. That doesn't conform with your observations on 'f' though, I think. (Yes, the example is not actually selecting from the database, but I also tried it with the inserted data and got the same results.)

    Recheck your schema carefully, i.e. look at SHOW CREATE TABLE xxx; this should show any unexpected character sets.

    Also keep in mind, MySQLdb always uses a character set. Even if you don't specify one with the charset option, it uses the connection default character set which it gets from the mysql_get_character_set_info() C API call. Provided that your default character set is utf8, the only difference between the two examples is that on the first you are passing an encoded string and on the second you are passing a unicode object, which is internally converted to an encoded string.

    The use_unicode option only determines whether or not character values returned from the database are returned as unicode. You can always pass unicode values to execute() whether this is set or not.

     
  • Shannon -jj Behrens

    Logged In: YES
    user_id=30164
    Originator: YES

    Hmm:

    mysql> show create table users \G
    ****** 1. row ******
    Table: users
    Create Table: CREATE TABLE users (
    id bigint(20) unsigned NOT NULL,
    username varchar(255) collate utf8_bin NOT NULL,
    email varchar(255) collate utf8_bin NOT NULL,
    password varchar(64) collate utf8_bin NOT NULL,
    firstName varchar(128) collate utf8_bin NOT NULL default 'n/a',
    lastName varchar(128) collate utf8_bin NOT NULL default 'n/a',
    url varchar(128) collate utf8_bin NOT NULL default 'n/a',
    homedirUrl varchar(128) collate utf8_bin NOT NULL default 'n/a',
    admin tinyint(1) NOT NULL default '0',
    creationtime timestamp NOT NULL default '0000-00-00 00:00:00',
    modtime timestamp NOT NULL default CURRENT_TIMESTAMP on update CURRENT_TIMESTAMP,
    PRIMARY KEY (id),
    UNIQUE KEY email (email),
    UNIQUE KEY username (username)
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin
    1 row in set (0.00 sec)

    mysql> select * from users \G
    ****** 5. row ******
    id: 12345679
    username: jjtest1235
    email: fo1@foo.com
    password: pass
    firstName: ç
    lastName: asdf
    url: N/A
    homedirUrl: N/A
    admin: 0
    creationtime: 2007-02-14 16:39:50
    modtime: 2007-02-14 16:58:49
    5 rows in set (0.00 sec)

    mysql> select length(firstName) from users where id = 12345679;
    +-------------------+
    | length(firstName) |
    +-------------------+
    | 4 |
    +-------------------+
    1 row in set (0.08 sec)

    Why do you get a length of 2, and I get a length of 4? Oh, wait a second, you're selecting the length of a fixed character instead of inserting it into the database and then selecting the length. Observe!

    mysql> select length('ç');
    +--------------+
    | length('ç') |
    +--------------+
    | 2 |
    +--------------+
    1 row in set (0.00 sec)

    mysql> insert into users values (9, 'test9', 'test9@9.com', 'asdf', 'ç', 'last', 'N/A', 'N/A', 0, NOW(), NOW());
    Query OK, 1 row affected (0.11 sec)
    mysql> select length(firstName) from users where id = 9;
    +-------------------+
    | length(firstName) |
    +-------------------+
    | 4 |
    +-------------------+
    1 row in set (0.00 sec)

    I'm guessing it's safe to say that MySQLdb isn't to blame, but something is clearly wrong, and I wish I knew what it was! Remember, by the time the data gets back into my Python script, it's correct. Hmm, I wonder if ç is one of those weird Unicode combining characters that can be represented in two different ways.

    Thanks for your time.

     
  • Shannon -jj Behrens

    Logged In: YES
    user_id=30164
    Originator: YES

    I see in the most recent changelog:

    Only encode the query (convert to a string) when it is a unicode instance;
    re-encoding encoded strings would break things.

    This matches what I was encountering. I'm glad to see this got fixed :)

     

Log in to post a comment.