MySQL for Python / Bugs / #218 data in database is double encoded if no charset

Andy Dustman - 2007-02-10

Logged In: YES
user_id=71372
Originator: NO

Can you replicate this with the SVN trunk or 1.2.2b3?

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Shannon -jj Behrens - 2007-02-15

Logged In: YES
user_id=30164
Originator: YES

The behavior is unchanged in 1.2.2b3.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Shannon -jj Behrens - 2007-02-15

Logged In: YES
user_id=30164
Originator: YES

I'm confused. If I cut and paste a cedilla (UTF-8 '\xc3\xa7') into the MySQL shell, I see that the bit length of the field is 32 bits. If I use a normal "f" instead, it's only 8 bits. The value '\xc3\xa7' should be 16 bits. Is the double encoding happening deep in the C code?

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Andy Dustman - 2007-02-15

Logged In: YES
user_id=71372
Originator: NO

For the following, I am using Ubuntu Edgy:

mysql> select length('ç');
+--------------+
| length('ç') |
+--------------+
| 2 |
+--------------+
1 row in set (0.05 sec)

c='ç'
c
'\xc3\xa7'
cu=u'ç'
cu
u'\xe7'
cu.encode('utf8')
'\xc3\xa7'

c=db.cursor()
c.execute("create table unitest (sometext varchar(16))")
0L
c.execute("insert into unitest (sometext) values (%s)", (cu,))
1L
c.execute("select * from unitest")
1L
c.fetchall()
(('\xc3\xa7',),)

OK, so I'm getting the right value back as a UTF-8-encoded string, not a unicode object, even though I passed in unicode, and the character set for the column is utf8.

I think your assumptions from your examples are basically wrong. '\xc3\xa7' is not double-encoded. Both your examples say you get the correct repsonse. You need an example (with schema) that shows the wrong thing happening.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Andy Dustman - 2007-02-15

Logged In: YES
user_id=71372
Originator: NO

As an additional clarifcation, the previous example used use_unicode=False for connect() (the default). With use_unicode=True:

db=MySQLdb.connect(read_default_file="~/.my.cnf", db="test", use_unicode=True)
c=db.cursor()
c.execute("select * from unitest")
1L
c.fetchall()
((u'\xe7',),)

And now we have unicode as the result, which is what it should do.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Shannon -jj Behrens - 2007-02-16

Logged In: YES
user_id=30164
Originator: YES

Thanks for looking into this. What you are saying does not contradict what I am saying. Round trip, the data works out fine. However, using the mysql client, look at what's actually stored in the database. It's 32 bits, when it should be 16 bits. I'll be happy if you can show me where I'm wrong.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Andy Dustman - 2007-02-16

Logged In: YES
user_id=71372
Originator: NO

I can't replicate your results. If you look two of my comments back, you'll see my mysql CLI results, which only shows the cedilla as two bytes, which is what you (and I) expect. The only thing I can think of at this point is that your database/table/column is set to use ucs4 or something like that and not utf8. That doesn't conform with your observations on 'f' though, I think. (Yes, the example is not actually selecting from the database, but I also tried it with the inserted data and got the same results.)

Recheck your schema carefully, i.e. look at SHOW CREATE TABLE xxx; this should show any unexpected character sets.

Also keep in mind, MySQLdb always uses a character set. Even if you don't specify one with the charset option, it uses the connection default character set which it gets from the mysql_get_character_set_info() C API call. Provided that your default character set is utf8, the only difference between the two examples is that on the first you are passing an encoded string and on the second you are passing a unicode object, which is internally converted to an encoded string.

The use_unicode option only determines whether or not character values returned from the database are returned as unicode. You can always pass unicode values to execute() whether this is set or not.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Shannon -jj Behrens - 2007-02-16

Logged In: YES
user_id=30164
Originator: YES

Hmm:

mysql> show create table users \G
****** 1. row ******
Table: users
Create Table: CREATE TABLE users (
id bigint(20) unsigned NOT NULL,
username varchar(255) collate utf8_bin NOT NULL,
email varchar(255) collate utf8_bin NOT NULL,
password varchar(64) collate utf8_bin NOT NULL,
firstName varchar(128) collate utf8_bin NOT NULL default 'n/a',
lastName varchar(128) collate utf8_bin NOT NULL default 'n/a',
url varchar(128) collate utf8_bin NOT NULL default 'n/a',
homedirUrl varchar(128) collate utf8_bin NOT NULL default 'n/a',
admin tinyint(1) NOT NULL default '0',
creationtime timestamp NOT NULL default '0000-00-00 00:00:00',
modtime timestamp NOT NULL default CURRENT_TIMESTAMP on update CURRENT_TIMESTAMP,
PRIMARY KEY (id),
UNIQUE KEY email (email),
UNIQUE KEY username (username)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin
1 row in set (0.00 sec)

mysql> select * from users \G
****** 5. row ******
id: 12345679
username: jjtest1235
email: fo1@foo.com
password: pass
firstName: ç
lastName: asdf
url: N/A
homedirUrl: N/A
admin: 0
creationtime: 2007-02-14 16:39:50
modtime: 2007-02-14 16:58:49
5 rows in set (0.00 sec)

mysql> select length(firstName) from users where id = 12345679;
+-------------------+
| length(firstName) |
+-------------------+
| 4 |
+-------------------+
1 row in set (0.08 sec)

Why do you get a length of 2, and I get a length of 4? Oh, wait a second, you're selecting the length of a fixed character instead of inserting it into the database and then selecting the length. Observe!

mysql> select length('ç');
+--------------+
| length('ç') |
+--------------+
| 2 |
+--------------+
1 row in set (0.00 sec)

mysql> insert into users values (9, 'test9', 'test9@9.com', 'asdf', 'ç', 'last', 'N/A', 'N/A', 0, NOW(), NOW());
Query OK, 1 row affected (0.11 sec)
mysql> select length(firstName) from users where id = 9;
+-------------------+
| length(firstName) |
+-------------------+
| 4 |
+-------------------+
1 row in set (0.00 sec)

I'm guessing it's safe to say that MySQLdb isn't to blame, but something is clearly wrong, and I wish I knew what it was! Remember, by the time the data gets back into my Python script, it's correct. Hmm, I wonder if ç is one of those weird Unicode combining characters that can be represented in two different ways.

Thanks for your time.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Shannon -jj Behrens - 2007-04-10

Logged In: YES
user_id=30164
Originator: YES

I see in the most recent changelog:

Only encode the query (convert to a string) when it is a unicode instance;
re-encoding encoded strings would break things.

This matches what I was encountering. I'm glad to see this got fixed :)

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

data in database is double encoded if no charset

MySQL database connector for Python programming

Group

Searches

Help

#218 data in database is double encoded if no charset

Discussion