I do not want to specify a charset. I'm using
SQLAlchemy, and I would prefer if SQLAlchemy would take
care of encoding and decoding.
In the following example,
'\xc3\xa7'.decode('UTF-8') == u'\xe7'
The following results in correct data going into and
coming out of the database, but the data in the
database itself looks double encoded:
import MySQLdb
connection = MySQLdb.connect(
host="fmapp03", user="user",
passwd='foobar', db="users") # No charset.
cursor = connection.cursor()
cursor.execute("""
INSERT INTO users
VALUES (12345678, 'jjtest1234', 'foo@foo.com',
'pass', %s,
'asdf', 'N/A', 'N/A', 0, NOW(), NOW())
""", ('\xc3\xa7',)) # Pass UTF-8.
cursor.execute("SELECT * FROM users WHERE id = 12345678")
row = cursor.fetchone()
print row
# I get the correct response.
connection.commit()
The following results in correct data going into and
out of the database, but does not result in the data in
the database itself being double encoded:
import MySQLdb
connection = MySQLdb.connect(host="fmapp03",
user="user", passwd='foobar', db="users",
charset='utf8') # Charset.
cursor = connection.cursor()
cursor.execute("""
INSERT INTO users
VALUES (12345678, 'jjtest1234', 'foo@foo.com',
'pass', %s,
'asdf', 'N/A', 'N/A', 0, NOW(), NOW())
""", (u'\xe7',)) # Pass unicode object.
cursor.execute("SELECT * FROM users WHERE id = 12345678")
row = cursor.fetchone()
print row
# I get the correct response.
connection.commit()
It looks like for the version of MySQLdb I'm using,
1.2.1p2, a lot of this stuff has changed. If you don't
let MySQLdb take care of encoding and decoding, it ends
up double encoding things in the database.
By the way, I have not modified my my.cnf. I'm running
mysql Ver 14.12 Distrib 5.0.22, for pc-linux-gnu
(i486) using readline 5.1
on Ubuntu 6.06.
Logged In: YES
user_id=71372
Originator: NO
Can you replicate this with the SVN trunk or 1.2.2b3?
Logged In: YES
user_id=30164
Originator: YES
The behavior is unchanged in 1.2.2b3.
Logged In: YES
user_id=30164
Originator: YES
I'm confused. If I cut and paste a cedilla (UTF-8 '\xc3\xa7') into the MySQL shell, I see that the bit length of the field is 32 bits. If I use a normal "f" instead, it's only 8 bits. The value '\xc3\xa7' should be 16 bits. Is the double encoding happening deep in the C code?
Logged In: YES
user_id=71372
Originator: NO
For the following, I am using Ubuntu Edgy:
mysql> select length('ç');
+--------------+
| length('ç') |
+--------------+
| 2 |
+--------------+
1 row in set (0.05 sec)
OK, so I'm getting the right value back as a UTF-8-encoded string, not a unicode object, even though I passed in unicode, and the character set for the column is utf8.
I think your assumptions from your examples are basically wrong. '\xc3\xa7' is not double-encoded. Both your examples say you get the correct repsonse. You need an example (with schema) that shows the wrong thing happening.
Logged In: YES
user_id=71372
Originator: NO
As an additional clarifcation, the previous example used use_unicode=False for connect() (the default). With use_unicode=True:
And now we have unicode as the result, which is what it should do.
Logged In: YES
user_id=30164
Originator: YES
Thanks for looking into this. What you are saying does not contradict what I am saying. Round trip, the data works out fine. However, using the mysql client, look at what's actually stored in the database. It's 32 bits, when it should be 16 bits. I'll be happy if you can show me where I'm wrong.
Logged In: YES
user_id=71372
Originator: NO
I can't replicate your results. If you look two of my comments back, you'll see my mysql CLI results, which only shows the cedilla as two bytes, which is what you (and I) expect. The only thing I can think of at this point is that your database/table/column is set to use ucs4 or something like that and not utf8. That doesn't conform with your observations on 'f' though, I think. (Yes, the example is not actually selecting from the database, but I also tried it with the inserted data and got the same results.)
Recheck your schema carefully, i.e. look at SHOW CREATE TABLE xxx; this should show any unexpected character sets.
Also keep in mind, MySQLdb always uses a character set. Even if you don't specify one with the charset option, it uses the connection default character set which it gets from the mysql_get_character_set_info() C API call. Provided that your default character set is utf8, the only difference between the two examples is that on the first you are passing an encoded string and on the second you are passing a unicode object, which is internally converted to an encoded string.
The use_unicode option only determines whether or not character values returned from the database are returned as unicode. You can always pass unicode values to execute() whether this is set or not.
Logged In: YES
user_id=30164
Originator: YES
Hmm:
mysql> show create table users \G
****** 1. row ******
Table: users
Create Table: CREATE TABLE
users
(id
bigint(20) unsigned NOT NULL,username
varchar(255) collate utf8_bin NOT NULL,email
varchar(255) collate utf8_bin NOT NULL,password
varchar(64) collate utf8_bin NOT NULL,firstName
varchar(128) collate utf8_bin NOT NULL default 'n/a',lastName
varchar(128) collate utf8_bin NOT NULL default 'n/a',url
varchar(128) collate utf8_bin NOT NULL default 'n/a',homedirUrl
varchar(128) collate utf8_bin NOT NULL default 'n/a',admin
tinyint(1) NOT NULL default '0',creationtime
timestamp NOT NULL default '0000-00-00 00:00:00',modtime
timestamp NOT NULL default CURRENT_TIMESTAMP on update CURRENT_TIMESTAMP,PRIMARY KEY (
id
),UNIQUE KEY
email
(email
),UNIQUE KEY
username
(username
)) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin
1 row in set (0.00 sec)
mysql> select * from users \G
****** 5. row ******
id: 12345679
username: jjtest1235
email: fo1@foo.com
password: pass
firstName: ç
lastName: asdf
url: N/A
homedirUrl: N/A
admin: 0
creationtime: 2007-02-14 16:39:50
modtime: 2007-02-14 16:58:49
5 rows in set (0.00 sec)
mysql> select length(firstName) from users where id = 12345679;
+-------------------+
| length(firstName) |
+-------------------+
| 4 |
+-------------------+
1 row in set (0.08 sec)
Why do you get a length of 2, and I get a length of 4? Oh, wait a second, you're selecting the length of a fixed character instead of inserting it into the database and then selecting the length. Observe!
mysql> select length('ç');
+--------------+
| length('ç') |
+--------------+
| 2 |
+--------------+
1 row in set (0.00 sec)
mysql> insert into users values (9, 'test9', 'test9@9.com', 'asdf', 'ç', 'last', 'N/A', 'N/A', 0, NOW(), NOW());
Query OK, 1 row affected (0.11 sec)
mysql> select length(firstName) from users where id = 9;
+-------------------+
| length(firstName) |
+-------------------+
| 4 |
+-------------------+
1 row in set (0.00 sec)
I'm guessing it's safe to say that MySQLdb isn't to blame, but something is clearly wrong, and I wish I knew what it was! Remember, by the time the data gets back into my Python script, it's correct. Hmm, I wonder if ç is one of those weird Unicode combining characters that can be represented in two different ways.
Thanks for your time.
Logged In: YES
user_id=30164
Originator: YES
I see in the most recent changelog:
Only encode the query (convert to a string) when it is a unicode instance;
re-encoding encoded strings would break things.
This matches what I was encountering. I'm glad to see this got fixed :)