I'm running the latest release of MySQLdb, 1.2.1_p2 on FreeBSD 4.11 with Python 2.4.1 and MySQL 4.1.18.
Show variables gives me 'utf8' for all the character_set variables (client, connection, database, results, server, and system) and my [mysql] default_character_set is also utf8
Now, what I can't figure out is while I can SELECT and read out UTF-8 fine, that I can't seem to UPDATE UTF-8.
I open my connection w/ charset='utf8', use_unicode=True and then call conn.set_character_set('utf8'), and cursor.execute("set character set utf8") and cursor.execute("set names utf8") and conn.charset = "utf8") to no avail. When I preint the conn.character_set_name(), it says 'latin1'
And while I can send a unicode object, it gives me the following:
File "mysql/MySQLdb/cursors.py", line 146, in execute
query = query.encode(charset)
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 44-45: ordinal not in range(256)
When I try to force the charset to UTF8, I get a truncation data:
./04-make-updates.py:52: Warning: Data truncated for column 'name' at row 1
cursor.execute(sql)
It seems pretty obvious that I just need to figure a way to get the connection out of latin 1 and set as utf8, but I'm not understanding why it's not being set properly.
If you are getting a truncation error, that is because utf8 is a multi-byte, variable-length character set, and your data has more bytes than it does characters.
Keep it simple, use connect(use_unicode=True, charset="utf8").
Note that the default my.cnf file is NOT read by default. I like to create a local config file and read that with connect(read_default_file="/path")
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Andy, I really did try every single way to force it, including use_unicode and charset, to no avail. (also, when I tried loading a my.cnf, I was passing a read_default_file on connect):
cursor = conn.cursor()
cursor.execute("set character set 'utf8'")
cursor.execute('set names utf8')
conn.charset = "utf-8"
conn.commit()
print conn.character_set_name()
sys.exit()
</pre>
It seems like something should work, right? I even tried recompiling the _mysql.c file replacing the default latin1 w/ utf8, but still got nothing on the character_set_name(). I saw a post in the MySQL docs on a bug w/ character_set_name() not being fixed until 4.1.19, but that was with mixing up collations...
I'm really stumped. The particular project I was working on was something that I could do in a batch format, so I just rewrote the code to output the UPDATES into a sql file and was able to load it in via the mysqlclient w/o problems.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
connect(charset='utf8') internally does conn.set_character_set('utf8') if the default character set is not the one you specified. Executing SQL to set the character set won't work properly.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
So, by "didn't work," I mean that when I create a connection w/ connect(charset='utf8', use_unicode=T
rue), conn.character_set_name() returns latin1, not utf8 and that as a result it crushes the utf8 that I'm submitting.
Should this work? I've removed all the other calls. If character_set_name() should return utf8 and there aren't any clear steps I can try, I'll just have to follow up later when I have more time to see what's going on.
Thanks for your help, either way, it's definitely appreciated.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
I tried both of those adding charset="utf8" before and after changing dashes in my ~/.my.conf file - still stuck on latin1.
But, your comment on the cnf file made me take a second look... and setting [mysql] to [client] did the trick in changing the default. I still can't change the characterset with db.character_set_name() but if anyone is having the problems I am (it must be specific to either my Python or MySQL build) hopefully this thread can help.
What's strange is in Perl, calling $db->do("SET NAMES 'utf8'") works without a hitch.
Thanks again for your assistance in chasing this down.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
I'm running the latest release of MySQLdb, 1.2.1_p2 on FreeBSD 4.11 with Python 2.4.1 and MySQL 4.1.18.
Show variables gives me 'utf8' for all the character_set variables (client, connection, database, results, server, and system) and my [mysql] default_character_set is also utf8
Now, what I can't figure out is while I can SELECT and read out UTF-8 fine, that I can't seem to UPDATE UTF-8.
I open my connection w/ charset='utf8', use_unicode=True and then call conn.set_character_set('utf8'), and cursor.execute("set character set utf8") and cursor.execute("set names utf8") and conn.charset = "utf8") to no avail. When I preint the conn.character_set_name(), it says 'latin1'
And while I can send a unicode object, it gives me the following:
File "mysql/MySQLdb/cursors.py", line 146, in execute
query = query.encode(charset)
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 44-45: ordinal not in range(256)
When I try to force the charset to UTF8, I get a truncation data:
./04-make-updates.py:52: Warning: Data truncated for column 'name' at row 1
cursor.execute(sql)
It seems pretty obvious that I just need to figure a way to get the connection out of latin 1 and set as utf8, but I'm not understanding why it's not being set properly.
This problem seems to be similar to this person's: http://woss.name/2005/11/04/dire-not-understanding-of-unicode/
You missed this comment:
http://woss.name/2005/11/04/dire-not-understanding-of-unicode/#comment-157
If you are getting a truncation error, that is because utf8 is a multi-byte, variable-length character set, and your data has more bytes than it does characters.
Keep it simple, use connect(use_unicode=True, charset="utf8").
Note that the default my.cnf file is NOT read by default. I like to create a local config file and read that with connect(read_default_file="/path")
Andy, I really did try every single way to force it, including use_unicode and charset, to no avail. (also, when I tried loading a my.cnf, I was passing a read_default_file on connect):
<pre>
UTF8 writing cursor
conn = MySQLdb.connect(..., charset='utf8', use_unicode=True)
conn.set_character_set('utf8')
cursor = conn.cursor()
cursor.execute("set character set 'utf8'")
cursor.execute('set names utf8')
conn.charset = "utf-8"
conn.commit()
print conn.character_set_name()
sys.exit()
</pre>
It seems like something should work, right? I even tried recompiling the _mysql.c file replacing the default latin1 w/ utf8, but still got nothing on the character_set_name(). I saw a post in the MySQL docs on a bug w/ character_set_name() not being fixed until 4.1.19, but that was with mixing up collations...
I'm really stumped. The particular project I was working on was something that I could do in a batch format, so I just rewrote the code to output the UPDATES into a sql file and was able to load it in via the mysqlclient w/o problems.
connect(charset='utf8') internally does conn.set_character_set('utf8') if the default character set is not the one you specified. Executing SQL to set the character set won't work properly.
Worth mentioning that I started off with connect(use_unicode=True, charset="utf8") and that didn't work...
Define "didn't work". The truncation error actually indicates that it did work, i.e. the encoded string became too long.
So, by "didn't work," I mean that when I create a connection w/ connect(charset='utf8', use_unicode=T
rue), conn.character_set_name() returns latin1, not utf8 and that as a result it crushes the utf8 that I'm submitting.
Should this work? I've removed all the other calls. If character_set_name() should return utf8 and there aren't any clear steps I can try, I'll just have to follow up later when I have more time to see what's going on.
Thanks for your help, either way, it's definitely appreciated.
Try this in your command-line client:
show character set like '%'; # or
show character set like 'utf%';
and make sure utf8 is actually built into your server.
Example with default character set of utf8:
Python 2.4.3 (#1, May 8 2006, 15:15:48)
[GCC 3.4.6 (Gentoo 3.4.6-r1, ssp-3.4.5-1.0, pie-8.7.9)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import MySQLdb
>>> db=MySQLdb.connect(use_unicode=True,read_default_file="~/.my.cnf")
>>> db.character_set_name()
'utf8'
>>> db=MySQLdb.connect(use_unicode=True,charset="latin1",read_default_file="~/.my.cnf")
>>> db.character_set_name() 'latin1'
>>> db.set_character_set('utf8')
>>> db.character_set_name()
'utf8'
>>> db.get_server_info()
'5.0.21-debug-log'
Example with default character set of latin1 (in .my.cnf):
Python 2.4.3 (#1, May 8 2006, 15:15:48)
[GCC 3.4.6 (Gentoo 3.4.6-r1, ssp-3.4.5-1.0, pie-8.7.9)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import MySQLdb
>>> db=MySQLdb.connect(use_unicode=True,read_default_file="~/.my.cnf")
>>> db.character_set_name()
'latin1'
>>> db=MySQLdb.connect(use_unicode=True,charset="utf8",read_default_file="~/.my.cnf")
>>> db.character_set_name()
'utf8'
>>> db.set_character_set('latin1')
>>> db.character_set_name()
'latin1'
>>>
show character set like 'utf%' gives me a listing for utf8
Here's my output:
<pre>
$ cat ~/.my.cnf
[mysql]
default_character_set = utf8
$ python
Python 2.4.1 (#2, Aug 12 2005, 18:39:37)
[GCC 2.95.3 20010315 (release)] on freebsd4
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.path.append('mysql')
>>> import MySQLdb
>>> db=MySQLdb.connect(use_unicode=True,read_default_file="~/.my.cnf")
>>> db.character_set_name()
'latin1'
>>> db=MySQLdb.connect(use_unicode=True,charset="latin1",read_default_file="~/.my.cnf")
>>> db.character_set_name()
'latin1'
>>> db.set_character_set('utf8')
>>> db.character_set_name()
'latin1'
>>> db.get_server_info()
'4.1.18-Yahoo-SMP-log'
>>>
</pre>
Could this be a MySQL bug? Is there something wrong w/ my .my.cnf?
Isn't it supposed to be (in .my.cnf):
default-character-set = utf8
(i.e. dashes not underscores)
?
Also, can you try:
db=MySQLdb.connect(use_unicode=True,charset="utf8",read_default_file="~/.my.cnf")
I tried both of those adding charset="utf8" before and after changing dashes in my ~/.my.conf file - still stuck on latin1.
But, your comment on the cnf file made me take a second look... and setting [mysql] to [client] did the trick in changing the default. I still can't change the characterset with db.character_set_name() but if anyone is having the problems I am (it must be specific to either my Python or MySQL build) hopefully this thread can help.
What's strange is in Perl, calling $db->do("SET NAMES 'utf8'") works without a hitch.
Thanks again for your assistance in chasing this down.