I'm trying to use UTF-8 with MySQL 4.1. I have a pretty good understanding of how encodings work in Python, but cannot get my code working using MySQLdb.
I have narrowed this down to a fairly minimal test case which doesn't work. My database table looks like:
CREATE TABLE documents ( document_id int(11) NOT NULL auto_increment, filename text CHARACTER SET utf8, authors text CHARACTER SET utf8,
PRIMARY KEY (document_id)
) ENGINE=MyISAM DEFAULT CHARSET utf8
value = u"\uc481"
cursor.execute(u"insert into documents(authors) values(%s)", value)
Running this gives me:
brett$ python encoding-hack.py
latin1
Traceback (most recent call last):
File "encoding-hack.py", line 22, in ?
cursor.execute(u"insert into documents(authors) values(%s)", value)
File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/site-packages/MySQLdb/cursors.py", line 137, in execute
self.errorhandler(self, exc, value)
File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/site-packages/MySQLdb/connections.py", line 33, in defaulterrorhandler
raise errorclass, errorvalue
UnicodeEncodeError: 'latin-1' codec can't encode character u'\uc481' in position 0: ordinal not in range(256)
Note:
(1) The encoding being printed out is "latin1", even though the connection encoding is utf8 (set by the MySQL set names statement).
I can do the same thing interactively at the MySQL command line, and it works with no problems.
(2) MySQLdb is crashing when it tries to encode my unicode string using the latin-1 codec; surely it should be using the UTF-8 codec?
I'm using
MySQL 4.1.18
Python 2.4.2
MySQL-Python 1.2.0
on Mac OS X 10.4.4
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
db.charset is set when the connection is created. Choosing another character set with a subsequent SQL statement does not change this. There are three ways to fix this.
1) Set the character set properly when connecting. This can only currently be done with a configuration file and the read_default_file option.
2) Set the correct default character set in the server configuration.
3) This is untested, but I suppose you could try to set db.charset to 'utf8' after you execute your SET NAMES UTF8 statement. I think that ought to work though it's not real pretty.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
I had tried all of these, but I have tried them again just to make sure.
*** (1) Updating my.cnf and restarting the server.
I placed the default-character-set everywhere it could possibly go...
/etc/my.cnf
[client]
default-character-set=utf8
The MySQL server
[mysqld]
default-character-set=utf8
[mysql]
default-character-set=utf8
*** Check that mysql reads this file:
brett$ mysql
felix:~/Documents brett$ mysql -uroot -p
Enter password:
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 2 to server version: 4.1.18-standard
Type 'help;' or '\h' for help. Type '\c' to clear the buffer.
*** Running my script gives the same 'latin-1' codec error
brett$ python encoding-hack.py
latin1
Traceback (most recent call last):
File "encoding-hack.py", line 22, in ?
cursor.execute(u"insert into documents(authors) values(%s)", value)
File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/site-packages/MySQLdb/cursors.py", line 137, in execute
self.errorhandler(self, exc, value)
File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/site-packages/MySQLdb/connections.py", line 33, in defaulterrorhandler
raise errorclass, errorvalue
UnicodeEncodeError: 'latin-1' codec can't encode character u'\uc481' in position 0: ordinal not in range(256)
*** (2) Adding db.charset = 'utf8' -- now gives an 'ascii' codec error
(Note that the name of the encoding in Python is actually 'utf-8', but using this instead doesn't make any difference)
felix:~/Thesis/ExtractionCode brett$ python encoding-hack.py utf8Traceback (most recent call last):
File "encoding-hack.py", line 22, in ?
cursor.execute(u"insert into documents(authors) values(%s)", value)
File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/site-packages/MySQLdb/cursors.py", line 137, in execute
self.errorhandler(self, exc, value)
File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/site-packages/MySQLdb/connections.py", line 33, in defaulterrorhandler
raise errorclass, errorvalue
UnicodeDecodeError: 'ascii' codec can't decode byte 0xec in position 1: ordinal not in range(128)
*** (3) Adding read_default_file option, using the same configuration file as above
Changing the connect to:
db = MySQLdb.connect(host="localhost", user="user", passwd='pass', db='mydb',
use_unicode=True, read_default_file='/etc/my.cnf')
Gives:
brett$ python encoding-hack.py
utf8
Traceback (most recent call last):
File "encoding-hack.py", line 22, in ?
cursor.execute(u"insert into documents(authors) values(%s)", value)
File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/site-packages/MySQLdb/cursors.py", line 137, in execute
self.errorhandler(self, exc, value)
File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/site-packages/MySQLdb/connections.py", line 33, in defaulterrorhandler
raise errorclass, errorvalue
UnicodeDecodeError: 'ascii' codec can't decode byte 0xec in position 1: ordinal not in range(128)
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
cursor.execute(u"insert into documents(authors) values(%s)", (value,))
Same result:
File "encoding-hack.py", line 21, in ?
cursor.execute(u"insert into documents(authors) values(%s)", (value,))
File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/site-packages/MySQLdb/cursors.py", line 137, in execute
self.errorhandler(self, exc, value)
File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/site-packages/MySQLdb/connections.py", line 33, in defaulterrorhandler
raise errorclass, errorvalue
UnicodeDecodeError: 'ascii' codec can't decode byte 0xec in position 1: ordinal not in range(128)
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
I have a unit test that inserts a wide range of unicode data, i.e.
''.join([ unichr(row134+i13) for i in range(255) ])
where row varies from 0 to 100 and the column is defined as CHAR(255) CHARACTER SET UTF8 (also tried with TEXT) and I cannot replicate your error (also using MySQL-4.1). Add a bug to the tracker and include a test case that reproduces the problem. Remove any of your charset hackery (just use use_unicode=True) and print the connection's charset. Include a minimal table schema. Set default-character-set in a configuration file and reference it with read_default_file on connect(). Include a literal unicode string that fails.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
I'm trying to use UTF-8 with MySQL 4.1. I have a pretty good understanding of how encodings work in Python, but cannot get my code working using MySQLdb.
I have narrowed this down to a fairly minimal test case which doesn't work. My database table looks like:
CREATE TABLE
documents
(document_id
int(11) NOT NULL auto_increment,filename
text CHARACTER SET utf8,authors
text CHARACTER SET utf8,PRIMARY KEY (
document_id
)) ENGINE=MyISAM DEFAULT CHARSET utf8
and my code:
import MySQLdb
db = MySQLdb.connect(host="localhost", user="user", passwd='pass', db='mydb', use_unicode=True)
cursor = db.cursor()
cursor.execute("set names utf8")
print db.charset
value = u"\uc481"
cursor.execute(u"insert into documents(authors) values(%s)", value)
Running this gives me:
brett$ python encoding-hack.py
latin1
Traceback (most recent call last):
File "encoding-hack.py", line 22, in ?
cursor.execute(u"insert into documents(authors) values(%s)", value)
File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/site-packages/MySQLdb/cursors.py", line 137, in execute
self.errorhandler(self, exc, value)
File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/site-packages/MySQLdb/connections.py", line 33, in defaulterrorhandler
raise errorclass, errorvalue
UnicodeEncodeError: 'latin-1' codec can't encode character u'\uc481' in position 0: ordinal not in range(256)
Note:
(1) The encoding being printed out is "latin1", even though the connection encoding is utf8 (set by the MySQL set names statement).
I can do the same thing interactively at the MySQL command line, and it works with no problems.
(2) MySQLdb is crashing when it tries to encode my unicode string using the latin-1 codec; surely it should be using the UTF-8 codec?
I'm using
MySQL 4.1.18
Python 2.4.2
MySQL-Python 1.2.0
on Mac OS X 10.4.4
For anyone else who has run into this. The fix is not to use a unicode string as the query string. So, this will work:
cursor.execute("insert into documents(authors) values(%s)", value)
but this won't:
cursor.execute(u"insert into documents(authors) values(%s)", value)
For details, check out the bug report below.
http://sourceforge.net/tracker/index.php?func=detail&aid=1438913&group_id=22307&atid=374932
db.charset is set when the connection is created. Choosing another character set with a subsequent SQL statement does not change this. There are three ways to fix this.
1) Set the character set properly when connecting. This can only currently be done with a configuration file and the read_default_file option.
2) Set the correct default character set in the server configuration.
3) This is untested, but I suppose you could try to set db.charset to 'utf8' after you execute your SET NAMES UTF8 statement. I think that ought to work though it's not real pretty.
Hi,
Thanks for your reply.
I had tried all of these, but I have tried them again just to make sure.
*** (1) Updating my.cnf and restarting the server.
I placed the default-character-set everywhere it could possibly go...
/etc/my.cnf
[client]
default-character-set=utf8
The MySQL server
[mysqld]
default-character-set=utf8
[mysql]
default-character-set=utf8
*** Check that mysql reads this file:
brett$ mysql
felix:~/Documents brett$ mysql -uroot -p
Enter password:
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 2 to server version: 4.1.18-standard
Type 'help;' or '\h' for help. Type '\c' to clear the buffer.
mysql> show variables like 'char%';
+--------------------------+----------------------------------------------------------------------------------+
| Variable_name | Value |
+--------------------------+----------------------------------------------------------------------------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | utf8 |
| character_set_results | utf8 |
| character_set_server | utf8 |
| character_set_system | utf8 |
| character_sets_dir | /usr/local/mysql-standard-4.1.18-apple-darwin8.2.0-powerpc/share/mysql/charsets/ |
+--------------------------+----------------------------------------------------------------------------------+
7 rows in set (0.00 sec)
mysql>
*** Running my script gives the same 'latin-1' codec error
brett$ python encoding-hack.py
latin1
Traceback (most recent call last):
File "encoding-hack.py", line 22, in ?
cursor.execute(u"insert into documents(authors) values(%s)", value)
File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/site-packages/MySQLdb/cursors.py", line 137, in execute
self.errorhandler(self, exc, value)
File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/site-packages/MySQLdb/connections.py", line 33, in defaulterrorhandler
raise errorclass, errorvalue
UnicodeEncodeError: 'latin-1' codec can't encode character u'\uc481' in position 0: ordinal not in range(256)
*** (2) Adding db.charset = 'utf8' -- now gives an 'ascii' codec error
(Note that the name of the encoding in Python is actually 'utf-8', but using this instead doesn't make any difference)
felix:~/Thesis/ExtractionCode brett$ python encoding-hack.py utf8Traceback (most recent call last):
File "encoding-hack.py", line 22, in ?
cursor.execute(u"insert into documents(authors) values(%s)", value)
File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/site-packages/MySQLdb/cursors.py", line 137, in execute
self.errorhandler(self, exc, value)
File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/site-packages/MySQLdb/connections.py", line 33, in defaulterrorhandler
raise errorclass, errorvalue
UnicodeDecodeError: 'ascii' codec can't decode byte 0xec in position 1: ordinal not in range(128)
*** (3) Adding read_default_file option, using the same configuration file as above
Changing the connect to:
db = MySQLdb.connect(host="localhost", user="user", passwd='pass', db='mydb',
use_unicode=True, read_default_file='/etc/my.cnf')
Gives:
brett$ python encoding-hack.py
utf8
Traceback (most recent call last):
File "encoding-hack.py", line 22, in ?
cursor.execute(u"insert into documents(authors) values(%s)", value)
File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/site-packages/MySQLdb/cursors.py", line 137, in execute
self.errorhandler(self, exc, value)
File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/site-packages/MySQLdb/connections.py", line 33, in defaulterrorhandler
raise errorclass, errorvalue
UnicodeDecodeError: 'ascii' codec can't decode byte 0xec in position 1: ordinal not in range(128)
Try (value,) instead of value. execute takes a tuple of values for the second parameter.
OK, I changed this line:
cursor.execute(u"insert into documents(authors) values(%s)", (value))
and I still get:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xec in position 1: ordinal not in range(128)
Interestingly, 'forcing' utf8 (by adding the configuration file, defaults file, etc) gives an 'ascii' codec error rather than a 'latin1' error.
That's not what I told you to do. (value,) not (value)
Hi,
Tried that too (I actually tried both before):
cursor.execute(u"insert into documents(authors) values(%s)", (value,))
Same result:
File "encoding-hack.py", line 21, in ?
cursor.execute(u"insert into documents(authors) values(%s)", (value,))
File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/site-packages/MySQLdb/cursors.py", line 137, in execute
self.errorhandler(self, exc, value)
File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/site-packages/MySQLdb/connections.py", line 33, in defaulterrorhandler
raise errorclass, errorvalue
UnicodeDecodeError: 'ascii' codec can't decode byte 0xec in position 1: ordinal not in range(128)
I have a unit test that inserts a wide range of unicode data, i.e.
''.join([ unichr(row134+i13) for i in range(255) ])
where row varies from 0 to 100 and the column is defined as CHAR(255) CHARACTER SET UTF8 (also tried with TEXT) and I cannot replicate your error (also using MySQL-4.1). Add a bug to the tracker and include a test case that reproduces the problem. Remove any of your charset hackery (just use use_unicode=True) and print the connection's charset. Include a minimal table schema. Set default-character-set in a configuration file and reference it with read_default_file on connect(). Include a literal unicode string that fails.
and another thing....
I did try changing this line in connections.py
to
but that doesn't work either (ascii codec error).