Menu

Problems with UTF-8 and MySQL 4.1

Help
2006-02-24
2012-09-19
  • Brett Powley

    Brett Powley - 2006-02-24

    I'm trying to use UTF-8 with MySQL 4.1. I have a pretty good understanding of how encodings work in Python, but cannot get my code working using MySQLdb.

    I have narrowed this down to a fairly minimal test case which doesn't work. My database table looks like:

    CREATE TABLE documents (
    document_id int(11) NOT NULL auto_increment,
    filename text CHARACTER SET utf8,
    authors text CHARACTER SET utf8,
    PRIMARY KEY (document_id)
    ) ENGINE=MyISAM DEFAULT CHARSET utf8

    and my code:

    import MySQLdb

    db = MySQLdb.connect(host="localhost", user="user", passwd='pass', db='mydb', use_unicode=True)
    cursor = db.cursor()
    cursor.execute("set names utf8")

    print db.charset

    value = u"\uc481"
    cursor.execute(u"insert into documents(authors) values(%s)", value)

    Running this gives me:

    brett$ python encoding-hack.py
    latin1
    Traceback (most recent call last):
    File "encoding-hack.py", line 22, in ?
    cursor.execute(u"insert into documents(authors) values(%s)", value)
    File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/site-packages/MySQLdb/cursors.py", line 137, in execute
    self.errorhandler(self, exc, value)
    File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/site-packages/MySQLdb/connections.py", line 33, in defaulterrorhandler
    raise errorclass, errorvalue
    UnicodeEncodeError: 'latin-1' codec can't encode character u'\uc481' in position 0: ordinal not in range(256)

    Note:
    (1) The encoding being printed out is "latin1", even though the connection encoding is utf8 (set by the MySQL set names statement).
    I can do the same thing interactively at the MySQL command line, and it works with no problems.

    (2) MySQLdb is crashing when it tries to encode my unicode string using the latin-1 codec; surely it should be using the UTF-8 codec?

    I'm using
    MySQL 4.1.18
    Python 2.4.2
    MySQL-Python 1.2.0
    on Mac OS X 10.4.4

     
    • Brett Powley

      Brett Powley - 2006-02-26

      For anyone else who has run into this. The fix is not to use a unicode string as the query string. So, this will work:

      cursor.execute("insert into documents(authors) values(%s)", value)

      but this won't:

      cursor.execute(u"insert into documents(authors) values(%s)", value)

      For details, check out the bug report below.

      http://sourceforge.net/tracker/index.php?func=detail&aid=1438913&group_id=22307&atid=374932

       
    • Andy Dustman

      Andy Dustman - 2006-02-24

      db.charset is set when the connection is created. Choosing another character set with a subsequent SQL statement does not change this. There are three ways to fix this.

      1) Set the character set properly when connecting. This can only currently be done with a configuration file and the read_default_file option.

      2) Set the correct default character set in the server configuration.

      3) This is untested, but I suppose you could try to set db.charset to 'utf8' after you execute your SET NAMES UTF8 statement. I think that ought to work though it's not real pretty.

       
      • Brett Powley

        Brett Powley - 2006-02-25

        Hi,

        Thanks for your reply.

        I had tried all of these, but I have tried them again just to make sure.

        *** (1) Updating my.cnf and restarting the server.
        I placed the default-character-set everywhere it could possibly go...

        /etc/my.cnf

        [client]
        default-character-set=utf8

        The MySQL server

        [mysqld]
        default-character-set=utf8

        [mysql]
        default-character-set=utf8

        *** Check that mysql reads this file:

        brett$ mysql
        felix:~/Documents brett$ mysql -uroot -p
        Enter password:
        Welcome to the MySQL monitor. Commands end with ; or \g.
        Your MySQL connection id is 2 to server version: 4.1.18-standard

        Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

        mysql> show variables like 'char%';
        +--------------------------+----------------------------------------------------------------------------------+
        | Variable_name | Value |
        +--------------------------+----------------------------------------------------------------------------------+
        | character_set_client | utf8 |
        | character_set_connection | utf8 |
        | character_set_database | utf8 |
        | character_set_results | utf8 |
        | character_set_server | utf8 |
        | character_set_system | utf8 |
        | character_sets_dir | /usr/local/mysql-standard-4.1.18-apple-darwin8.2.0-powerpc/share/mysql/charsets/ |
        +--------------------------+----------------------------------------------------------------------------------+
        7 rows in set (0.00 sec)

        mysql>

        *** Running my script gives the same 'latin-1' codec error

        brett$ python encoding-hack.py
        latin1
        Traceback (most recent call last):
        File "encoding-hack.py", line 22, in ?
        cursor.execute(u"insert into documents(authors) values(%s)", value)
        File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/site-packages/MySQLdb/cursors.py", line 137, in execute
        self.errorhandler(self, exc, value)
        File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/site-packages/MySQLdb/connections.py", line 33, in defaulterrorhandler
        raise errorclass, errorvalue
        UnicodeEncodeError: 'latin-1' codec can't encode character u'\uc481' in position 0: ordinal not in range(256)

        *** (2) Adding db.charset = 'utf8' -- now gives an 'ascii' codec error
        (Note that the name of the encoding in Python is actually 'utf-8', but using this instead doesn't make any difference)

        felix:~/Thesis/ExtractionCode brett$ python encoding-hack.py utf8Traceback (most recent call last):
        File "encoding-hack.py", line 22, in ?
        cursor.execute(u"insert into documents(authors) values(%s)", value)
        File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/site-packages/MySQLdb/cursors.py", line 137, in execute
        self.errorhandler(self, exc, value)
        File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/site-packages/MySQLdb/connections.py", line 33, in defaulterrorhandler
        raise errorclass, errorvalue
        UnicodeDecodeError: 'ascii' codec can't decode byte 0xec in position 1: ordinal not in range(128)

        *** (3) Adding read_default_file option, using the same configuration file as above

        Changing the connect to:

        db = MySQLdb.connect(host="localhost", user="user", passwd='pass', db='mydb',
        use_unicode=True, read_default_file='/etc/my.cnf')

        Gives:

        brett$ python encoding-hack.py
        utf8
        Traceback (most recent call last):
        File "encoding-hack.py", line 22, in ?
        cursor.execute(u"insert into documents(authors) values(%s)", value)
        File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/site-packages/MySQLdb/cursors.py", line 137, in execute
        self.errorhandler(self, exc, value)
        File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/site-packages/MySQLdb/connections.py", line 33, in defaulterrorhandler
        raise errorclass, errorvalue
        UnicodeDecodeError: 'ascii' codec can't decode byte 0xec in position 1: ordinal not in range(128)

         
        • Andy Dustman

          Andy Dustman - 2006-02-26

          Try (value,) instead of value. execute takes a tuple of values for the second parameter.

           
          • Brett Powley

            Brett Powley - 2006-02-26

            OK, I changed this line:

            cursor.execute(u"insert into documents(authors) values(%s)", (value))

            and I still get:

            UnicodeDecodeError: 'ascii' codec can't decode byte 0xec in position 1: ordinal not in range(128)

            Interestingly, 'forcing' utf8 (by adding the configuration file, defaults file, etc) gives an 'ascii' codec error rather than a 'latin1' error.

             
            • Andy Dustman

              Andy Dustman - 2006-02-26

              That's not what I told you to do. (value,) not (value)

               
              • Brett Powley

                Brett Powley - 2006-02-26

                Hi,

                Tried that too (I actually tried both before):

                cursor.execute(u"insert into documents(authors) values(%s)", (value,))

                Same result:

                File "encoding-hack.py", line 21, in ?
                cursor.execute(u"insert into documents(authors) values(%s)", (value,))
                File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/site-packages/MySQLdb/cursors.py", line 137, in execute
                self.errorhandler(self, exc, value)
                File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/site-packages/MySQLdb/connections.py", line 33, in defaulterrorhandler
                raise errorclass, errorvalue
                UnicodeDecodeError: 'ascii' codec can't decode byte 0xec in position 1: ordinal not in range(128)

                 
                • Andy Dustman

                  Andy Dustman - 2006-02-26

                  I have a unit test that inserts a wide range of unicode data, i.e.

                  ''.join([ unichr(row134+i13) for i in range(255) ])

                  where row varies from 0 to 100 and the column is defined as CHAR(255) CHARACTER SET UTF8 (also tried with TEXT) and I cannot replicate your error (also using MySQL-4.1). Add a bug to the tracker and include a test case that reproduces the problem. Remove any of your charset hackery (just use use_unicode=True) and print the connection's charset. Include a minimal table schema. Set default-character-set in a configuration file and reference it with read_default_file on connect(). Include a literal unicode string that fails.

                   
    • Brett Powley

      Brett Powley - 2006-02-26

      and another thing....

      I did try changing this line in connections.py

          self.charset = self.character_set_name().split('_')[0]
      

      to

          self.charset = 'utf8'
      

      but that doesn't work either (ascii codec error).

       

Log in to post a comment.

Want the latest updates on software, tech news, and AI?
Get latest updates about software, tech news, and AI from SourceForge directly in your inbox once a month.