Menu

Data Truncation with Unicode (UTF-8)?

Help
2006-05-23
2012-09-19
  • Leonard Lin

    Leonard Lin - 2006-05-23

    I'm running the latest release of MySQLdb, 1.2.1_p2 on FreeBSD 4.11 with Python 2.4.1 and MySQL 4.1.18.

    Show variables gives me 'utf8' for all the character_set variables (client, connection, database, results, server, and system) and my [mysql] default_character_set is also utf8

    Now, what I can't figure out is while I can SELECT and read out UTF-8 fine, that I can't seem to UPDATE UTF-8.

    I open my connection w/ charset='utf8', use_unicode=True and then call conn.set_character_set('utf8'), and cursor.execute("set character set utf8") and cursor.execute("set names utf8") and conn.charset = "utf8") to no avail. When I preint the conn.character_set_name(), it says 'latin1'

    And while I can send a unicode object, it gives me the following:

    File "mysql/MySQLdb/cursors.py", line 146, in execute
    query = query.encode(charset)
    UnicodeEncodeError: 'latin-1' codec can't encode characters in position 44-45: ordinal not in range(256)

    When I try to force the charset to UTF8, I get a truncation data:

    ./04-make-updates.py:52: Warning: Data truncated for column 'name' at row 1
    cursor.execute(sql)

    It seems pretty obvious that I just need to figure a way to get the connection out of latin 1 and set as utf8, but I'm not understanding why it's not being set properly.

    This problem seems to be similar to this person's: http://woss.name/2005/11/04/dire-not-understanding-of-unicode/

     
    • Andy Dustman

      Andy Dustman - 2006-05-23

      You missed this comment:

      http://woss.name/2005/11/04/dire-not-understanding-of-unicode/#comment-157

      If you are getting a truncation error, that is because utf8 is a multi-byte, variable-length character set, and your data has more bytes than it does characters.

      Keep it simple, use connect(use_unicode=True, charset="utf8").

      Note that the default my.cnf file is NOT read by default. I like to create a local config file and read that with connect(read_default_file="/path")

       
    • Leonard Lin

      Leonard Lin - 2006-05-23

      Andy, I really did try every single way to force it, including use_unicode and charset, to no avail. (also, when I tried loading a my.cnf, I was passing a read_default_file on connect):

      <pre>

      UTF8 writing cursor

      conn = MySQLdb.connect(..., charset='utf8', use_unicode=True)
      conn.set_character_set('utf8')

      cursor = conn.cursor()
      cursor.execute("set character set 'utf8'")
      cursor.execute('set names utf8')
      conn.charset = "utf-8"
      conn.commit()

      print conn.character_set_name()
      sys.exit()
      </pre>

      It seems like something should work, right? I even tried recompiling the _mysql.c file replacing the default latin1 w/ utf8, but still got nothing on the character_set_name(). I saw a post in the MySQL docs on a bug w/ character_set_name() not being fixed until 4.1.19, but that was with mixing up collations...

      I'm really stumped. The particular project I was working on was something that I could do in a batch format, so I just rewrote the code to output the UPDATES into a sql file and was able to load it in via the mysqlclient w/o problems.

       
      • Andy Dustman

        Andy Dustman - 2006-05-23

        connect(charset='utf8') internally does conn.set_character_set('utf8') if the default character set is not the one you specified. Executing SQL to set the character set won't work properly.

         
    • Leonard Lin

      Leonard Lin - 2006-05-23

      Worth mentioning that I started off with connect(use_unicode=True, charset="utf8") and that didn't work...

       
      • Andy Dustman

        Andy Dustman - 2006-05-23

        Define "didn't work". The truncation error actually indicates that it did work, i.e. the encoded string became too long.

         
        • Leonard Lin

          Leonard Lin - 2006-05-23

          So, by "didn't work," I mean that when I create a connection w/ connect(charset='utf8', use_unicode=T
          rue), conn.character_set_name() returns latin1, not utf8 and that as a result it crushes the utf8 that I'm submitting.

          Should this work? I've removed all the other calls. If character_set_name() should return utf8 and there aren't any clear steps I can try, I'll just have to follow up later when I have more time to see what's going on.

          Thanks for your help, either way, it's definitely appreciated.

           
          • Andy Dustman

            Andy Dustman - 2006-05-23

            Try this in your command-line client:

            show character set like '%'; # or
            show character set like 'utf%';

            and make sure utf8 is actually built into your server.

            Example with default character set of utf8:

            Python 2.4.3 (#1, May 8 2006, 15:15:48)
            [GCC 3.4.6 (Gentoo 3.4.6-r1, ssp-3.4.5-1.0, pie-8.7.9)] on linux2
            Type "help", "copyright", "credits" or "license" for more information.
            >>> import MySQLdb
            >>> db=MySQLdb.connect(use_unicode=True,read_default_file="~/.my.cnf")
            >>> db.character_set_name()
            'utf8'
            >>> db=MySQLdb.connect(use_unicode=True,charset="latin1",read_default_file="~/.my.cnf")
            >>> db.character_set_name() 'latin1'
            >>> db.set_character_set('utf8')
            >>> db.character_set_name()
            'utf8'
            >>> db.get_server_info()
            '5.0.21-debug-log'

            Example with default character set of latin1 (in .my.cnf):

            Python 2.4.3 (#1, May 8 2006, 15:15:48)
            [GCC 3.4.6 (Gentoo 3.4.6-r1, ssp-3.4.5-1.0, pie-8.7.9)] on linux2
            Type "help", "copyright", "credits" or "license" for more information.
            >>> import MySQLdb
            >>> db=MySQLdb.connect(use_unicode=True,read_default_file="~/.my.cnf")
            >>> db.character_set_name()
            'latin1'
            >>> db=MySQLdb.connect(use_unicode=True,charset="utf8",read_default_file="~/.my.cnf")
            >>> db.character_set_name()
            'utf8'
            >>> db.set_character_set('latin1')
            >>> db.character_set_name()
            'latin1'
            >>>

             
            • Leonard Lin

              Leonard Lin - 2006-05-23

              show character set like 'utf%' gives me a listing for utf8

              Here's my output:

              <pre>
              $ cat ~/.my.cnf
              [mysql]
              default_character_set = utf8

              $ python
              Python 2.4.1 (#2, Aug 12 2005, 18:39:37)
              [GCC 2.95.3 20010315 (release)] on freebsd4
              Type "help", "copyright", "credits" or "license" for more information.
              >>> import sys
              >>> sys.path.append('mysql')
              >>> import MySQLdb
              >>> db=MySQLdb.connect(use_unicode=True,read_default_file="~/.my.cnf")
              >>> db.character_set_name()
              'latin1'
              >>> db=MySQLdb.connect(use_unicode=True,charset="latin1",read_default_file="~/.my.cnf")
              >>> db.character_set_name()
              'latin1'
              >>> db.set_character_set('utf8')
              >>> db.character_set_name()
              'latin1'
              >>> db.get_server_info()
              '4.1.18-Yahoo-SMP-log'
              >>>
              </pre>

              Could this be a MySQL bug? Is there something wrong w/ my .my.cnf?

               
              • Andy Dustman

                Andy Dustman - 2006-05-23

                Isn't it supposed to be (in .my.cnf):

                default-character-set = utf8

                (i.e. dashes not underscores)

                ?

                Also, can you try:

                db=MySQLdb.connect(use_unicode=True,charset="utf8",read_default_file="~/.my.cnf")

                 
                • Leonard Lin

                  Leonard Lin - 2006-05-24

                  I tried both of those adding charset="utf8" before and after changing dashes in my ~/.my.conf file - still stuck on latin1.

                  But, your comment on the cnf file made me take a second look... and setting [mysql] to [client] did the trick in changing the default. I still can't change the characterset with db.character_set_name() but if anyone is having the problems I am (it must be specific to either my Python or MySQL build) hopefully this thread can help.

                  What's strange is in Perl, calling $db->do("SET NAMES 'utf8'") works without a hitch.

                  Thanks again for your assistance in chasing this down.

                   

Log in to post a comment.

Want the latest updates on software, tech news, and AI?
Get latest updates about software, tech news, and AI from SourceForge directly in your inbox once a month.