MySQL for Python / Discussion / Help: Data Truncation with Unicode (UTF-8)?

Leonard Lin - 2006-05-23

I'm running the latest release of MySQLdb, 1.2.1_p2 on FreeBSD 4.11 with Python 2.4.1 and MySQL 4.1.18.

Show variables gives me 'utf8' for all the character_set variables (client, connection, database, results, server, and system) and my [mysql] default_character_set is also utf8

Now, what I can't figure out is while I can SELECT and read out UTF-8 fine, that I can't seem to UPDATE UTF-8.

I open my connection w/ charset='utf8', use_unicode=True and then call conn.set_character_set('utf8'), and cursor.execute("set character set utf8") and cursor.execute("set names utf8") and conn.charset = "utf8") to no avail. When I preint the conn.character_set_name(), it says 'latin1'

And while I can send a unicode object, it gives me the following:

File "mysql/MySQLdb/cursors.py", line 146, in execute
query = query.encode(charset)
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 44-45: ordinal not in range(256)

When I try to force the charset to UTF8, I get a truncation data:

./04-make-updates.py:52: Warning: Data truncated for column 'name' at row 1
cursor.execute(sql)

It seems pretty obvious that I just need to figure a way to get the connection out of latin 1 and set as utf8, but I'm not understanding why it's not being set properly.

This problem seems to be similar to this person's: http://woss.name/2005/11/04/dire-not-understanding-of-unicode/

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Andy Dustman - 2006-05-23
  
  You missed this comment:
  
  http://woss.name/2005/11/04/dire-not-understanding-of-unicode/#comment-157
  
  If you are getting a truncation error, that is because utf8 is a multi-byte, variable-length character set, and your data has more bytes than it does characters.
  
  Keep it simple, use connect(use_unicode=True, charset="utf8").
  
  Note that the default my.cnf file is NOT read by default. I like to create a local config file and read that with connect(read_default_file="/path")
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Leonard Lin - 2006-05-23
  
  Andy, I really did try every single way to force it, including use_unicode and charset, to no avail. (also, when I tried loading a my.cnf, I was passing a read_default_file on connect):
  
  <pre>
  
  UTF8 writing cursor
  
  conn = MySQLdb.connect(..., charset='utf8', use_unicode=True)
  conn.set_character_set('utf8')
  
  cursor = conn.cursor()
  cursor.execute("set character set 'utf8'")
  cursor.execute('set names utf8')
  conn.charset = "utf-8"
  conn.commit()
  
  print conn.character_set_name()
  sys.exit()
  </pre>
  
  It seems like something should work, right? I even tried recompiling the _mysql.c file replacing the default latin1 w/ utf8, but still got nothing on the character_set_name(). I saw a post in the MySQL docs on a bug w/ character_set_name() not being fixed until 4.1.19, but that was with mixing up collations...
  
  I'm really stumped. The particular project I was working on was something that I could do in a batch format, so I just rewrote the code to output the UPDATES into a sql file and was able to load it in via the mysqlclient w/o problems.
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
  - Andy Dustman - 2006-05-23
    
    connect(charset='utf8') internally does conn.set_character_set('utf8') if the default character set is not the one you specified. Executing SQL to set the character set won't work properly.
    
    If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Leonard Lin - 2006-05-23
  
  Worth mentioning that I started off with connect(use_unicode=True, charset="utf8") and that didn't work...
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
  - Andy Dustman - 2006-05-23
    
    Define "didn't work". The truncation error actually indicates that it did work, i.e. the encoded string became too long.
    
    If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
    - Leonard Lin - 2006-05-23
      
      So, by "didn't work," I mean that when I create a connection w/ connect(charset='utf8', use_unicode=T
      rue), conn.character_set_name() returns latin1, not utf8 and that as a result it crushes the utf8 that I'm submitting.
      
      Should this work? I've removed all the other calls. If character_set_name() should return utf8 and there aren't any clear steps I can try, I'll just have to follow up later when I have more time to see what's going on.
      
      Thanks for your help, either way, it's definitely appreciated.
      
      If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
      - Andy Dustman - 2006-05-23
        
        Try this in your command-line client:
        
        show character set like '%'; # or
        show character set like 'utf%';
        
        and make sure utf8 is actually built into your server.
        
        Example with default character set of utf8:
        
        Python 2.4.3 (#1, May 8 2006, 15:15:48)
        [GCC 3.4.6 (Gentoo 3.4.6-r1, ssp-3.4.5-1.0, pie-8.7.9)] on linux2
        Type "help", "copyright", "credits" or "license" for more information.
        >>> import MySQLdb
        >>> db=MySQLdb.connect(use_unicode=True,read_default_file="~/.my.cnf")
        >>> db.character_set_name()
        'utf8'
        >>> db=MySQLdb.connect(use_unicode=True,charset="latin1",read_default_file="~/.my.cnf")
        >>> db.character_set_name() 'latin1'
        >>> db.set_character_set('utf8')
        >>> db.character_set_name()
        'utf8'
        >>> db.get_server_info()
        '5.0.21-debug-log'
        
        Example with default character set of latin1 (in .my.cnf):
        
        Python 2.4.3 (#1, May 8 2006, 15:15:48)
        [GCC 3.4.6 (Gentoo 3.4.6-r1, ssp-3.4.5-1.0, pie-8.7.9)] on linux2
        Type "help", "copyright", "credits" or "license" for more information.
        >>> import MySQLdb
        >>> db=MySQLdb.connect(use_unicode=True,read_default_file="~/.my.cnf")
        >>> db.character_set_name()
        'latin1'
        >>> db=MySQLdb.connect(use_unicode=True,charset="utf8",read_default_file="~/.my.cnf")
        >>> db.character_set_name()
        'utf8'
        >>> db.set_character_set('latin1')
        >>> db.character_set_name()
        'latin1'
        >>>
        
        If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
        
        Leonard Lin - 2006-05-23
        
        show character set like 'utf%' gives me a listing for utf8
        
        Here's my output:
        
        <pre>
        $ cat ~/.my.cnf
        [mysql]
        default_character_set = utf8
        
        $ python
        Python 2.4.1 (#2, Aug 12 2005, 18:39:37)
        [GCC 2.95.3 20010315 (release)] on freebsd4
        Type "help", "copyright", "credits" or "license" for more information.
        >>> import sys
        >>> sys.path.append('mysql')
        >>> import MySQLdb
        >>> db=MySQLdb.connect(use_unicode=True,read_default_file="~/.my.cnf")
        >>> db.character_set_name()
        'latin1'
        >>> db=MySQLdb.connect(use_unicode=True,charset="latin1",read_default_file="~/.my.cnf")
        >>> db.character_set_name()
        'latin1'
        >>> db.set_character_set('utf8')
        >>> db.character_set_name()
        'latin1'
        >>> db.get_server_info()
        '4.1.18-Yahoo-SMP-log'
        >>>
        </pre>
        
        Could this be a MySQL bug? Is there something wrong w/ my .my.cnf?
        
        If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
        
        Andy Dustman - 2006-05-23
        
        Isn't it supposed to be (in .my.cnf):
        
        default-character-set = utf8
        
        (i.e. dashes not underscores)
        
        ?
        
        Also, can you try:
        
        db=MySQLdb.connect(use_unicode=True,charset="utf8",read_default_file="~/.my.cnf")
        
        If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
        
        Leonard Lin - 2006-05-24
        
        I tried both of those adding charset="utf8" before and after changing dashes in my ~/.my.conf file - still stuck on latin1.
        
        But, your comment on the cnf file made me take a second look... and setting [mysql] to [client] did the trick in changing the default. I still can't change the characterset with db.character_set_name() but if anyone is having the problems I am (it must be specific to either my Python or MySQL build) hopefully this thread can help.
        
        What's strange is in Perl, calling $db->do("SET NAMES 'utf8'") works without a hitch.
        
        Thanks again for your assistance in chasing this down.
        
        If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Data Truncation with Unicode (UTF-8)?

MySQL database connector for Python programming

Forums

Help

Data Truncation with Unicode (UTF-8)?

UTF8 writing cursor

Data Truncation with Unicode (UTF-8)?

MySQL database connector for Python programming

Forums

Help

Data Truncation with Unicode (UTF-8)? document.SUBSCRIPTION_OPTIONS = { "thing": "topic", "subscribed": false, "url": "subscribe", "icon": { "css": "fa fa-envelope-o" } };

UTF8 writing cursor

Data Truncation with Unicode (UTF-8)?