Wrong utf8 rows returned

MySQL database connector for Python programming

Brought to you by: adustman

Wrong utf8 rows returned

Forum: Help

Creator: Markus Gritsch

Created: 2006-07-18

Updated: 2012-09-19

Markus Gritsch - 2006-07-18

When a table contains rows with the same text, except for Umlauts, my SELECT statement returns incorrect data:

---- 8< ----

-- coding: latin-1 --

import MySQLdb

connection = MySQLdb.connect(host='localhost', db='test', user='markus', use_unicode=1, charset='utf8')
cursor = connection.cursor()

set-up our test-data

cursor.execute("""
DROP TABLE IF EXISTS person;
""")
cursor.execute("""
CREATE TABLE person (
name text NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
""")
cursor.execute("""
INSERT INTO person (name) VALUES
(%s),
(%s);
""", (u'Hugo', u'Hügö'))
connection.commit()

demonstrate the bug

cursor.execute("""
SELECT * FROM person WHERE name=%s;
""", u'Hugo') # same result when specifying 'Hügö'
for row in cursor.fetchall():
print '>', row[0].encode('latin-1')

output on my machine:

> Hugo

> Hügö

---- 8< ----

I cannot figure out how to solve this and would greatly appreciate any help.

My setup is:
Windows XP SP2
mysql-essential-5.0.22-win32.msi
* MySQL-python.exe-1.2.1_p2.win32-py2.4.exe

Thank you in advance,
Markus

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Markus Gritsch - 2006-07-19
  
  Success: I managed to patch the part of SQLObject, which generates the SQL for table creation. For MySQL databases it creates tables like this:
  
  CREATE TABLE person (
  name text NOT NULL
  ) COLLATE utf8_swedish_ci
  
  Additional note: The following utf8 collations treat e.g. 'ü' and 'u' as different characters:
  utf8_danish_ci
  utf8_estonian_ci
  utf8_hungarian_ci
  utf8_icelandic_ci
  utf8_roman_ci
  utf8_swedish_ci
  * utf8_turkish_ci
  
  Just for the records the complete test script, which behaves as I want it to do:
  
  ---- 8< ----
  
  -- coding: latin-1 --
  
  import MySQLdb
  
  connection = MySQLdb.connect(host='localhost', db='test', user='markus', use_unicode=1, charset='utf8')
  cursor = connection.cursor()
  
  set-up our test-data
  
  cursor.execute("""
  DROP TABLE IF EXISTS person
  """)
  cursor.execute("""
  CREATE TABLE person (
  name text NOT NULL
  ) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE utf8_swedish_ci
  """)
  cursor.executemany("""
  INSERT INTO person (name) VALUES
  (%s)
  """, (u'Hugo', u'Hügö'))
  connection.commit()
  
  cursor.execute("""
  SELECT * FROM person WHERE name=%s
  """, u'hÜGö')
  for row in cursor.fetchall():
  print '>', row[0].encode('latin-1')
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
  - Andy Dustman - 2006-07-19
    
    You may also want to look at your server configuration as you can set the default character set and collation and storage engine there.
    
    If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Markus Gritsch - 2006-07-20
  
  I am sorry, the previous URL is wrong. The bug I wanted to refer to is http://sourceforge.net/tracker/index.php?func=detail&aid=1521274&group_id=22307&atid=374932
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Andy Dustman - 2006-07-18
  
  A couple points:
  
  You should probably not use TEXT for this purpose. TEXT is really a non-BINARY BLOB. VARCHAR is more suitable.
  
  You should insert multiple rows like this:
  
  cursor.executemany("""
  INSERT INTO person (name) VALUES
  (%s)""", (u'Hugo', u'Hügö'))
  
  Don't use ; as an SQL terminator; it's not necessary and may break things.
  
  Lastly, I guess you think this means it should be insensitive to accents such as umlauts, since you are getting out what you put in, but I don't think this is true. You may want to spend some time reading this section of the MySQL manual.
  
  http://dev.mysql.com/doc/refman/5.0/en/charset.html
  
  Particularly the first section says this:
  
  """In real life, most character sets have many characters: not just ‘A’ and ‘B’ but whole alphabets, sometimes multiple alphabets or eastern writing systems with thousands of characters, along with many special symbols and punctuation marks. Also in real life, most collations have many rules, not just for whether to distinguish lettercase, but also for whether to distinguish accents (an “accent” is a mark attached to a character as in German ‘Ö’), and for multiple-character mappings (such as the rule that ‘Ö’ = ‘OE’ in one of the two German collations)."""
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Markus Gritsch - 2006-07-19
  
  Thany you for the quick response.
  
  In reply to your remarks:
  ) TEXT is chosen by an ORM callen SQLObject, not me.
  ) Thank you for the hints regarding executemany() and ;
  ) I was expecting that it should be sensitive* to accents such as umlauts, since when I search for something containing e.g. an 'ü' I really do not want to get also results containing an 'u'.
  
  After reading all the documentation below the link you mentioned (thank you), and experimenting quite a lot, I managed to get only the row I really am looking for:
  
  cursor.execute("""
  SELECT * FROM person WHERE name=%s COLLATE utf8_bin
  """, u'Hugo')
  
  Using the additional 'COLLATE utf8_bin', only 'Hugo' is returned, what is the behavior I am looking for.
  
  However, since I am using SQLObject, which generates the SQL for my, I am looking for a way to set this somewhere outside, and not in every SQL statement. Unfortunately I was not successful.
  
  If you have any suggestion, it would be greate. Otherwise I think I will have to live with this annoyance.
  
  Kind regards,
  Markus
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Markus Gritsch - 2006-07-20
  
  Thank you for the hint.
  
  Additional Note: Since SQLObject does not take heed of sending just strings containing %s to execute() and place the unicode parameters into the second argument tuple, it does not work with MySQLdb. I had to change execute() to only call .encode() on the given string if it is really a string and not already a unicode object. Quite akin to the issue mentioned in http://sourceforge.net/forum/forum.php?thread_id=1442563&forum_id=70461
  
  Be well,
  Markus
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Log in to post a comment.