Menu

Wrong utf8 rows returned

Help
2006-07-18
2012-09-19
  • Markus Gritsch

    Markus Gritsch - 2006-07-18

    When a table contains rows with the same text, except for Umlauts, my SELECT statement returns incorrect data:

    ---- 8< ----

    -- coding: latin-1 --

    import MySQLdb

    connection = MySQLdb.connect(host='localhost', db='test', user='markus', use_unicode=1, charset='utf8')
    cursor = connection.cursor()

    set-up our test-data

    cursor.execute("""
    DROP TABLE IF EXISTS person;
    """)
    cursor.execute("""
    CREATE TABLE person (
    name text NOT NULL
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
    """)
    cursor.execute("""
    INSERT INTO person (name) VALUES
    (%s),
    (%s);
    """, (u'Hugo', u'Hügö'))
    connection.commit()

    demonstrate the bug

    cursor.execute("""
    SELECT * FROM person WHERE name=%s;
    """, u'Hugo') # same result when specifying 'Hügö'
    for row in cursor.fetchall():
    print '>', row[0].encode('latin-1')

    output on my machine:

    > Hugo

    > Hügö

    ---- 8< ----

    I cannot figure out how to solve this and would greatly appreciate any help.

    My setup is:
    Windows XP SP2
    mysql-essential-5.0.22-win32.msi
    * MySQL-python.exe-1.2.1_p2.win32-py2.4.exe

    Thank you in advance,
    Markus

     
    • Markus Gritsch

      Markus Gritsch - 2006-07-19

      Success: I managed to patch the part of SQLObject, which generates the SQL for table creation. For MySQL databases it creates tables like this:

      CREATE TABLE person (
      name text NOT NULL
      ) COLLATE utf8_swedish_ci

      Additional note: The following utf8 collations treat e.g. 'ü' and 'u' as different characters:
      utf8_danish_ci
      utf8_estonian_ci
      utf8_hungarian_ci
      utf8_icelandic_ci
      utf8_roman_ci
      utf8_swedish_ci
      * utf8_turkish_ci

      Just for the records the complete test script, which behaves as I want it to do:

      ---- 8< ----

      -- coding: latin-1 --

      import MySQLdb

      connection = MySQLdb.connect(host='localhost', db='test', user='markus', use_unicode=1, charset='utf8')
      cursor = connection.cursor()

      set-up our test-data

      cursor.execute("""
      DROP TABLE IF EXISTS person
      """)
      cursor.execute("""
      CREATE TABLE person (
      name text NOT NULL
      ) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE utf8_swedish_ci
      """)
      cursor.executemany("""
      INSERT INTO person (name) VALUES
      (%s)
      """, (u'Hugo', u'Hügö'))
      connection.commit()

      cursor.execute("""
      SELECT * FROM person WHERE name=%s
      """, u'hÜGö')
      for row in cursor.fetchall():
      print '>', row[0].encode('latin-1')

       
      • Andy Dustman

        Andy Dustman - 2006-07-19

        You may also want to look at your server configuration as you can set the default character set and collation and storage engine there.

         
    • Markus Gritsch

      Markus Gritsch - 2006-07-20

      I am sorry, the previous URL is wrong. The bug I wanted to refer to is http://sourceforge.net/tracker/index.php?func=detail&aid=1521274&group_id=22307&atid=374932

       
    • Andy Dustman

      Andy Dustman - 2006-07-18

      A couple points:

      You should probably not use TEXT for this purpose. TEXT is really a non-BINARY BLOB. VARCHAR is more suitable.

      You should insert multiple rows like this:

      cursor.executemany("""
      INSERT INTO person (name) VALUES
      (%s)""", (u'Hugo', u'Hügö'))

      Don't use ; as an SQL terminator; it's not necessary and may break things.

      Lastly, I guess you think this means it should be insensitive to accents such as umlauts, since you are getting out what you put in, but I don't think this is true. You may want to spend some time reading this section of the MySQL manual.

      http://dev.mysql.com/doc/refman/5.0/en/charset.html

      Particularly the first section says this:

      """In real life, most character sets have many characters: not just ‘A’ and ‘B’ but whole alphabets, sometimes multiple alphabets or eastern writing systems with thousands of characters, along with many special symbols and punctuation marks. Also in real life, most collations have many rules, not just for whether to distinguish lettercase, but also for whether to distinguish accents (an “accent” is a mark attached to a character as in German ‘Ö’), and for multiple-character mappings (such as the rule that ‘Ö’ = ‘OE’ in one of the two German collations)."""

       
    • Markus Gritsch

      Markus Gritsch - 2006-07-19

      Thany you for the quick response.

      In reply to your remarks:
      ) TEXT is chosen by an ORM callen SQLObject, not me.
      ) Thank you for the hints regarding executemany() and ;
      ) I was expecting that it should be sensitive* to accents such as umlauts, since when I search for something containing e.g. an 'ü' I really do not want to get also results containing an 'u'.

      After reading all the documentation below the link you mentioned (thank you), and experimenting quite a lot, I managed to get only the row I really am looking for:

      cursor.execute("""
      SELECT * FROM person WHERE name=%s COLLATE utf8_bin
      """, u'Hugo')

      Using the additional 'COLLATE utf8_bin', only 'Hugo' is returned, what is the behavior I am looking for.

      However, since I am using SQLObject, which generates the SQL for my, I am looking for a way to set this somewhere outside, and not in every SQL statement. Unfortunately I was not successful.

      If you have any suggestion, it would be greate. Otherwise I think I will have to live with this annoyance.

      Kind regards,
      Markus

       
    • Markus Gritsch

      Markus Gritsch - 2006-07-20

      Thank you for the hint.

      Additional Note: Since SQLObject does not take heed of sending just strings containing %s to execute() and place the unicode parameters into the second argument tuple, it does not work with MySQLdb. I had to change execute() to only call .encode() on the given string if it is really a string and not already a unicode object. Quite akin to the issue mentioned in http://sourceforge.net/forum/forum.php?thread_id=1442563&forum_id=70461

      Be well,
      Markus

       

Log in to post a comment.

Want the latest updates on software, tech news, and AI?
Get latest updates about software, tech news, and AI from SourceForge directly in your inbox once a month.