cursor.execute("""
DROP TABLE IF EXISTS person;
""")
cursor.execute("""
CREATE TABLE person ( name text NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
""")
cursor.execute("""
INSERT INTO person (name) VALUES
(%s),
(%s);
""", (u'Hugo', u'Hügö'))
connection.commit()
demonstrate the bug
cursor.execute("""
SELECT * FROM person WHERE name=%s;
""", u'Hugo') # same result when specifying 'Hügö'
for row in cursor.fetchall():
print '>', row[0].encode('latin-1')
output on my machine:
> Hugo
> Hügö
---- 8< ----
I cannot figure out how to solve this and would greatly appreciate any help.
My setup is: Windows XP SP2 mysql-essential-5.0.22-win32.msi
* MySQL-python.exe-1.2.1_p2.win32-py2.4.exe
Thank you in advance,
Markus
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Success: I managed to patch the part of SQLObject, which generates the SQL for table creation. For MySQL databases it creates tables like this:
CREATE TABLE person ( name text NOT NULL
) COLLATE utf8_swedish_ci
Additional note: The following utf8 collations treat e.g. 'ü' and 'u' as different characters: utf8_danish_ci utf8_estonian_ci utf8_hungarian_ci utf8_icelandic_ci utf8_roman_ci utf8_swedish_ci
* utf8_turkish_ci
Just for the records the complete test script, which behaves as I want it to do:
cursor.execute("""
DROP TABLE IF EXISTS person
""")
cursor.execute("""
CREATE TABLE person ( name text NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE utf8_swedish_ci
""")
cursor.executemany("""
INSERT INTO person (name) VALUES
(%s)
""", (u'Hugo', u'Hügö'))
connection.commit()
cursor.execute("""
SELECT * FROM person WHERE name=%s
""", u'hÜGö')
for row in cursor.fetchall():
print '>', row[0].encode('latin-1')
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
You should probably not use TEXT for this purpose. TEXT is really a non-BINARY BLOB. VARCHAR is more suitable.
You should insert multiple rows like this:
cursor.executemany("""
INSERT INTO person (name) VALUES
(%s)""", (u'Hugo', u'Hügö'))
Don't use ; as an SQL terminator; it's not necessary and may break things.
Lastly, I guess you think this means it should be insensitive to accents such as umlauts, since you are getting out what you put in, but I don't think this is true. You may want to spend some time reading this section of the MySQL manual.
"""In real life, most character sets have many characters: not just ‘A’ and ‘B’ but whole alphabets, sometimes multiple alphabets or eastern writing systems with thousands of characters, along with many special symbols and punctuation marks. Also in real life, most collations have many rules, not just for whether to distinguish lettercase, but also for whether to distinguish accents (an “accent” is a mark attached to a character as in German ‘Ö’), and for multiple-character mappings (such as the rule that ‘Ö’ = ‘OE’ in one of the two German collations)."""
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
In reply to your remarks: ) TEXT is chosen by an ORM callen SQLObject, not me. ) Thank you for the hints regarding executemany() and ; ) I was expecting that it should be sensitive* to accents such as umlauts, since when I search for something containing e.g. an 'ü' I really do not want to get also results containing an 'u'.
After reading all the documentation below the link you mentioned (thank you), and experimenting quite a lot, I managed to get only the row I really am looking for:
cursor.execute("""
SELECT * FROM person WHERE name=%s COLLATE utf8_bin
""", u'Hugo')
Using the additional 'COLLATE utf8_bin', only 'Hugo' is returned, what is the behavior I am looking for.
However, since I am using SQLObject, which generates the SQL for my, I am looking for a way to set this somewhere outside, and not in every SQL statement. Unfortunately I was not successful.
If you have any suggestion, it would be greate. Otherwise I think I will have to live with this annoyance.
Kind regards,
Markus
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Additional Note: Since SQLObject does not take heed of sending just strings containing %s to execute() and place the unicode parameters into the second argument tuple, it does not work with MySQLdb. I had to change execute() to only call .encode() on the given string if it is really a string and not already a unicode object. Quite akin to the issue mentioned in http://sourceforge.net/forum/forum.php?thread_id=1442563&forum_id=70461
Be well,
Markus
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
When a table contains rows with the same text, except for Umlauts, my SELECT statement returns incorrect data:
---- 8< ----
-- coding: latin-1 --
import MySQLdb
connection = MySQLdb.connect(host='localhost', db='test', user='markus', use_unicode=1, charset='utf8')
cursor = connection.cursor()
set-up our test-data
cursor.execute("""
DROP TABLE IF EXISTS
person
;""")
cursor.execute("""
CREATE TABLE
person
(name
text NOT NULL) ENGINE=InnoDB DEFAULT CHARSET=utf8;
""")
cursor.execute("""
INSERT INTO
person
(name
) VALUES(%s),
(%s);
""", (u'Hugo', u'Hügö'))
connection.commit()
demonstrate the bug
cursor.execute("""
SELECT * FROM person WHERE name=%s;
""", u'Hugo') # same result when specifying 'Hügö'
for row in cursor.fetchall():
print '>', row[0].encode('latin-1')
output on my machine:
> Hugo
> Hügö
---- 8< ----
I cannot figure out how to solve this and would greatly appreciate any help.
My setup is:
Windows XP SP2
mysql-essential-5.0.22-win32.msi
* MySQL-python.exe-1.2.1_p2.win32-py2.4.exe
Thank you in advance,
Markus
Success: I managed to patch the part of SQLObject, which generates the SQL for table creation. For MySQL databases it creates tables like this:
CREATE TABLE
person
(name
text NOT NULL) COLLATE utf8_swedish_ci
Additional note: The following utf8 collations treat e.g. 'ü' and 'u' as different characters:
utf8_danish_ci
utf8_estonian_ci
utf8_hungarian_ci
utf8_icelandic_ci
utf8_roman_ci
utf8_swedish_ci
* utf8_turkish_ci
Just for the records the complete test script, which behaves as I want it to do:
---- 8< ----
-- coding: latin-1 --
import MySQLdb
connection = MySQLdb.connect(host='localhost', db='test', user='markus', use_unicode=1, charset='utf8')
cursor = connection.cursor()
set-up our test-data
cursor.execute("""
DROP TABLE IF EXISTS
person
""")
cursor.execute("""
CREATE TABLE
person
(name
text NOT NULL) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE utf8_swedish_ci
""")
cursor.executemany("""
INSERT INTO
person
(name
) VALUES(%s)
""", (u'Hugo', u'Hügö'))
connection.commit()
cursor.execute("""
SELECT * FROM person WHERE name=%s
""", u'hÜGö')
for row in cursor.fetchall():
print '>', row[0].encode('latin-1')
You may also want to look at your server configuration as you can set the default character set and collation and storage engine there.
I am sorry, the previous URL is wrong. The bug I wanted to refer to is http://sourceforge.net/tracker/index.php?func=detail&aid=1521274&group_id=22307&atid=374932
A couple points:
You should probably not use TEXT for this purpose. TEXT is really a non-BINARY BLOB. VARCHAR is more suitable.
You should insert multiple rows like this:
cursor.executemany("""
INSERT INTO
person
(name
) VALUES(%s)""", (u'Hugo', u'Hügö'))
Don't use ; as an SQL terminator; it's not necessary and may break things.
Lastly, I guess you think this means it should be insensitive to accents such as umlauts, since you are getting out what you put in, but I don't think this is true. You may want to spend some time reading this section of the MySQL manual.
http://dev.mysql.com/doc/refman/5.0/en/charset.html
Particularly the first section says this:
"""In real life, most character sets have many characters: not just ‘A’ and ‘B’ but whole alphabets, sometimes multiple alphabets or eastern writing systems with thousands of characters, along with many special symbols and punctuation marks. Also in real life, most collations have many rules, not just for whether to distinguish lettercase, but also for whether to distinguish accents (an “accent” is a mark attached to a character as in German ‘Ö’), and for multiple-character mappings (such as the rule that ‘Ö’ = ‘OE’ in one of the two German collations)."""
Thany you for the quick response.
In reply to your remarks:
) TEXT is chosen by an ORM callen SQLObject, not me.
) Thank you for the hints regarding executemany() and ;
) I was expecting that it should be sensitive* to accents such as umlauts, since when I search for something containing e.g. an 'ü' I really do not want to get also results containing an 'u'.
After reading all the documentation below the link you mentioned (thank you), and experimenting quite a lot, I managed to get only the row I really am looking for:
cursor.execute("""
SELECT * FROM person WHERE name=%s COLLATE utf8_bin
""", u'Hugo')
Using the additional 'COLLATE utf8_bin', only 'Hugo' is returned, what is the behavior I am looking for.
However, since I am using SQLObject, which generates the SQL for my, I am looking for a way to set this somewhere outside, and not in every SQL statement. Unfortunately I was not successful.
If you have any suggestion, it would be greate. Otherwise I think I will have to live with this annoyance.
Kind regards,
Markus
Thank you for the hint.
Additional Note: Since SQLObject does not take heed of sending just strings containing %s to execute() and place the unicode parameters into the second argument tuple, it does not work with MySQLdb. I had to change execute() to only call .encode() on the given string if it is really a string and not already a unicode object. Quite akin to the issue mentioned in http://sourceforge.net/forum/forum.php?thread_id=1442563&forum_id=70461
Be well,
Markus