I am parsing a feed using the Universal Feed Parser and I get the following error when inserting into MySQL.
=====
Traceback (most recent call last):
  File "test.py", line 20, in ?
    cursor.execute("""INSERT INTO news (content) VALUES (%s)""", (content))
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/site-packages/MySQLdb/cursors.py", line 95, in execute
    return self._execute(query, args)
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/site-packages/MySQLdb/cursors.py", line 114, in _execute
    self.errorhandler(self, exc, value)
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/site-packages/MySQLdb/connections.py", line 33, in defaulterrorhandler
    raise errorclass, errorvalue
UnicodeEncodeError: 'latin-1' codec can't encode character '\u2019' in position 6: ordinal not in range(256)
=====
I am running OSX Panther, Python 2.3, and MySQLdb 1.0. The character encoding that was used to parse the feed is utf-8. This also happens with Fedora.
This probably happens because Universal Feed Parser returns character data as Python unicode objects, but MySQL does not (yet) support Unicode (4.1 will include UTF-8 support), so the character data gets encoded into plain 8-bit strings (Python's str type) using latin-1 (AFAIK the MySQL default encoding).
The problem arises when a Unicode character isn't in the latin-1 table, say a curly quote or a curly apostrophe.
I've faced the same problem in the past and I just wrote a Python function that performs the unicode --> latin-1 encoding itself, instead of relying on the built-in converter, so that the system ignores possible encoding errors:
    def safeEncode(v):  # v is a unicode value
        try:
            # force encoding to latin-1
            v = v.encode('latin-1', 'replace')
        except UnicodeEncodeError, ex:  # unlikely
            print ex  # let us know
        return v
'replace' will replace troublesome characters with a question mark. You can try 'ignore' instead, which drops them silently, and see what happens.
So, your query becomes:
cursor.execute("""INSERT INTO news (content) VALUES (%s)""", (safeEncode(content), ))
HTH,
deelan
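To make the difference between the error modes concrete, here is a minimal sketch using one of the strings from the question. The helper name encode_latin1 is hypothetical and only there for illustration; the code is written so that it also runs on current Python versions, which is an assumption beyond the Python 2.3 setup discussed in this thread.

```python
# Sample text from the question; \u2019 is the curly apostrophe at position 6.
text = u"Arthur\u2019s vade mecum"

def encode_latin1(value, errors="strict"):
    """Encode a unicode value to latin-1 with the given error mode (hypothetical helper)."""
    return value.encode("latin-1", errors)

# 'strict' (the default) raises, which is the failure shown in the traceback.
try:
    encode_latin1(text)
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# 'replace' substitutes a question mark for each unencodable character.
replaced = encode_latin1(text, "replace")

# 'ignore' drops unencodable characters entirely.
ignored = encode_latin1(text, "ignore")
```

With 'replace' the apostrophe becomes "?", while with 'ignore' it disappears, so "Arthur’s" is stored as "Arthur?s" or "Arthurs" respectively. Either way the original character is lost.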
Perhaps the converter for types.UnicodeType should also respect the 'unicode' setting of connection objects. Setting the 'unicode' parameter for a connection means the programmer wants to read unicode objects from queries, so it's reasonable to expect that writing unicode objects into queries works too. I did this and can insert unicode objects without problems:
    import types
    from datetime import datetime

    import MySQLdb

    encoding = 'utf-8'
    db = MySQLdb.connect(
        db = "test", user = "", passwd = "", unicode = encoding)

    def unicode_literal(u, dummy = None) :
        """ Unicode converter that respects our encoding. """
        return db.literal(u.encode(encoding))

    def install_encoding(db) :
        """ Install our unicode converter.

        Simply passing converter dictionary into initializer isn't going to
        work since they will be replaced.
        """
        db.converter[types.UnicodeType] = unicode_literal

    install_encoding(db)

    c = db.cursor()
    c.execute("""
        INSERT `test_project` VALUES (%s, %s, %s, %s)
        """, ("hello", u"some unicode literals", None, datetime.now()))
My MySQL insert code is as follows:
    try:
        cursor.execute("""INSERT INTO news (content) VALUES (%s)""", (content))
    except MySQLdb.Error, e:
        print "Error %d: %s" % (e.args[0], e.args[1])
=====
The error above happens, for example, when I set content to the following, which was parsed by Universal Feed Parser:
Arthur’s vade mecum - on Competitive Intelligence
or
=====
Let’s all be individuals…now, everybody repeat after me, ‘Let’s all be individuals’
=====
I realize this is an encoding issue, but do not know how to handle it. Any ideas are appreciated.
oops, safeEncode() indentation is gone. check this instead:
http://pastebin.de/pastebin.py?id=1278
Thank you deelan, it works great.
I happen to be storing data captured by the Universal Feed Parser for search purposes, so the 'ignore' option is slightly better than 'replace'.