I'm trying to grab a document off the Web and toss it
into a MySQL database, but I keep running into the
various encoding problems with Unicode (that aren't
a problem for me with GB2312, BIG 5, etc.)
What I'd like is something as simple as:
CREATE TABLE junk (junklet VARCHAR(2500) CHARACTER SET utf8);
import MySQLdb, re, urllib
data = urllib.urlopen('http://localhost/test.html').read()
data2 = ???
...
c.execute(''' INSERT INTO junk ( junklet) VALUES ( '%s') ''' % data2 )
where data2 is somehow the UTF-8 converted version of the original Web page.
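One possible shape for the `data2 = ???` step, as a sketch rather than a definitive answer: decode the raw bytes using the charset the page declares, then re-encode the text as UTF-8. The helper name `to_utf8` and its `fallback` parameter are hypothetical, and the code is written for Python 3, where bytes and text are separate types (the snippet above is Python 2).

```python
def to_utf8(raw, charset=None, fallback="utf-8"):
    """Decode raw bytes with the given charset and re-encode them as UTF-8.

    `charset` would normally come from the HTTP Content-Type header or a
    <meta> tag; `fallback` is an assumption used when none is declared.
    """
    text = raw.decode(charset or fallback, "replace")  # bytes -> text
    return text.encode("utf-8")                        # text -> UTF-8 bytes


# Stand-in for the bytes a fetch would return (here encoded as GB2312).
raw = "\u4f60\u597d, world".encode("gb2312")
data2 = to_utf8(raw, "gb2312")
print(data2.decode("utf-8"))
```

The point of the design is that the source charset has to be known (or guessed) before anything else; calling `.encode('utf-8')` directly on bytes that are in some other encoding cannot work.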
Additionally, I'd like to be able to do:
body_expr = re.compile('''<!-- MAIN -->(.*)<!-- /MAIN -->''')
data = urllib.urlopen('http://localhost/test.html').read()
main_body = body_expr.search(data).group(1)
and insert that into the database, and most likely I need to
I'm sitting on a dozen explanations from the Web of
how to do this:
0) decode('utf-8','ignore') or 'strict', or 'replace'...
1) using re.compile('''(?u)<!-- MAIN -->(.*)<!-- /MAIN -->''',
re.UNICODE | re.IGNORECASE | re.MULTILINE | re.DOTALL)
2) Convert to unicode before UTF-8
3) replace quotation marks within the SQL statement: data2.replace(u'"',u'\\"')
etc., etc., but after numerous tries I still end up with either SQL errors or
the dreaded "'ascii' codec can't decode byte ... in position ..." errors.
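For reference, a small sketch of what items 0) and 1) do in isolation. It assumes Python 3 and uses made-up sample bytes and a made-up `page` string in place of a fetched document, so it only illustrates the behaviour and is not a fix for the whole problem.

```python
import re

raw = b"caf\xc3\xa9 \xff"  # valid UTF-8 for "cafe" with an accent, a space, then one invalid byte

# 'strict' (the default) raises on the invalid byte.
try:
    raw.decode("utf-8", "strict")
except UnicodeDecodeError as exc:
    print("strict:", exc.reason)

ignored = raw.decode("utf-8", "ignore")    # drops the invalid byte
replaced = raw.decode("utf-8", "replace")  # substitutes U+FFFD for it

# Decoding non-ASCII bytes as ASCII is what produces the "'ascii' codec" error.
try:
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    print(exc)

# re.DOTALL lets (.*) span line breaks between the two markers.
body_expr = re.compile(r"<!-- MAIN -->(.*)<!-- /MAIN -->", re.DOTALL)
page = "<html><!-- MAIN -->line one\nline two<!-- /MAIN --></html>"
main_body = body_expr.search(page).group(1)
print(main_body)
```

Of the four flags in item 1), only `re.DOTALL` changes the match here, since the pattern contains no `^`, `$`, `\w`, or letters whose case varies in the sample.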
Can someone give me any explanation of how to do this easily? (5 line example would be great)
PS
Note that I am able to create Unicode data and insert it
with a carefully controlled unicode string:
data = u"Make \u0633\u0644\u0627\u0645, not war"
c.execute('''INSERT INTO junk (junklet) VALUES ('%s')''' % data.encode('utf-8', 'ignore'))
but this won't work with what I find on the Web.
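A likely reason the controlled string works and Web data does not is that real pages contain quote characters, which break a statement built with `'%s' % ...`. The sketch below shows the parameterized alternative, where the driver handles quoting. It uses `sqlite3` purely as a stand-in so it runs without a server (an assumption for illustration); with MySQLdb the placeholder is `%s` instead of `?`, and the value is still passed as a separate tuple.

```python
import sqlite3  # stand-in for MySQLdb so the sketch runs without a server

conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("CREATE TABLE junk (junklet VARCHAR(2500))")

# Sample text containing non-ASCII characters plus both kinds of quotes.
data = 'Make \u0633\u0644\u0627\u0645, not "war" (it\'s simpler)'

# The value is passed separately from the SQL, so quotes need no escaping.
# sqlite3 uses ? as the placeholder; MySQLdb uses %s in the same position.
c.execute("INSERT INTO junk (junklet) VALUES (?)", (data,))
conn.commit()

stored = c.execute("SELECT junklet FROM junk").fetchone()[0]
print(stored == data)
```

This would make item 3) above (escaping quotation marks by hand) unnecessary. The connection's own character set still has to match the table's, which this stand-in does not exercise.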
Thanks,
Bill