[Pyparsing] Problem with eastern european characters when scraping data from the European Parliamen
From: Thomas J. <tho...@eu...> - 2010-06-10 11:40:52
Dear pyparsing experts,

I am trying to scrape a lot of data from the European Parliament website for a research project. The first step is to create a list of all parliamentarians. However, because many Eastern European names contain accented characters, I get a lot of missing entries.

Here is an example of what is giving me trouble (notice the accent at the end of the family name):

    <td class="listcontentlight_left">
    <a href="/members/expert/alphaOrder/view.do?language=EN&id=28276" title="ANDRIKIENĖ, Laima Liucija">ANDRIKIENĖ, Laima Liucija</a>
    <br/>
    Group of the European People's Party (Christian Democrats)
    <br/>
    </td>

Here is the URL from which the HTML example is taken:

http://www.europarl.europa.eu/members/expert/alphaOrder.do?letter=B&language=EN

So far I have been using pyparsing and the following code. (I know about hyphens and so forth; this is just a test to see if I can get the name listed above.)

    #parser_names
    name = Word(alphanums + alphas8bit)
    begin, end = map(Suppress, "><")
    names = begin + ZeroOrMore(name) + "," + ZeroOrMore(name) + end

    for name in names.searchString(page):
        print(name)

However, this does not catch the name from the HTML above. Any advice on how to proceed?

Best,
Thomas

P.S. Here is all the code I have so far:

    # -*- coding: utf-8 -*-
    import urllib.request
    from pyparsing_py3 import *

    page = urllib.request.urlopen("http://www.europarl.europa.eu/members/expert/alphaOrder.do?letter=B&language=EN")
    page = page.read().decode("utf8")

    #parser_names
    name = Word(alphanums + alphas8bit)
    begin, end = map(Suppress, "><")
    names = begin + ZeroOrMore(name) + "," + ZeroOrMore(name) + end

    for name in names.searchString(page):
        print(name)
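
A likely cause, offered as a hedged observation rather than a confirmed diagnosis: `alphas8bit` covers only the accented letters of the 8-bit Latin-1 range (U+00C0 to U+00FF), while "Ė" is U+0116 and therefore falls outside it. The sketch below uses only the standard-library `re` module, not pyparsing, to illustrate the point on the sample fragment. The names `find_names`, `letter`, `word` and `name_pattern` are hypothetical helpers introduced for illustration, and the pattern assumes names of the form "FAMILY, Given" sitting directly between `>` and `<`.

```python
import re

# Sample anchor tag copied from the HTML fragment in the message.
html = (
    '<a href="/members/expert/alphaOrder/view.do?language=EN&id=28276" '
    'title="ANDRIKIENĖ, Laima Liucija">ANDRIKIENĖ, Laima Liucija</a>'
)

# "Ė" is U+0116, which is above 0xFF, so no 8-bit letter set can contain it.
code_point = ord("Ė")

# In Python 3 str patterns, [^\W\d_] matches any Unicode letter.
letter = r"[^\W\d_]"
# A word is a run of letters, optionally joined by hyphens (assumption).
word = rf"{letter}+(?:-{letter}+)*"
# Family name(s), a comma, given name(s), enclosed by ">" and "<".
name_pattern = re.compile(
    rf">\s*({word}(?:\s+{word})*)\s*,\s*({word}(?:\s+{word})*)\s*<"
)


def find_names(text):
    """Return (family, given) tuples for every 'FAMILY, Given' match."""
    return [(m.group(1), m.group(2)) for m in name_pattern.finditer(text)]


if __name__ == "__main__":
    print(hex(code_point))
    for family, given in find_names(html):
        print(family, "|", given)
```

The copy of the name inside the `title` attribute is not matched, because it is enclosed in quotes rather than `>` and `<`. The same idea should carry over to pyparsing by building the `Word` from a character set that includes letters beyond Latin-1, though that is left as an assumption here.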