Dear PyParser Experts,
I am trying to scrape a lot of data from the European Parliament
website for a research project. The first step is to create a list of
all parliamentarians. However, because of the many Eastern European names
and the accented characters they use, I get a lot of missing entries. Here is
an example of what is giving me trouble (notice the accented character at the
end of the family name):
<td class="listcontentlight_left">
<a href="/members/expert/alphaOrder/view.do?language=EN&id=28276" title="ANDRIKIENĖ, Laima Liucija">ANDRIKIENĖ, Laima Liucija</a>
<br/>
Group of the European People's Party (Christian Democrats)
<br/>
</td>
Here is the URL from which the HTML example is taken:
http://www.europarl.europa.eu/members/expert/alphaOrder.do?letter=B&language=EN
So far I have been using PyParser and the following code (I know about
hyphens and so forth; this is just a test to see if I can get the name
listed above):
#parser_names
name = Word(alphanums + alphas8bit)
begin, end = map(Suppress, "><")
names = begin + ZeroOrMore(name) + "," + ZeroOrMore(name) + end
for name in names.searchString(page):
    print(name)
However, this does not catch the name from the HTML above. Any advice
on how to proceed?
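One check that may help narrow this down, sketched with the standard library only (not PyParser): as far as I can tell, alphas8bit only covers accented letters in the 8-bit Latin-1 range (code points 0xC0 to 0xFF), so it seems worth testing whether any letter in the name lies above that range. The helper name chars_outside_latin1 and the string unicode_alphas are made up for this illustration. The idea that unicode_alphas could be passed to Word() in place of alphanums + alphas8bit is an assumption and is not tested here.

```python
import unicodedata


def chars_outside_latin1(text):
    """Return the letters in text whose code point is above 0xFF."""
    return [ch for ch in text if ch.isalpha() and ord(ch) > 0xFF]


sample = "ANDRIKIENĖ, Laima Liucija"

# Report every letter that an 8-bit character set cannot contain.
for ch in chars_outside_latin1(sample):
    print(ch, hex(ord(ch)), unicodedata.name(ch))

# Hypothetical wider character set: every alphabetic character up to
# U+024F (the end of Latin Extended-B). This is an illustration only;
# names in other scripts would need a larger range.
unicode_alphas = "".join(chr(cp) for cp in range(0x250) if chr(cp).isalpha())
print("Ė" in unicode_alphas)
```

If the loop prints anything for a given name, that name contains a letter outside the 8-bit range, which would be consistent with the missing entries.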
Best, Thomas
P.S.: Here is all the code I have so far:
# -*- coding: utf-8 -*-
import urllib.request
from pyparsing_py3 import *
page = urllib.request.urlopen("http://www.europarl.europa.eu/members/expert/alphaOrder.do?letter=B&language=EN")
page = page.read().decode("utf8")
#parser_names
name = Word(alphanums + alphas8bit)
begin, end = map(Suppress, "><")
names = begin + ZeroOrMore(name) + "," + ZeroOrMore(name) + end
for name in names.searchString(page):
    print(name)