Thread: [Pyparsing] Problem with eastern european characters when scraping data from the European Parliamen

Dear pyparsing experts,

I am trying to scrape a large amount of data from the European Parliament
website for a research project. The first step is to build a list of
all parliamentarians, but many names contain Eastern European accented
characters, and those entries go missing from my results. Here is an
example of the HTML that is giving me trouble (note the accented letter
at the end of the family name):

     <td class="listcontentlight_left">
     <a href="/members/expert/alphaOrder/view.do?language=EN&amp;id=28276" title="ANDRIKIENĖ, Laima Liucija">ANDRIKIENĖ, Laima Liucija</a>
     <br/>
     Group of the European People's Party (Christian Democrats)
     <br/>
     </td>

Here is the URL from which the HTML example is taken:
http://www.europarl.europa.eu/members/expert/alphaOrder.do?letter=B&language=EN

So far I have been using pyparsing with the following code (I know it
does not handle hyphens and so forth; this is just a test to see whether
I can extract the name shown above):

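     # parser for names of the form ">FAMILY NAME, Given Names<"
     name = Word(alphanums + alphas8bit)
     begin, end = map(Suppress, "><")
     names = begin + ZeroOrMore(name) + "," + ZeroOrMore(name) + end

     # print every match found in the page (loop variable renamed so it
     # no longer rebinds the `name` parser element)
     for match in names.searchString(page):
         print(match)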

However, this does not match the name in the HTML above. Any advice
on how to proceed?

Best, Thomas

P.S. Here is all the code I have so far:

     # -*- coding: utf-8 -*-

     import urllib.request
     from pyparsing_py3 import *

     page = urllib.request.urlopen("http://www.europarl.europa.eu/members/expert/alphaOrder.do?letter=B&language=EN")
     page = page.read().decode("utf8")

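     # parser for names of the form ">FAMILY NAME, Given Names<"
     name = Word(alphanums + alphas8bit)
     begin, end = map(Suppress, "><")
     names = begin + ZeroOrMore(name) + "," + ZeroOrMore(name) + end

     # print every match found in the page (loop variable renamed so it
     # no longer rebinds the `name` parser element)
     for match in names.searchString(page):
         print(match)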
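
A possible explanation and workaround for the question above, offered as a sketch rather than a confirmed fix: alphas8bit appears to cover only the accented letters of Latin-1 (U+00C0 to U+00FF), while "Ė" is U+0116 and so lies outside that range. The snippet below builds a wider letter set by hand and passes it to Word. The names latin1_letters, extended_letters and build_names_parser, the upper bound U+024F, and the plain `pyparsing` import (the post uses pyparsing_py3) are all assumptions made for illustration.

```python
# Accented letters in the Latin-1 range U+00C0..U+00FF. This is meant to
# mirror what alphas8bit covers; treat that equivalence as an assumption.
latin1_letters = "".join(chr(c) for c in range(0xC0, 0x100) if chr(c).isalpha())

# Assumption: letters up to U+024F (Latin Extended-A and -B) are enough for
# the names on the page. Widen the range if other scripts show up.
extended_letters = "".join(chr(c) for c in range(0xC0, 0x250) if chr(c).isalpha())


def build_names_parser():
    """Build a parser for text of the form '>FAMILY NAME, Given Names<'."""
    # Imported here so the letter sets above can be inspected without pyparsing.
    from pyparsing import Suppress, Word, ZeroOrMore, alphanums

    # Same grammar as in the post, with the wider letter set swapped in.
    name_word = Word(alphanums + extended_letters)
    begin, end = map(Suppress, "><")
    return begin + ZeroOrMore(name_word) + "," + ZeroOrMore(name_word) + end


# "Ė" (U+0116) is missing from the Latin-1 set but present in the wider one.
print("Ė" in latin1_letters)    # False
print("Ė" in extended_letters)  # True
```

With pyparsing installed, build_names_parser().searchString(page) could stand in for names.searchString(page) in the code above.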

pyparsing-users