Stripping HTML tags when using pyparsing

MickH
2006-07-19
2013-05-14
  • MickH

    MickH - 2006-07-19

    Hallo

    Could anyone assist a newbie please -

    I have created a program based on Paul McGuire's example in 'Building Recursive Descent Parsers with Python':

    -
    from pyparsing import *
    import urllib.request

    # define basic text patterns for search

    skipA = Literal('25px">')
    searchString = Word(alphas) + 'to:'
    tdStart = Literal('"routeTitle">').suppress()
    #tdEnd = Literal("</TBODY>").suppress()
    tdEnd = Literal("<!--==").suppress()
    aa = Literal('<TR>').suppress()

    # a route title, then everything up to the end marker
    parama = tdStart + searchString.setResultsName("from") + SkipTo(tdEnd).setResultsName("details") + tdEnd

    # get list of Routes & prices (+ Loads of other stuff)

    aUrl = "http://www.aerarann.ie/"
    aPage = urllib.request.urlopen(aUrl)
    # read() returns bytes on Python 3; UTF-8 is an assumption about the page encoding
    aListHTML = aPage.read().decode("utf-8", errors="replace")
    aPage.close()

    for srvrtokens, startloc, endloc in parama.scanString(aListHTML):
        print('DETAILS EXTRACTED : ', "%(from)-15s : %(details)20s" % srvrtokens)

    The code works, but it brings back all of the HTML tags (except those I search on), when all I want is the embedded data.

    Also, the data occurs in several blocks, but if I set 'tdEnd' to a marker that terminates each block, it only brings back the first block. So I had to set tdEnd to '</TBODY>', which only occurs at the end of the page (and which brings back all of the data).

    Any assistance would be gratefully received.

    Thanks

    Mick

     
    • Paul McGuire

      Paul McGuire - 2007-04-13

      Mick -

      The latest release contains the example htmlStripper.py, which might help you (although you have probably moved on with your life...)

      Sorry to not be more responsive,
      -- Paul
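
For readers who find this thread later, here is a minimal sketch of the tag-stripping idea applied to the question above. It is not a copy of htmlStripper.py. The sample markup, the "origin" results name, and the helper names strip_html and extract_routes are all made up for illustration, and the real page may be laid out differently. It uses pyparsing's built-in anyOpenTag, anyCloseTag and htmlComment expressions; newer pyparsing releases also offer snake_case spellings of these names.

```python
from pyparsing import (Literal, SkipTo, Word, alphas,
                       anyCloseTag, anyOpenTag, htmlComment)

# Suppressing every tag and comment means transformString() drops them
# from the text and leaves everything else untouched.
tag_stripper = (htmlComment | anyOpenTag | anyCloseTag).suppress()


def strip_html(html):
    """Return html with tags and comments removed."""
    return tag_stripper.transformString(html)


# Hypothetical markup, loosely modelled on the patterns in the question.
sample = (
    '<td class="routeTitle">Dublin to:</td>'
    '<tr><td><b>Cork</b></td> <td>from 25</td></tr><!--== end ==-->'
    '<td class="routeTitle">Galway to:</td>'
    '<tr><td><b>Dublin</b></td> <td>from 30</td></tr><!--== end ==-->'
)

td_start = Literal('"routeTitle">').suppress()
block_end = Literal("<!--==")  # assumed to close every block

# SkipTo stops in front of block_end without consuming it, so scanString
# can carry on and find the next block.
route = td_start + Word(alphas)("origin") + "to:" + SkipTo(block_end)("details")


def extract_routes(html):
    """Return (origin, tag-free details) pairs, one per block."""
    results = []
    for tokens, start, end in route.scanString(html):
        details = strip_html(tokens["details"]).strip()
        results.append((tokens["origin"], details))
    return results


if __name__ == "__main__":
    for origin, details in extract_routes(sample):
        print("%-15s : %s" % (origin, details))
```

The design choice here is to do the work in two passes: scanString locates each block, and the tag stripper then cleans only the captured details text. Character entities such as &amp;nbsp; are left as they are in this sketch.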

       
