Re: [Pyparsing] HTML Injection OR Word boundary detection from HTML

With the following:

-----
from pyparsing import *
html = EXAMPLE_HTML_FROM_PREVIOUS_POST
word = Word(printables)
word.ignore(anyOpenTag)
word.ignore(anyCloseTag)
word.ignore(commonHTMLEntity)

text = word
for w, s, e in text.scanString(html):
    print('%s between %s and %s [From HTML: %s]' % (w, s, e, html[s:e]))
-----

I get (with some omissions):

-----
....
['I'] between 2443 and 2444 [From HTML: I]
['have'] between 2445 and 2449 [From HTML: have]
['so'] between 2450 and 2452 [From HTML: so]
['many'] between 2453 and 2457 [From HTML: many]
['National'] between 2479 and 2487 [From HTML: National]
['Geographic</span>&rsquo;s'] between 2488 and 2513 [From HTML: Geographic</span>&rsquo;s]
['at'] between 2514 and 2516 [From HTML: at]
['home.</p>'] between 2517 and 2526 [From HTML: home.</p>]
-----

Which is a *great* start. From here, if I could:

1) Suppress any HTML tags in the string
2) Check the HTML entities against a list of 'splits' (e.g. endash, emdash,
etc.) and convert those to a space; otherwise convert the entity to UTF-8.

Then I'd be good to go, I think! I can then use the word boundaries to inject
the tags, and use the parsed string for my secondary process (which needs a
UTF-8 string).
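
For illustration, the behaviour described in 1) and 2) could be sketched
roughly as below. This is plain Python 3 (re plus html.unescape), not
pyparsing, so it only pins down the desired result and is not a suggestion
for how the grammar should be written. The names SPLIT_ENTITIES and
clean_words, and the choice of which entities count as 'splits', are made up
for the example.

```python
import re
from html import unescape

# Hypothetical list of entities that should act as word breaks.
SPLIT_ENTITIES = {"&ndash;", "&mdash;", "&nbsp;"}

# One token per match: a tag, an entity, a run of whitespace, or one character.
TOKEN = re.compile(
    r"(?P<tag><[^>]+>)"
    r"|(?P<entity>&#?\w+;)"
    r"|(?P<space>\s+)"
    r"|(?P<char>.)",
    re.DOTALL,
)


def clean_words(html_text):
    """Return (word, start, end) tuples; offsets index into html_text."""
    words = []
    chars, start, end = [], None, None

    def flush():
        nonlocal chars, start, end
        if chars:
            words.append(("".join(chars), start, end))
        chars, start, end = [], None, None

    for m in TOKEN.finditer(html_text):
        kind = m.lastgroup
        if kind == "tag":
            continue  # 1) suppress the tag; it does not end the word
        if kind == "space" or (kind == "entity" and m.group() in SPLIT_ENTITIES):
            flush()  # 2) 'split' entities behave like whitespace
            continue
        # 2) any other entity is decoded to its character
        text = unescape(m.group()) if kind == "entity" else m.group()
        if start is None:
            start = m.start()
        chars.append(text)
        end = m.end()
    flush()
    return words


sample = "<p>so many <span>National Geographic</span>&rsquo;s at home.</p>"
for word, s, e in clean_words(sample):
    print("%s between %s and %s [From HTML: %s]" % (word, s, e, sample[s:e]))
```

The offsets still point into the original HTML, so they can be used for
injecting tags, while the cleaned words can be joined with spaces and passed
through .encode('utf-8') for the secondary process.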