Re: [Pyparsing] HTML Injection OR Word boundary detection from HTML
From: Geoff J. <fo...@ju...> - 2013-01-15 21:01:30
With the following:

-----
from pyparsing import *

html = EXAMPLE_HTML_FROM_PREVIOUS_POST

# A "word" is any run of printable characters; tags and entities are skipped over
word = Word(printables)
word.ignore(anyOpenTag)
word.ignore(anyCloseTag)
word.ignore(commonHTMLEntity)
text = word

# scanString yields (tokens, start, end) for each match in the HTML
for w, s, e in text.scanString(html):
    print('%s between %s and %s [From HTML: %s]' % (w, s, e, html[s:e]))
-----

I get (with some omissions):

-----
....
['I'] between 2443 and 2444 [From HTML: I]
['have'] between 2445 and 2449 [From HTML: have]
['so'] between 2450 and 2452 [From HTML: so]
['many'] between 2453 and 2457 [From HTML: many]
['National'] between 2479 and 2487 [From HTML: National]
['Geographic</span>’s'] between 2488 and 2513 [From HTML: Geographic</span>’s]
['at'] between 2514 and 2516 [From HTML: at]
['home.</p>'] between 2517 and 2526 [From HTML: home.</p>]
-----

Which is a *great* start. From here, if I could:

1) Suppress any HTML tags in the string
2) Check the HTML entities against a list of 'splits' (e.g. endash, emdash etc.) and convert those to a space, otherwise convert the entity to UTF8.

Then I'd be good to go, I think! I can then use the word boundaries to inject the tags, and use the parsed string for my secondary process (which I need a UTF8 string for).
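For illustration, here is a rough sketch of what steps 1 and 2 might look like. It deliberately swaps in the Python 3 standard library (`re` and `html.unescape`) instead of pyparsing, so it is not the pyparsing answer to the question. The names `SPLIT_ENTITIES`, `TOKEN` and `clean` are made up for this example, and the tag pattern is a naive assumption that would not cope with comments or with `>` inside attribute values.

```python
import re
from html import unescape

# Hypothetical "split" list: entities that should become a space
SPLIT_ENTITIES = {"ndash", "mdash"}

# Matches either a tag (naive pattern) or a named/numeric entity
TOKEN = re.compile(r"<[^>]+>|&(#?\w+);")


def clean(html_text):
    """Drop tags, turn 'split' entities into spaces, decode other entities."""
    def repl(match):
        name = match.group(1)
        if name is None:            # the tag alternative matched
            return ""
        if name in SPLIT_ENTITIES:  # word-splitting entity
            return " "
        return unescape(match.group(0))  # e.g. &rsquo; becomes ’
    return TOKEN.sub(repl, html_text)


print(clean("<p>National <span>Geographic</span>&rsquo;s at home.</p>"))
```

Tags are replaced with an empty string so that something like `Geographic</span>&rsquo;s` stays a single word; block-level tags such as `</p>` would probably need to become a space instead if paragraphs can abut. The result is a Unicode string, which can be encoded with `.encode('utf-8')` where UTF8 bytes are needed.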