Re: [Pyparsing] HTML Injection OR Word boundary detection from HTML
From: Paul M. <pt...@au...> - 2013-01-16 04:11:57
Geoff -

Congratulations on your first steps with pyparsing. You have found scanString and how it returns the start and end locations of each match. Pyparsing also includes transformString, a wrapper around scanString that does the kind of injection you are doing. transformString applies any parse actions attached to the expression, and a parse action can modify or enhance the parsed text by returning a different string than the one passed in via the tokens argument. See how I've added a parse action to a slightly different version of your word expression:

    # assumes 'from pyparsing import *' and the html string from your script
    # a word starts with a letter, followed by any printables except '<', '>' and '&'
    word = Word(alphas, printables, excludeChars='<>&')
    word.ignore(anyOpenTag)
    word.ignore(anyCloseTag)
    word.ignore(commonHTMLEntity)

    tagnum = 0
    def addMarkTags(tokens):
        # wrap each matched word in a <mark> tag with a running id
        global tagnum
        tagnum += 1
        return "<mark id='%d'>%s</mark>" % (tagnum, tokens[0])
    word.setParseAction(addMarkTags)

    print(word.transformString(html))

This will print:

    <h2 class="chapterNumber"><span class="bold"><mark id='1'>S</mark></span><mark id='2'>ome</mark> <mark id='3'>Book</mark></h2>
    <p class="para"><mark id='4'>I</mark> <mark id='5'>have</mark> <mark id='6'>so</mark> <mark id='7'>many</mark> <span class="italic"><mark id='8'>National</mark> <mark id='9'>Geographic</mark></span>&<mark id='10'>rsquo;s</mark> <mark id='11'>at</mark> <mark id='12'>home.</mark></p>

I think transformString is the avenue to follow for this project.

-- Paul

-----Original Message-----
From: Geoff Jukes [mailto:fo...@ju...]
Sent: Tuesday, January 15, 2013 3:01 PM
To: pyp...@li...
Subject: Re: [Pyparsing] HTML Injection OR Word boundary detection from HTML

With the following:
-----
from pyparsing import *

html = EXAMPLE_HTML_FROM_PREVIOUS_POST

word = Word(printables)
word.ignore(anyOpenTag)
word.ignore(anyCloseTag)
word.ignore(commonHTMLEntity)
text = word

for w, s, e in text.scanString(html):
    print('%s between %s and %s [From HTML: %s]' % (w, s, e, html[s:e]))
-----
I get (with some omissions):
-----
....
['I'] between 2443 and 2444 [From HTML: I]
['have'] between 2445 and 2449 [From HTML: have]
['so'] between 2450 and 2452 [From HTML: so]
['many'] between 2453 and 2457 [From HTML: many]
['National'] between 2479 and 2487 [From HTML: National]
['Geographic</span>’s'] between 2488 and 2513 [From HTML: Geographic</span>’s]
['at'] between 2514 and 2516 [From HTML: at]
['home.</p>'] between 2517 and 2526 [From HTML: home.</p>]
-----

Which is a *great* start. From here, if I could:

1) Suppress any HTML tags in the string
2) Check the HTML entities against a list of 'splits' (e.g. endash, emdash etc.) and convert those to a space, otherwise convert the entity to UTF8.

Then I'd be good to go, I think! I can then use the word boundaries to inject the tags, and use the parsed string for my secondary process (which I need a UTF8 string for).

_______________________________________________
Pyparsing-users mailing list
Pyp...@li...
https://lists.sourceforge.net/lists/listinfo/pyparsing-users
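
For the two follow-up steps Geoff lists (dropping the tags, and turning 'split' entities into spaces while decoding the rest), here is a minimal sketch. It uses Python 3's standard-library html.parser rather than pyparsing, so it is a different technique from the transformString approach recommended in the reply. The SPLIT_ENTITIES set, the TextExtractor class, the plain_text helper and the sample markup are all hypothetical names and data chosen for illustration; none of them come from the thread.

```python
import html
from html.parser import HTMLParser

# Hypothetical list of entities that should act as word separators.
SPLIT_ENTITIES = {"ndash", "mdash"}


class TextExtractor(HTMLParser):
    """Collect text content, dropping tags and resolving entities."""

    def __init__(self):
        # convert_charrefs=False so that entity references are passed to
        # handle_entityref/handle_charref instead of being decoded silently.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_data(self, data):
        # Plain text between tags is kept unchanged.
        self.parts.append(data)

    def handle_entityref(self, name):
        # Named entity such as &mdash; or &rsquo;
        if name in SPLIT_ENTITIES:
            self.parts.append(" ")
        else:
            self.parts.append(html.unescape("&%s;" % name))

    def handle_charref(self, name):
        # Numeric entity such as &#8217; or &#x2019;
        self.parts.append(html.unescape("&#%s;" % name))


def plain_text(markup):
    """Return the text of markup with tags removed and entities handled."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


# Made-up sample, loosely modelled on the paragraph quoted in the thread.
sample = ('<p class="para">I have <span class="italic">National Geographic'
          '</span>&rsquo;s at home&mdash;many.</p>')
print(plain_text(sample))
```

The result is an ordinary Python str; calling .encode("utf-8") on it would give UTF-8 bytes for a secondary process that needs them.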