Re: [Pyparsing] HTML Injection OR Word boundary detection from HTML

Geoff -

Congratulations on your first steps with pyparsing. You have found
scanString and how it returns the start and end locations of each match.
Pyparsing also includes transformString, a wrapper around scanString that
does the kind of injection you are attempting.  transformString runs the
parse actions attached to the expression, and a parse action can modify or
enhance the matched text by returning a different string than the one passed
in through its tokens argument.  See how I've added a parse action to a
slightly different version of your word expression:

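    # Assumes "from pyparsing import *" and the html string from your script.
    # Words start with a letter and may not contain tag or entity delimiters.
    word = Word(alphas, printables, excludeChars='<>&')
    word.ignore(anyOpenTag)
    word.ignore(anyCloseTag)
    word.ignore(commonHTMLEntity)

    tagnum = 0
    def addMarkTags(tokens):
        # Wrap each matched word in a sequentially numbered <mark> tag.
        global tagnum
        tagnum += 1
        return "<mark id='%d'>%s</mark>" % (tagnum, tokens[0])
    word.setParseAction(addMarkTags)

    print(word.transformString(html))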

This will print:

<h2 class="chapterNumber"><span class="bold"><mark
id='1'>S</mark></span><mark id='2'>ome</mark> <mark id='3'>Book</mark></h2>

<p class="para"><mark id='4'>I</mark> <mark id='5'>have</mark> <mark
id='6'>so</mark> <mark id='7'>many</mark> <span class="italic"><mark
id='8'>National</mark> <mark id='9'>Geographic</mark></span>&<mark
id='10'>rsquo;s</mark> <mark id='11'>at</mark> <mark
id='12'>home.</mark></p>
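
To make the wrapper relationship concrete, here is a minimal self-contained
sketch (separate from the code above).  It first splices the output by hand
from the start and end locations that scanString reports, then lets
transformString do the same work.  The sample string and the make_word helper
are made up for illustration, since the HTML from the earlier post isn't
repeated here.  It is written for Python 3 and assumes a pyparsing release
that still accepts the camelCase method names.

```python
from itertools import count

from pyparsing import Word, alphas, printables, anyOpenTag, anyCloseTag

# Hypothetical sample input, not the HTML from the earlier post.
sample = "<p>I have <i>many</i> books</p>"


def make_word():
    """Build a fresh word expression whose parse action numbers from 1."""
    ids = count(1)
    word = Word(alphas, printables, excludeChars='<>&')
    word.ignore(anyOpenTag)
    word.ignore(anyCloseTag)

    def add_mark_tags(tokens):
        # Returning a string replaces the matched token.
        return "<mark id='%d'>%s</mark>" % (next(ids), tokens[0])

    word.setParseAction(add_mark_tags)
    return word


# Manual splice: copy the untouched text before each match, then the
# token as rewritten by the parse action.
pieces = []
last_end = 0
for tokens, start, end in make_word().scanString(sample):
    pieces.append(sample[last_end:start])
    pieces.append(tokens[0])
    last_end = end
pieces.append(sample[last_end:])
manual = "".join(pieces)

# transformString does the same bookkeeping internally.
automatic = make_word().transformString(sample)

print(manual)
print(automatic)
```

Both approaches should print the same string; transformString just saves you
the bookkeeping of tracking the end of the previous match.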

I think transformString is the avenue to follow for this project.

-- Paul

-----Original Message-----
From: Geoff Jukes [mailto:fo...@ju...] 
Sent: Tuesday, January 15, 2013 3:01 PM
To: pyp...@li...
Subject: Re: [Pyparsing] HTML Injection OR Word boundary detection from HTML

With the following:

-----
from pyparsing import *
html = EXAMPLE_HTML_FROM_PREVIOUS_POST
word = Word(printables)
word.ignore(anyOpenTag)
word.ignore(anyCloseTag)
word.ignore(commonHTMLEntity)

text = word
for w, s, e in text.scanString(html):
    print('%s between %s and %s [From HTML: %s]' % (w, s, e, html[s:e]))
-----

I get (with some omissions):

-----
....
['I'] between 2443 and 2444 [From HTML: I]
['have'] between 2445 and 2449 [From HTML: have]
['so'] between 2450 and 2452 [From HTML: so]
['many'] between 2453 and 2457 [From HTML: many]
['National'] between 2479 and 2487 [From HTML: National]
['Geographic</span>&rsquo;s'] between 2488 and 2513 [From HTML:
Geographic</span>&rsquo;s]
['at'] between 2514 and 2516 [From HTML: at]
['home.</p>'] between 2517 and 2526 [From HTML: home.</p>]
-----

Which is a *great* start. From here, if I could:

1) Suppress any HTML tags in the string
2) Check the HTML entities against a list of 'splits' (e.g. endash, emdash
etc.) and convert those to a space, otherwise convert the entity to UTF-8.

Then I'd be good to go I think! I can then use the word-boundaries to inject
the tags, and use the parsed string for my secondary process (which I need a
UTF-8 string for).
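
A rough sketch of what step 2 might look like, using only the Python 3
standard library rather than pyparsing.  The SPLIT_ENTITIES set and the
function names are assumptions made for illustration; ndash and mdash are the
standard HTML entity names for the en and em dash.  html.unescape returns a
Unicode string, which can be encoded to UTF-8 bytes afterwards if needed.

```python
import html
import re

# Assumption: entities that should act as word separators; adjust as needed.
SPLIT_ENTITIES = {"ndash", "mdash"}

# Matches named entities (&rsquo;) and numeric ones (&#8217;).
ENTITY_RE = re.compile(r"&(#?\w+);")


def convert_entity(match):
    """Replace a 'split' entity with a space, otherwise decode the entity."""
    if match.group(1) in SPLIT_ENTITIES:
        return " "
    return html.unescape(match.group(0))


def clean_entities(text):
    """Apply convert_entity to every entity reference in text."""
    return ENTITY_RE.sub(convert_entity, text)


cleaned = clean_entities("National Geographic&rsquo;s, pages 10&ndash;12")
print(cleaned)
utf8_bytes = cleaned.encode("utf-8")
```

This only covers the entity step; tags would still need to be skipped
separately, for example with the ignore expressions shown earlier.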
