Thread: [Pyparsing] HTML Injection OR Word boundary detection from HTML
From: Geoff J. <fo...@ju...> - 2013-01-15 19:36:28
Hi,

First, sorry for the long email and the lack of PyParsing example code.

I'm trying to modify some HTML, wrapping 'words' in 'MARK' tags. I've tried BeautifulSoup, HTMLParser, and regexes, all with limited success. I think PyParsing is the right solution, since the other tools are geared more toward scraping/extracting data from HTML.

I hate asking questions without some code, but I'm so new to PyParsing that I'm really not sure where to start. My gut tells me it's the right tool for the job, though. Can anyone help me?

Take the following HTML as an example:

-----
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-US">
<head>
<title>Some Book</title>
<link rel="stylesheet" type="application/vnd.adobe-page-template+xml" href="page-template.xpgt"/>
<style>
.italic {font-style: italic;}
.bold {font-style: bold;}
</style>
</head>
<body class="text" id="text">
<div class="chapter" id="ch02">
<div class="chapterHead">
<h2 class="chapterNumber"><span class="bold">S</span>ome Book</h2>
</div>
<div class="chapterBody">
<p class="para">I have so many <span class="italic">National Geographic</span>&rsquo;s at home.</p>
</div>
</div>
</body>
</html>
-----

There are 2 lines of interest:

-----
<p class="para">I have so many <span class="italic">National Geographic</span>&rsquo;s at home.</p>
<h2 class="chapterNumber"><span class="bold">S</span>ome Book</h2>
-----

I am trying to wrap the 'words' in 'MARK' tags. So my 'perfect' result would be:

-----
<h2 class="chapterNumber"><mark id='1'><span class="bold">S</span>ome</mark> <mark id='1'>Book</mark></h2>
<p class="para"><mark id='1'>I</mark> <mark id='2'>have</mark> <mark id='3'>so</mark> <mark id='4'>many</mark> <mark id='5'><span class="italic">National</span></mark> <mark id='6'><span class="italic">Geographic</span>&rsquo;s</mark> <mark id='7'>at</mark> <mark id='8'>home</mark>.</p>
-----

Now there is obviously some complexity in there, over and above the 'mark' injection. For example, the word "Geographic's" is split by a close-span whose open-span came before the word 'National'. So the formatting is 'replayed' during the injection. There are also some differences in the 'MARK' location: sometimes 'tight' to the word, sometimes with 'SPAN' tags inside.

I don't expect PyParsing to be able to do that for me (I would love it if it could!), so I am happy to have PyParsing generate 'broken' HTML that I will fix up in a post-process. So the following output would be acceptable:

-----
<h2 class="chapterNumber"><span class="bold"><mark id='1'>S</span>ome</mark> <mark id='1'>Book</mark></h2>
<p class="para"><mark id='1'>I</mark> <mark id='2'>have</mark> <mark id='3'>so</mark> <mark id='4'>many</mark> <span class="italic"><mark id='5'>National</mark> <mark id='6'>Geographic</span>&rsquo;s</mark> <mark id='7'>at</mark> <mark id='8'>home</mark>.</p>
-----

Note that the tags around "National Geographic's" are now 'broken' (the mark and span elements overlap instead of nesting).

A 'Word' can be described as: any text that is terminated by a space or punctuation, excluding the terminator itself. An added complexity in my full file is that quote marks could also terminate a word, but only when the quote is not an apostrophe (e.g. "I'm excited" has 2 words: "I'm" and "excited"). And quotes could be HTML entities. But again, I am happy to deal with that in a post-process.

An acceptable alternative would be for PyParsing to return the start and end locations of 'whole' words (taking into account any interspersed HTML, like the close-span in Geographic's); then I can 'shuffle' the Mark tag injection in a post-process.

Again, I'm sorry for not posting example code. I'm still wrapping my head around how PyParsing works. So if anyone can give me pointers, I'm happy to do the legwork myself! I'm going to spend all day trying to work this out.

Many thanks in advance,
Geoff
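The 'start and end locations of whole words' alternative can be roughed out without pyparsing at all. The following is only a sketch using the standard library's re module, assuming Python 3 and a body fragment (it makes no attempt to skip <head>, <title> or <style> text). The helper name word_spans, the list of inline tags allowed inside a word, and the accepted apostrophe spellings are all assumptions made for illustration.

```python
import re

ALNUM = r"[^\W_]+"                                   # a run of letters/digits
INLINE_TAG = r"</?(?:span|em|strong|i|b)\b[^>]*>"    # assumed inline-tag list
APOSTROPHE = r"(?:['\u2019]|&rsquo;|&apos;)"         # assumed apostrophe spellings
# Things that may sit inside a word without ending it.
JOINER = "(?:" + INLINE_TAG + "|" + APOSTROPHE + ")+"

SCANNER = re.compile(
    r"(?P<tag><[^>]*>)"          # any tag at a word boundary: skipped
    r"|(?P<entity>&#?\w+;)"      # any other entity: skipped, so it breaks words
    r"|(?P<word>" + ALNUM + "(?:" + JOINER + ALNUM + ")*)"
)


def word_spans(fragment):
    """Yield (start, end) offsets of each whole word in an HTML fragment."""
    for match in SCANNER.finditer(fragment):
        if match.group("word") is not None:
            yield match.span("word")


if __name__ == "__main__":
    sample = ('<p class="para">I have so many <span class="italic">National '
              'Geographic</span>&rsquo;s at home.</p>')
    for start, end in word_spans(sample):
        print(start, end, sample[start:end])
```

For the sample paragraph this gives one span per word, including a single span that covers Geographic</span>&rsquo;s, while a block-level close tag such as </h2> still ends a word.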
From: Geoff J. <fo...@ju...> - 2013-01-15 21:01:30
With the following:

-----
from pyparsing import *

html = EXAMPLE_HTML_FROM_PREVIOUS_POST

# Match runs of printable characters, skipping tags and common entities.
word = Word(printables)
word.ignore(anyOpenTag)
word.ignore(anyCloseTag)
word.ignore(commonHTMLEntity)

text = word

# scanString yields (tokens, start, end) for each match.
for w, s, e in text.scanString(html):
    print('%s between %s and %s [From HTML: %s]' % (w, s, e, html[s:e]))
-----

I get (with some omissions):

-----
....
['I'] between 2443 and 2444 [From HTML: I]
['have'] between 2445 and 2449 [From HTML: have]
['so'] between 2450 and 2452 [From HTML: so]
['many'] between 2453 and 2457 [From HTML: many]
['National'] between 2479 and 2487 [From HTML: National]
['Geographic</span>&rsquo;s'] between 2488 and 2513 [From HTML: Geographic</span>&rsquo;s]
['at'] between 2514 and 2516 [From HTML: at]
['home.</p>'] between 2517 and 2526 [From HTML: home.</p>]
-----

Which is a *great* start. From here, if I could:

1) Suppress any HTML tags in the string
2) Check the HTML entities against a list of 'splits' (e.g. endash, emdash etc.) and convert those to a space, otherwise convert the entity to UTF8.

Then I'd be good to go, I think! I can then use the word boundaries to inject the tags, and use the parsed string for my secondary process (which I need a UTF8 string for).
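The two clean-up steps listed above could be applied to each matched string after the fact. This is a minimal sketch using only the standard library, assuming Python 3 (html.unescape); the helper name clean_token and the particular list of 'split' entities are assumptions, not something pyparsing provides.

```python
import re
from html import unescape

TAG = re.compile(r"<[^>]*>")
# Entities treated as word breaks. This list is an assumption for illustration.
SPLIT_ENTITIES = re.compile(r"&(?:ndash|mdash|nbsp);")


def clean_token(raw):
    """Return the plain-text form of one matched token."""
    text = TAG.sub("", raw)                # 1) drop any tags inside the match
    text = SPLIT_ENTITIES.sub(" ", text)   # 2a) 'split' entities become a space
    return unescape(text)                  # 2b) remaining entities become characters


if __name__ == "__main__":
    print(clean_token("Geographic</span>&rsquo;s"))
    print(clean_token("home.</p>"))
```

The result is a Python 3 str; calling .encode("utf-8") on it gives UTF-8 bytes if the secondary process needs them.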
From: Paul M. <pt...@au...> - 2013-01-16 04:11:57
Geoff -

Congratulations on your first steps with pyparsing. You have found scanString and how it returns the start and end locations of each match. Pyparsing also includes transformString, which is a wrapper around scanString for doing the kind of injection you describe. transformString applies any parse actions attached to the expression; a parse action can modify or enhance the parsed text by returning a different string from the one passed in via the tokens argument. See how I've added a parse action to a slightly different version of your word expression:

# A word starts with a letter and may not contain '<', '>' or '&'.
word = Word(alphas, printables, excludeChars='<>&')
word.ignore(anyOpenTag)
word.ignore(anyCloseTag)
word.ignore(commonHTMLEntity)

tagnum = 0
def addMarkTags(tokens):
    # Number each match and wrap it in a <mark> tag.
    global tagnum
    tagnum += 1
    return "<mark id='%d'>%s</mark>" % (tagnum, tokens[0])
word.setParseAction(addMarkTags)

print(word.transformString(html))

This will print:

<h2 class="chapterNumber"><span class="bold"><mark id='1'>S</mark></span><mark id='2'>ome</mark> <mark id='3'>Book</mark></h2>
<p class="para"><mark id='4'>I</mark> <mark id='5'>have</mark> <mark id='6'>so</mark> <mark id='7'>many</mark> <span class="italic"><mark id='8'>National</mark> <mark id='9'>Geographic</mark></span>&<mark id='10'>rsquo;s</mark> <mark id='11'>at</mark> <mark id='12'>home.</mark></p>

I think transformString is the avenue to follow for this project.

-- Paul
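For anyone who would rather do the injection by hand from the (start, end) locations that scanString reports, the splicing step looks roughly like this. It is a sketch of the general idea only, not pyparsing's actual implementation, and the helper name inject_marks is hypothetical.

```python
def inject_marks(text, spans):
    """Wrap each (start, end) span of text in a numbered <mark> tag.

    spans must be sorted by start offset and must not overlap.
    """
    out = []
    last = 0
    for number, (start, end) in enumerate(spans, 1):
        out.append(text[last:start])     # copy the text between matches as-is
        out.append("<mark id='%d'>%s</mark>" % (number, text[start:end]))
        last = end
    out.append(text[last:])              # copy whatever follows the last match
    return "".join(out)


if __name__ == "__main__":
    print(inject_marks("<p>at home.</p>", [(3, 5), (6, 10)]))
```

Because the spans are passed in explicitly, they can be adjusted first (for example merged or shifted around a tag) before any markup is written.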