[Pyparsing] HTML Injection OR Word boundary detection from HTML
From: Geoff J. <fo...@ju...> - 2013-01-15 19:36:28
Hi,

First, sorry for the long email and the lack of PyParsing example code.

I'm trying to modify some HTML by wrapping 'words' in 'MARK' tags. I've tried BeautifulSoup, HTMLParser, and regexes, all with limited success. I think PyParsing is the right solution, since all the other tools are geared more toward scraping or extracting data from HTML.

I hate asking questions without some code, but I'm so new to PyParsing that I'm really not sure where to start. My gut tells me it's the right tool for the job, though. Can anyone help me?

Take the following HTML as an example:

-----
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-US">
<head>
  <title>Some Book</title>
  <link rel="stylesheet" type="application/vnd.adobe-page-template+xml" href="page-template.xpgt"/>
  <style>
    .italic {font-style: italic;}
    .bold {font-style: bold;}
  </style>
</head>
<body class="text" id="text">
  <div class="chapter" id="ch02">
    <div class="chapterHead">
      <h2 class="chapterNumber"><span class="bold">S</span>ome Book</h2>
    </div>
    <div class="chapterBody">
      <p class="para">I have so many <span class="italic">National Geographic</span>’s at home.</p>
    </div>
  </div>
</body>
</html>
-----

There are 2 lines of interest:

-----
<p class="para">I have so many <span class="italic">National Geographic</span>’s at home.</p>
<h2 class="chapterNumber"><span class="bold">S</span>ome Book</h2>
-----

I am trying to wrap the 'words' in 'MARK' tags.
So my 'perfect' result would be:

-----
<h2 class="chapterNumber"><mark id='1'><span class="bold">S</span>ome</mark> <mark id='1'>Book</mark></h2>
<p class="para"><mark id='1'>I</mark> <mark id='2'>have</mark> <mark id='3'>so</mark> <mark id='4'>many</mark> <mark id='5'><span class="italic">National</span></mark> <mark id='6'><span class="italic">Geographic</span>’s</mark> <mark id='7'>at</mark> <mark id='8'>home</mark>.</p>
-----

Now there is obviously some complexity in there, over and above the 'mark' injection. For example, the word "Geographic's" is split by a close-span whose opening tag came before the word 'National', so the formatting has to be 'replayed' during the injection. There are also some differences in the 'MARK' location: sometimes it is 'tight' to the word, sometimes it has 'SPAN' tags inside.

I don't expect PyParsing to be able to do that for me (I would love it if it could!), so I am happy to have PyParsing generate 'broken' HTML that I will fix up in post-processing. The following output would therefore be acceptable:

-----
<h2 class="chapterNumber"><span class="bold"><mark id='1'>S</span>ome</mark> <mark id='1'>Book</mark></h2>
<p class="para"><mark id='1'>I</mark> <mark id='2'>have</mark> <mark id='3'>so</mark> <mark id='4'>many</mark> <span class="italic"><mark id='5'>National</mark> <mark id='6'>Geographic</span>’s</mark> <mark id='7'>at</mark> <mark id='8'>home</mark>.</p>
-----

Note that the tags around "National Geographic's" are now 'broken' (improperly nested).

A 'word' can be described as: any text that is terminated by a space or punctuation, excluding the terminator itself.

An added complexity in my full file is that quote marks could also terminate a word, but only when the quote mark is not an apostrophe (e.g. "I'm excited" has 2 words: I'm, excited). The quotes could also be HTML entities. But again, I am happy to deal with that in post-processing.
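To make the word definition and the 'acceptable' output above concrete, here is a rough sketch in plain Python using only the standard-library `re` module. It is not PyParsing and not a solution, just an illustration of the behaviour being asked for. The names `TOKEN`, `word_spans`, and `wrap_words` are hypothetical. It assumes that tags never terminate a word, that any apostrophe (straight or curly) is part of a word, that HTML entities terminate a word, and that it is fed body fragments rather than a whole file (it would otherwise treat the contents of `<style>` as words).

```python
import re

# Hypothetical tokenizer: a tag, an HTML entity, a run of word characters
# (letters, digits, underscore, straight or curly apostrophe), or any other
# single character (whitespace, punctuation).
TOKEN = re.compile(
    r"(?P<tag><[^>]+>)|(?P<entity>&#?\w+;)|(?P<word>[\w'’]+)|(?P<other>.)",
    re.DOTALL,
)

def word_spans(html):
    """Return (start, end, text) for each whole word in html.

    start and end are offsets into html, so html[start:end] may contain
    interspersed tags; text is the word with those tags left out.
    """
    spans = []
    current = None  # [start, end, text] of the word being assembled
    for m in TOKEN.finditer(html):
        if m.lastgroup == "word":
            if current is None:
                current = [m.start(), m.end(), m.group()]
            else:
                # Same word continuing after an interspersed tag.
                current[1] = m.end()
                current[2] += m.group()
        elif m.lastgroup in ("other", "entity"):
            # Whitespace, punctuation, or an entity terminates the word.
            if current is not None:
                spans.append(tuple(current))
                current = None
        # Tags neither extend nor terminate the current word.
    if current is not None:
        spans.append(tuple(current))
    return spans

def wrap_words(html):
    """Inject <mark> tags at the word offsets, producing 'broken' HTML."""
    out = html
    # Work backwards so earlier offsets stay valid after each insertion.
    for n, (start, end, _) in reversed(list(enumerate(word_spans(html), 1))):
        out = (out[:start] + "<mark id='%d'>" % n
               + out[start:end] + "</mark>" + out[end:])
    return out

if __name__ == "__main__":
    line = ('<p class="para">I have so many <span class="italic">'
            'National Geographic</span>’s at home.</p>')
    for span in word_spans(line):
        print(span)
    print(wrap_words(line))
```

One known limitation of this sketch: two words separated only by tags (for example `Book</h2><p>I`) would be merged into one, because tags are never treated as terminators.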
An acceptable alternative would be for PyParsing to return the start and end locations of 'whole' words (taking into account any interspersed HTML, like the close-span in "Geographic's"). Then I can 'shuffle' the MARK tag injection in post-processing.

Again, I'm sorry for not posting PyParsing example code. I'm still wrapping my head around how PyParsing works, so if anyone can give me pointers, I'm happy to do the legwork myself! I'm going to spend all day trying to work this out.

Many thanks in advance,
Geoff