Thread: [Pyparsing] HTML Injection OR Word boundary detection from HTML

Hi,

First - Sorry for the long email and lack of PyParsing example code.

I'm trying to modify some HTML, wrapping 'words' in 'MARK' tags. I've tried
BeautifulSoup, HTMLParser, and regexes, all with limited success. I think
PyParsing is the right solution - all the other solutions are more for
scraping/extracting data from HTML.

I hate asking questions without some code, but I'm so new to PyParsing that
I really am not sure where to start. My gut tells me it's the right tool for
the job though. Can anyone help me?

Take the following HTML as an example:

-----

<?xml version="1.0" encoding="UTF-8"?>

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-US">

                <head>

                                <title>Some Book</title>

                                <link rel="stylesheet"
type="application/vnd.adobe-page-template+xml" href="page-template.xpgt"/>

                                <style>

                                                .italic {font-style:
italic;}

                                                .bold {font-style: bold;}

                                </style>

                </head>

                <body class="text" id="text">

                                <div class="chapter" id="ch02">

                                                <div class="chapterHead">

                                                                <h2
class="chapterNumber"><span class="bold">S</span>ome Book</h2>

                                                </div>

                                                <div class="chapterBody">

                                                                <p
class="para">I have so many <span class="italic">National
Geographic</span>&rsquo;s at home.</p>

                                                </div>

                                </div>

                </body>

</html>

-----

There are 2 lines of interest:

-----

<p class="para">I have so many <span class="italic">National
Geographic</span>&rsquo;s at home.</p>

<h2 class="chapterNumber"><span class="bold">S</span>ome Book</h2>

-----

I am trying to wrap the 'words' in 'MARK' tags. So my 'perfect' result would
be:

-----

<h2 class="chapterNumber"><mark id='1'><span class="bold">S</span>ome</mark>
<mark id='1'>Book</mark></h2>

<p class="para"><mark id='1'>I</mark> <mark id='2'>have</mark> <mark
id='3'>so</mark> <mark id='4'>many</mark> <mark id='5'><span
class="italic">National</span></mark> <mark id='6'><span
class="italic">Geographic</span>&rsquo;s</mark> <mark id='7'>at</mark> <mark
id='8'>home</mark>.</p>

-----

Now there is obviously some complexity in there, over and above the 'mark'
injection. For example, the word "Geographic's" is split by a close-span
whose span opened before the word 'National', so the formatting has to be
'replayed' during the injection. The 'MARK' location also varies: sometimes
it is 'tight' to the word, sometimes it has 'SPAN' tags inside.

I don't expect PyParsing to be able to do that for me (I would love it if it
could!) and so I am happy to have PyParsing generate 'broken' HTML that I
will fix up post-process. So the following output would be acceptable:

-----

<h2 class="chapterNumber"><span class="bold"><mark id='1'>S</span>ome</mark>
<mark id='1'>Book</mark></h2>

<p class="para"><mark id='1'>I</mark> <mark id='2'>have</mark> <mark
id='3'>so</mark> <mark id='4'>many</mark> <span class="italic"><mark
id='5'>National</mark> <mark id='6'>Geographic</span>&rsquo;s</mark> <mark
id='7'>at</mark> <mark id='8'>home</mark>.</p>

-----

Note that the markup around "National Geographic's" is now 'broken': the
mark and span tags are no longer properly nested. A 'word' can be described
as: any text that is terminated by a space or punctuation, excluding the
terminator. An added complexity in my full file is that quote marks could
also terminate a word, but only when they are not apostrophes (e.g.
"I'm excited" has 2 words: I'm, excited). And quotes could be HTML entities.
But again, I am happy to deal with that post-process.
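
As a rough illustration of the word rule described above, here is a minimal
sketch for plain text with no tags in it. It uses Python's standard `re`
module rather than PyParsing, and the names `find_words`, `APOS` and `SCANNER`
are made up for this example. It assumes ASCII letters and digits only, and it
assumes an apostrophe (either `'` or the `&rsquo;` / `&apos;` entities) belongs
to a word only when it sits between two word characters; any other quote or
entity ends the word.

```python
import re

# Assumption: these three spellings are the only apostrophes we care about.
APOS = r"(?:'|&rsquo;|&apos;)"

# First alternative: a word, optionally continued by apostrophe + more letters.
# Second alternative: any other entity, consumed so that its name
# (e.g. the "amp" in "&amp;") is never mistaken for a word.
SCANNER = re.compile(
    r"(?P<word>[A-Za-z0-9]+(?:" + APOS + r"[A-Za-z0-9]+)*)"
    r"|&\w+;"
)


def find_words(text):
    """Return the words in a tag-free string, without their terminators."""
    return [m.group("word") for m in SCANNER.finditer(text) if m.group("word")]


if __name__ == "__main__":
    print(find_words("I'm excited"))
    print(find_words("Geographic&rsquo;s at home."))
```

A quote that wraps a word, as in `'hello'`, is not followed by a letter, so
under this rule it terminates the word instead of joining it.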

An acceptable alternative would be for PyParsing to return the start and end
locations of 'whole' words (taking into account any interspersed HTML, like
the close-span in Geographic's). Then I can 'shuffle' the mark tag injection
post-process.
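
To make the "start and end locations" idea concrete, here is a hedged sketch
that works on a single fragment such as one of the two lines of interest. It
is written with the standard `re` module, not PyParsing, and `word_spans` and
`inject_marks` are hypothetical helper names. It assumes every tag it meets is
an inline tag that may sit inside a word (it does not distinguish `</span>`
from `</p>`), that no attribute value contains `>`, and that words are ASCII.
Marks are injected 'tight' to the text, so the result is the 'broken' nesting
described earlier, ready for a fix-up pass.

```python
import re

# One token per match; the final alternative guarantees nothing is skipped.
TOKEN = re.compile(
    r"(?P<tag><[^>]+>)"                # any tag: ignored, never ends a word
    r"|(?P<apos>'|&rsquo;|&apos;)"     # possible in-word apostrophe
    r"|(?P<chars>[A-Za-z0-9]+)"        # run of word characters
    r"|(?P<other>&\w+;|.)",            # other entity or character: ends a word
    re.DOTALL,
)


def word_spans(html):
    """Return (start, end, text) per word; offsets index into html."""
    spans, parts = [], []
    start = end = None
    pending = None  # apostrophe waiting to see whether a letter follows

    def flush():
        nonlocal parts, pending
        if parts:
            spans.append((start, end, "".join(parts)))
        parts, pending = [], None

    for m in TOKEN.finditer(html):
        kind = m.lastgroup
        if kind == "tag":
            continue
        if kind == "chars":
            if not parts:
                start = m.start()
            elif pending is not None:
                parts.append(pending)  # apostrophe was inside the word
            pending = None
            parts.append(m.group())
            end = m.end()
        elif kind == "apos" and parts and pending is None:
            pending = m.group()
        else:
            flush()
    flush()
    return spans


def inject_marks(html):
    """Wrap each span in <mark id='n'>...</mark>, numbering from 1."""
    out, pos = [], 0
    for n, (start, end, _text) in enumerate(word_spans(html), 1):
        out += [html[pos:start], "<mark id='%d'>" % n, html[start:end], "</mark>"]
        pos = end
    out.append(html[pos:])
    return "".join(out)


if __name__ == "__main__":
    frag = '<h2 class="chapterNumber"><span class="bold">S</span>ome Book</h2>'
    print(word_spans(frag))
    print(inject_marks(frag))
```

Because each span carries the raw source offsets, a later pass can decide
whether to move a mark outside an enclosing span or to replay the span inside
it. Running the sketch over a whole document would also mark text inside
`<title>` and `<style>`, so those regions would need to be skipped first.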

Again, I'm sorry for not posting example code - I'm still wrapping my head
around how PyParsing works. So if anyone can give me pointers, I'm happy to
do the legwork myself! I'm going to spend all day trying to work this out.

Many thanks in advance,

Geoff
