Re: [Htmlparser-user] Finding a whole word
From: Derrick O. <Der...@Ro...> - 2006-05-29 11:45:06
Jay,

The text you want can be obtained with the StringBean if Collapse is false.
When Collapse is true, there is a bug in the StringBean. I've logged this as
bug #1496863, "StringBean collapse() adds extra whitespace"
<http://sourceforge.net/tracker/index.php?func=detail&aid=1496863&group_id=24399&atid=381399>
so you can track it.

Derrick

Jay Kim wrote:
> Hi,
>
> I’m trying to get the word count using htmlparser, but it doesn’t seem
> to be able to handle the following example.
>
> Let’s say the source html looks like this:
>
> <HTML>
> <head>
> <title>Test HTML</title>
> </head>
> <body>
> <p>AAAAA BBBBB AAA<font color='red'>AA</font> BBBBB AAAAA</p>
> </body>
> </HTML>
>
> And, if you load it in a browser, you’ll see the word ‘AAAAA’ three
> times.
>
> But, if you parse this html, it returns following nodes:
>
> AAAAA BBBBB AAA AA BBBBB AAAAA
>
> So, it breaks down the second ‘AAAAA’ into two words because of the
> font tag in the middle. And, the word count from the parsed text would
> be “2”.
>
> Is there any way that I can get the same text/string/word that I see
> on the browser?
>
> Thanks,
>
> Jay
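
To make the effect of the extra whitespace concrete, here is a minimal sketch in plain Java that counts whole-word matches in already-extracted text. It does not call htmlparser at all: the class name WordCount and the method countWord are hypothetical, and the two input strings are simply the rendered text and the parsed text from the example above. In practice the input would presumably be whatever the StringBean returns with Collapse set to false.

```java
import java.util.Arrays;

// Hypothetical helper, not part of htmlparser: counts whole-word matches
// in text that has already been extracted from the page.
class WordCount {
    // Counts exact, case-sensitive matches of `word` among the
    // whitespace-separated tokens of `text`.
    static long countWord(String text, String word) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return 0;
        }
        return Arrays.stream(trimmed.split("\\s+"))
                .filter(word::equals)
                .count();
    }

    public static void main(String[] args) {
        // Text as a browser renders it: the font tag adds no whitespace.
        String asRendered = "AAAAA BBBBB AAAAA BBBBB AAAAA";
        // Text with the extra space inserted at the font tag boundary.
        String withExtraSpace = "AAAAA BBBBB AAA AA BBBBB AAAAA";

        System.out.println(countWord(asRendered, "AAAAA"));     // prints 3
        System.out.println(countWord(withExtraSpace, "AAAAA")); // prints 2
    }
}
```

Splitting on whitespace is a deliberate simplification; it treats punctuation as part of a word, so real pages would likely need a more careful tokenizer.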