Re: [Htmlparser-user] Finding a whole word


Jay,
The text you want can be obtained with the StringBean if its Collapse property is set to false.
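
For the counting step itself, here is a minimal sketch, assuming the extracted text is already in a String (for example, what the StringBean hands back with Collapse set to false). The names WordCount and countWord are made up for illustration and are not part of htmlparser; the sketch deliberately uses only the standard library and does not call the parser at all.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of htmlparser: counts whole-word
// occurrences of a word in text that has already been extracted.
final class WordCount {

    // Returns how many times `word` appears in `text` as a whole word,
    // i.e. delimited by word boundaries on both sides.
    static int countWord(String text, String word) {
        // Pattern.quote keeps any regex metacharacters in `word` literal.
        Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b");
        Matcher m = p.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Assumed inputs: the text as a browser renders it, and the text
        // with a stray space where the <font> tag was.
        String asRendered = "AAAAA BBBBB AAAAA BBBBB AAAAA";
        String withExtraSpace = "AAAAA BBBBB AAA AA BBBBB AAAAA";

        System.out.println(countWord(asRendered, "AAAAA"));
        System.out.println(countWord(withExtraSpace, "AAAAA"));
    }
}
```

With the text as the browser shows it, the count for AAAAA comes out as three; with the stray space between AAA and AA, only the first and last occurrences match, which is the count of two reported in the original message.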

When Collapse is true, there is a bug in the StringBean.
I've logged this as bug #1496863, "StringBean collapse() adds extra whitespace"
<http://sourceforge.net/tracker/index.php?func=detail&aid=1496863&group_id=24399&atid=381399>,
so you can track it.
Derrick

Jay Kim wrote:

> Hi,
>
> I’m trying to get the word count using htmlparser, but it doesn’t seem 
> to be able to handle the following example.
>
> Let’s say the source html looks like this:
>
> <HTML>
> <head>
> <title>Test HTML</title>
> </head>
> <body>
> <p>AAAAA BBBBB AAA<font color='red'>AA</font> BBBBB AAAAA</p>
> </body>
> </HTML>
>
> And, if you load it in a browser, you’ll see the word ‘AAAAA’ three 
> times.
>
> But, if you parse this html, it returns the following nodes:
>
> AAAAA BBBBB AAA AA BBBBB AAAAA
>
> So, it breaks down the second ‘AAAAA’ into two words because of the 
> font tag in the middle. And, the count of ‘AAAAA’ in the parsed text 
> would be “2”.
>
> Is there any way that I can get the same text/string/word that I see 
> on the browser?
>
> Thanks,
>
> Jay
>