Hi,

I'm trying to get a word count using htmlparser, but it doesn't seem
to handle the following example.
Suppose the source HTML looks like this:

<HTML>
<head>
<title>Test HTML</title>
</head>
<body>
<p>AAAAA BBBBB AAA<font color='red'>AA</font> BBBBB AAAAA</p>
</body>
</HTML>

If you load it in a browser, you'll see the word 'AAAAA' three
times.
But if you parse this HTML, it returns the following text:

AAAAA BBBBB AAA AA BBBBB AAAAA

So it breaks the second 'AAAAA' into two words ('AAA' and 'AA') because
of the font tag in the middle, and the count of 'AAAAA' in the parsed
text comes out as 2 instead of 3.
Is there any way to get the same text, with the same word boundaries,
that I see in the browser?
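
To make the difference concrete, here is a minimal stand-alone sketch in plain Java. It does not use htmlparser's API at all: the class and helper names (WordCountSketch, stripTags, countWord) are made up for illustration, and the regex-based tag stripping is only an assumption standing in for a real parser. It shows that the count depends on whether text around a tag is joined with a space or with nothing.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordCountSketch {
    // Hypothetical helper: replace every tag with `separator`,
    // then collapse runs of whitespace into single spaces.
    static String stripTags(String html, String separator) {
        return html.replaceAll("<[^>]*>", separator)
                   .replaceAll("\\s+", " ")
                   .trim();
    }

    // Hypothetical helper: count whole-word occurrences of `word` in `text`.
    static int countWord(String text, String word) {
        Matcher m = Pattern.compile("\\b" + Pattern.quote(word) + "\\b").matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String body = "<p>AAAAA BBBBB AAA<font color='red'>AA</font> BBBBB AAAAA</p>";

        // Tags become spaces: matches the text I get from the parser.
        String spaced = stripTags(body, " ");
        // Tags become nothing: matches what the browser displays for this example.
        String joined = stripTags(body, "");

        System.out.println(spaced + " -> " + countWord(spaced, "AAAAA")); // count is 2
        System.out.println(joined + " -> " + countWord(joined, "AAAAA")); // count is 3
    }
}
```

Dropping every tag with no separator only works for this example because font is an inline tag. Block-level tags such as p or div do separate words in a browser, so a real solution would presumably need to insert whitespace for those and nothing for inline tags.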

Thanks,

Jay