[Htmlparser-user] RE: [Htmlparser-user] Finding a whole word


Derrick,

Thank you so much for your quick response and for getting back to me with
the solution.
Now that I'm able to correctly count the number of times a word appears in
an HTML file, my next task is to find the offset (start position) of
each occurrence. I'm guessing that I probably have to use NodeVisitor with
StringBean, but I'd like to get some guidelines before I dig into the
APIs.
So, for the following sample HTML:

<HTML>
<head>
<title>Test HTML</title>
</head>
<body>
<p>AAAAA BBBBB AAA<font color='red'>AA</font> BBBBB AAAAA</p>
</body>
</HTML>

If I search for 'AAAAA', I want to get three matches with their starting
positions (offsets), such as,
	Match 1 offset = 58
	Match 2 offset = 70
	Match 3 offset = 108

Could you show me how to achieve this?
Thanks a lot,
Jay
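
[Editor's sketch] One way to get source offsets for words that span inline tags is to keep, for every character of the extracted text, the position it came from in the original file. The Java below is only a minimal illustration of that idea using the standard library. It does not use the htmlparser API (no StringBean or NodeVisitor), and the class and method names are hypothetical. The hand-rolled tag skipping is an assumption that ignores comments, scripts, entities, and '<' or '>' inside attribute values, and it would also join text across block-level tags that have no whitespace between them.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of htmlparser.
class WordOffsets {
    /** Returns 0-based offsets in html where whole-word matches of word begin. */
    static List<Integer> find(String html, String word) {
        List<Integer> hits = new ArrayList<>();
        if (word.isEmpty()) {
            return hits;
        }
        StringBuilder text = new StringBuilder();  // characters outside tags
        List<Integer> srcPos = new ArrayList<>();  // srcPos.get(i) = offset in html of text.charAt(i)
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;
            } else if (c == '>' && inTag) {
                inTag = false;
            } else if (!inTag) {
                text.append(c);   // nothing is inserted where a tag was removed
                srcPos.add(i);
            }
        }
        int from = 0;
        while (true) {
            int at = text.indexOf(word, from);
            if (at < 0) {
                break;
            }
            int end = at + word.length();
            // Accept the match only if it is not embedded in a longer word.
            boolean startOk = at == 0 || !Character.isLetterOrDigit(text.charAt(at - 1));
            boolean endOk = end == text.length() || !Character.isLetterOrDigit(text.charAt(end));
            if (startOk && endOk) {
                hits.add(srcPos.get(at));  // map back to the position in the source
            }
            from = at + 1;
        }
        return hits;
    }
}
```

Calling WordOffsets.find(html, "AAAAA") on the sample above reports three matches, including the one split by the font tag. The exact numbers depend on whether counting starts at 0 or 1 and on the line endings in the file, so they may not equal the figures quoted in the message.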

-----Original Message-----
From: htm...@li...
[mailto:htm...@li...] On Behalf Of
Derrick Oswald
Sent: Monday, May 29, 2006 4:45 AM
To: htm...@li...
Subject: Re: [Htmlparser-user] Finding a whole word

Jay
The text you want can be obtained with the StringBean if Collapse is
false.

When collapse is true, there is a bug in the StringBean.
I've logged this as bug #1496863, "StringBean collapse() adds extra
whitespace"
<http://sourceforge.net/tracker/index.php?func=detail&aid=1496863&group_id=24399&atid=381399>
so you can track it.
Derrick
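
[Editor's sketch] The difference at stake in this thread, whether whitespace ends up where an inline tag used to be, can be shown with the standard library alone. The Java below is a stand-in, not the StringBean: the regex tag stripper and the names WholeWordCount, count, and tagReplacement are assumptions made for illustration, and the regex does not cope with comments, scripts, or '>' inside attribute values.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of htmlparser.
class WholeWordCount {
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    /**
     * Counts whole-word occurrences of word after every tag in html
     * has been replaced by tagReplacement.
     */
    static int count(String html, String word, String tagReplacement) {
        String text = TAG.matcher(html)
                .replaceAll(Matcher.quoteReplacement(tagReplacement));
        Matcher m = Pattern.compile("\\b" + Pattern.quote(word) + "\\b").matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }
}
```

For the paragraph in Jay's example, count(p, "AAAAA", "") gives 3 because the text on both sides of the font tag is joined, while count(p, "AAAAA", " ") gives 2 because the space put in place of the tag splits the second word.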

Jay Kim wrote:

> Hi,
>
> I'm trying to get the word count using htmlparser, but it doesn't seem
> to be able to handle the following example.
>
> Let's say the source html looks like this:
>
> <HTML>
>
> <head>
>
> <title>Test HTML</title>
>
> </head>
>
> <body>
>
> <p>AAAAA BBBBB AAA<font color='red'>AA</font> BBBBB AAAAA</p>
>
> </body>
>
> </HTML>
>
> And, if you load it in a browser, you'll see the word 'AAAAA' three
> times.
>
> But, if you parse this html, it returns the following nodes:
>
> AAAAA BBBBB AAA AA BBBBB AAAAA
>
> So, it breaks down the second 'AAAAA' into two words because of the
> font tag in the middle. And, the word count from the parsed text would
> be "2".
>
> Is there any way that I can get the same text/string/word that I see
> on the browser?
>
> Thanks,
>
> Jay
>

_______________________________________________
Htmlparser-user mailing list
Htm...@li...
https://lists.sourceforge.net/lists/listinfo/htmlparser-user