[Htmlparser-user] RE: [Htmlparser-user] Finding a whole word
From: Jay K. <jy...@eq...> - 2006-05-30 17:45:49
Derrick,

Thank you so much for your quick response and for getting back to me with the solution.

Now that I'm able to correctly count the number of times a word appears in an HTML file, my next task is to find the offset (start position) of each word. I'm guessing that I probably have to use NodeVisitor with StringBean, but I'd like to get some guidelines before I dig into the APIs.

So, for the following sample HTML:

<HTML>
<head>
<title>Test HTML</title>
</head>
<body>
<p>AAAAA BBBBB AAA<font color='red'>AA</font> BBBBB AAAAA</p>
</body>
</HTML>

If I search for 'AAAAA', I want to get three matches with their starting positions (offsets), such as:

Match 1 offset = 58
Match 2 offset = 70
Match 3 offset = 108

Could you show me how to achieve this?

Thanks a lot,
Jay

-----Original Message-----
From: htm...@li... [mailto:htm...@li...] On Behalf Of Derrick Oswald
Sent: Monday, May 29, 2006 4:45 AM
To: htm...@li...
Subject: Re: [Htmlparser-user] Finding a whole word

Jay

The text you want can be obtained with the StringBean if Collapse is false. When collapse is true, there is a bug in the StringBean. I've logged this as bug #1496863 StringBean collapse() adds extra whitespace
<http://sourceforge.net/tracker/index.php?func=detail&aid=1496863&group_id=24399&atid=381399>
so you can track it.

Derrick

Jay Kim wrote:
> Hi,
>
> I'm trying to get the word count using htmlparser, but it doesn't seem
> to be able to handle the following example.
>
> Let's say the source html looks like this:
>
> <HTML>
> <head>
> <title>Test HTML</title>
> </head>
> <body>
> <p>AAAAA BBBBB AAA<font color='red'>AA</font> BBBBB AAAAA</p>
> </body>
> </HTML>
>
> And, if you load it in a browser, you'll see the word 'AAAAA' three
> times.
>
> But, if you parse this html, it returns the following nodes:
>
> AAAAA BBBBB AAA AA BBBBB AAAAA
>
> So, it breaks down the second 'AAAAA' into two words because of the
> font tag in the middle.
> And, the word count from the parsed text would be "2".
>
> Is there any way that I can get the same text/string/word that I see
> on the browser?
>
> Thanks,
>
> Jay

_______________________________________________
Htmlparser-user mailing list
Htm...@li...
https://lists.sourceforge.net/lists/listinfo/htmlparser-user
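
For illustration, the offset question above can be sketched in plain Java without htmlparser. This is only a rough sketch under stated assumptions, not the library's approach: it drops everything between '<' and '>', remembers the source offset of every visible character, and then searches the visible text for whole-word matches. The class name WordOffsets and the method find are hypothetical. It does not decode entities, does not skip comments or scripts, and does not treat block-level tags as word breaks, so a real solution built on the parser's own node positions would need to handle those cases.

```java
import java.util.ArrayList;
import java.util.List;

public class WordOffsets {

    /**
     * Returns the offsets in the original HTML at which each whole-word
     * match of `word` starts. Hypothetical helper, not part of htmlparser.
     */
    public static List<Integer> find(String html, String word) {
        List<Integer> offsets = new ArrayList<>();
        if (word.isEmpty()) {
            return offsets; // nothing sensible to search for
        }

        StringBuilder text = new StringBuilder();  // visible text with tags removed
        List<Integer> source = new ArrayList<>();  // source.get(i) = html offset of text char i
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;
            } else if (c == '>' && inTag) {
                inTag = false;
            } else if (!inTag) {
                text.append(c);
                source.add(i);
            }
        }

        int from = 0;
        while (true) {
            int at = text.indexOf(word, from);
            if (at < 0) {
                break;
            }
            int end = at + word.length();
            // A match counts only if it is not glued to a letter or digit on either side.
            boolean startOk = at == 0 || !Character.isLetterOrDigit(text.charAt(at - 1));
            boolean endOk = end == text.length() || !Character.isLetterOrDigit(text.charAt(end));
            if (startOk && endOk) {
                offsets.add(source.get(at)); // map back to the position in the source HTML
            }
            from = at + 1;
        }
        return offsets;
    }
}
```

Because inline tags such as the font element are removed before the search, a word that is split by markup is still found as one word, and the reported offset is where its first character sits in the source. The offsets depend on the exact bytes of the input, including line breaks and indentation.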