[Htmlparser-user] RE: [Htmlparser-user] RE: [Htmlparser-user] Fin
From: Jay K. <jy...@eq...> - 2006-05-30 20:15:49
Let me describe in more detail the problem with using StringBean as a NodeVisitor. Here is my code snippet:

    private class TestVisitor extends StringBean {
        @Override
        public void visitStringNode(Text text) {
            System.out.println("text=" + text.getText());
        }
    }

    TestVisitor visitor = new TestVisitor();
    visitor.setCollapse(false);
    htmlParser.visitAllNodesWith(visitor);

If I feed it the sample HTML below, the visitStringNode() method does not see the second 'AAAAA' as one word; instead, it is split into two words ('AAA' and 'AA'), which is basically the same problem that I described in my first email.

Please let me know.

Thanks,
Jay

-----Original Message-----
From: htm...@li... [mailto:htm...@li...] On Behalf Of Jay Kim
Sent: Tuesday, May 30, 2006 10:45 AM
To: htm...@li...
Subject: [Htmlparser-user] RE: [Htmlparser-user] Finding a whole word

Derrick,

Thank you so much for your quick response and for getting back to me with the solution.

Now that I'm able to correctly count the number of times a word appears in an HTML file, my next task is to find the offset (start position) of each word. I'm guessing that I probably have to use NodeVisitor with StringBean, but I'd like to get some guidelines before I dig into the APIs.

So, for the following sample HTML:

<HTML>
<head>
<title>Test HTML</title>
</head>
<body>
<p>AAAAA BBBBB AAA<font color='red'>AA</font> BBBBB AAAAA</p>
</body>
</HTML>

If I search for 'AAAAA', I want to get three matches with their starting positions (offsets), such as:

Match 1 offset = 58
Match 2 offset = 70
Match 3 offset = 108

Could you show me how to achieve this?

Thanks a lot,
Jay

-----Original Message-----
From: htm...@li... [mailto:htm...@li...] On Behalf Of Derrick Oswald
Sent: Monday, May 29, 2006 4:45 AM
To: htm...@li...
Subject: Re: [Htmlparser-user] Finding a whole word

Jay

The text you want can be obtained with the StringBean if Collapse is false. When collapse is true, there is a bug in the StringBean.
I've logged this as bug #1496863 StringBean collapse() adds extra whitespace
<http://sourceforge.net/tracker/index.php?func=detail&aid=1496863&group_id=24399&atid=381399>
so you can track it.

Derrick

Jay Kim wrote:
> Hi,
>
> I'm trying to get the word count using htmlparser, but it doesn't seem
> to be able to handle the following example.
>
> Let's say the source html looks like this:
>
> <HTML>
> <head>
> <title>Test HTML</title>
> </head>
> <body>
> <p>AAAAA BBBBB AAA<font color='red'>AA</font> BBBBB AAAAA</p>
> </body>
> </HTML>
>
> And, if you load it in a browser, you'll see the word 'AAAAA' three
> times.
>
> But, if you parse this html, it returns following nodes:
>
> AAAAA BBBBB AAA AA BBBBB AAAAA
>
> So, it breaks down the second 'AAAAA' into two words because of the
> font tag in the middle. And, the word count from the parsed text would
> be "2".
>
> Is there any way that I can get the same text/string/word that I see
> on the browser?
>
> Thanks,
>
> Jay

_______________________________________________
Htmlparser-user mailing list
Htm...@li...
https://lists.sourceforge.net/lists/listinfo/htmlparser-user
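
As a rough illustration of the offset question raised in this thread, here is a minimal plain-Java sketch. It does not use the htmlparser API; it is a swapped-in technique that scans the raw source, skips everything between '<' and '>', and records the source offset of each visible character, so that a whole-word match in the visible text can be mapped back to its start position in the source. The class name WordOffsets and the method find are hypothetical names chosen for this example. It assumes simple markup: it does not handle comments, scripts, entities, or a '>' inside an attribute value, and it treats every tag as inline, so text separated only by block-level tags would be joined into one word.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of htmlparser: maps whole-word matches in the
// visible text back to their start offsets in the raw HTML source.
final class WordOffsets {

    static List<Integer> find(String html, String word) {
        List<Integer> hits = new ArrayList<>();
        if (word.isEmpty()) {
            return hits; // nothing sensible to search for
        }

        // Build the visible text and, in parallel, the source offset of each
        // visible character. Everything between '<' and '>' is skipped.
        StringBuilder text = new StringBuilder();
        List<Integer> sourceOffset = new ArrayList<>();
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;
            } else if (c == '>' && inTag) {
                inTag = false;
            } else if (!inTag) {
                text.append(c);
                sourceOffset.add(i);
            }
        }

        // Search the visible text and keep only whole-word matches, i.e.
        // matches not directly preceded or followed by a letter or digit.
        int from = 0;
        while (true) {
            int at = text.indexOf(word, from);
            if (at < 0) {
                break;
            }
            int end = at + word.length();
            boolean startOk = at == 0
                    || !Character.isLetterOrDigit(text.charAt(at - 1));
            boolean endOk = end == text.length()
                    || !Character.isLetterOrDigit(text.charAt(end));
            if (startOk && endOk) {
                hits.add(sourceOffset.get(at)); // offset in the raw source
            }
            from = at + 1;
        }
        return hits;
    }

    public static void main(String[] args) {
        String line =
            "<p>AAAAA BBBBB AAA<font color='red'>AA</font> BBBBB AAAAA</p>";
        System.out.println(find(line, "AAAAA")); // prints [3, 15, 52]
    }
}
```

The offsets printed here are relative to the start of the single <p> line used in main. Offsets for the complete sample file, such as the 58, 70 and 108 mentioned above, depend on the file's exact whitespace and line endings, which the sketch makes no assumption about.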