[Htmlparser-user] RE: [Htmlparser-user] RE: [Htmlparser-user] RE:
From: Jay K. <jy...@eq...> - 2006-06-01 02:16:40
Hi Derrick,

Thanks very much for your help. I've tried your sample code, and it gives me the right text to compare against. But I have a couple of issues getting the offset of the word I'm searching for.

1. When I try Text.getStartPosition(), the result doesn't match the character count that I get from the HTML source file (yes, I counted one by one myself). It's about 15 characters off. For example, the character count that I got from the parser was 154, as opposed to the 139 that I counted from the file. The numbers are still off whether I include or exclude newline characters. Are there some other factors that I'm not aware of?

2. After I find the node that contains the word (string) that I'm searching for, I need to get the offset of that word. For example:

   Node text = AAA BBB CCC DDD BBB EEE

   If the word that I'm searching for is the second 'BBB', is there any reliable way to get the offset of that word? (I can't just take the index from that string, because the HTML string could be different.)

Please let me know.

Thanks,

Jay

-----Original Message-----
From: htm...@li... [mailto:htm...@li...] On Behalf Of Derrick Oswald
Sent: Tuesday, May 30, 2006 3:16 PM
To: htm...@li...
Subject: Re: [Htmlparser-user] RE: [Htmlparser-user] RE: [Htmlparser-user] Finding a whole word

You probably want to override visitStringNode (Text string) in the StringBean like you've done, but you'll need to be smarter about it. Like keeping track of where you are (in whitespace or not), perhaps by looking at the last character in the StringBuffer and the first character in the incoming text (the default behaviour is to just slap them together - see below). That and parsing the incoming text to break it into words.

Each node has a getStartPosition () method that will tell you where you are in the HTML page in units of characters.

    /**
     * Appends the text to the output.
     * @param string The text node.
     */
    public void visitStringNode (Text string)
    {
        if (!mIsScript && !mIsStyle)
        {
            String text = string.getText ();
            if (!mIsPre)
            {
                text = Translate.decode (text);
                if (getReplaceNonBreakingSpaces ())
                    text = text.replace ('\u00a0', ' ');
                if (getCollapse ())
                    collapse (mBuffer, text);
                else
                    mBuffer.append (text);
            }
            else
                mBuffer.append (text);
        }
    }

Jay Kim wrote:

>Let me describe more on the problems of using StringBean as a
>NodeVisitor.
>Here is my code snippet:
>
>    private class TestVisitor extends StringBean {
>        @Override
>        public void visitStringNode(Text text) {
>            System.out.println("text=" + text.getText());
>        }
>    }
>
>    TestVisitor visitor = new TestVisitor();
>    visitor.setCollapse(false);
>    htmlParser.visitAllNodesWith(visitor);
>
>And, if I feed the sample HTML below, the visitStringNode() method does
>not detect the second 'AAAAA' as one word, but instead splits it into
>two words ('AAA' and 'AA'), which is basically the same problem that I
>described in the first email.
>Please let me know.
>Thanks,
>
>Jay
>
>-----Original Message-----
>From: htm...@li...
>[mailto:htm...@li...] On Behalf Of Jay
>Kim
>Sent: Tuesday, May 30, 2006 10:45 AM
>To: htm...@li...
>Subject: [Htmlparser-user] RE: [Htmlparser-user] Finding a whole word
>
>Derrick,
>
>Thank you so much for your quick response, and for getting back to me
>with the solution.
>Now that I'm able to correctly count the number of times a word appears
>in an HTML file, my next task is to find out the offset (start position)
>of each word. I'm guessing that I probably have to use NodeVisitor with
>StringBean, but I'd like to get some guidelines before I dig into the
>APIs.
>So, for the following sample HTML:
>
><HTML>
><head>
><title>Test HTML</title>
></head>
><body>
><p>AAAAA BBBBB AAA<font color='red'>AA</font> BBBBB AAAAA</p>
></body>
></HTML>
>
>If I search for 'AAAAA', I want to get three matches with their starting
>positions (offsets), such as,
>    Match 1 offset = 58
>    Match 2 offset = 70
>    Match 3 offset = 108
>
>Could you show me how to achieve this?
>Thanks a lot,
>
>Jay
>
>-----Original Message-----
>From: htm...@li...
>[mailto:htm...@li...] On Behalf Of
>Derrick Oswald
>Sent: Monday, May 29, 2006 4:45 AM
>To: htm...@li...
>Subject: Re: [Htmlparser-user] Finding a whole word
>
>Jay
>The text you want can be obtained with the StringBean if Collapse is
>false.
>
>When collapse is true, there is a bug in the StringBean.
>I've logged this as bug #1496863 StringBean collapse() adds extra
>whitespace
><http://sourceforge.net/tracker/index.php?func=detail&aid=1496863&group_id=24399&atid=381399>
>so you can track it.
>Derrick
>
>Jay Kim wrote:
>
>>Hi,
>>
>>I'm trying to get the word count using htmlparser, but it doesn't seem
>>to be able to handle the following example.
>>
>>Let's say the source html looks like this:
>>
>><HTML>
>><head>
>><title>Test HTML</title>
>></head>
>><body>
>><p>AAAAA BBBBB AAA<font color='red'>AA</font> BBBBB AAAAA</p>
>></body>
>></HTML>
>>
>>And, if you load it in a browser, you'll see the word 'AAAAA' three
>>times.
>>
>>But, if you parse this html, it returns the following nodes:
>>
>>AAAAA BBBBB AAA AA BBBBB AAAAA
>>
>>So, it breaks down the second 'AAAAA' into two words because of the
>>font tag in the middle. And, the word count from the parsed text would
>>be "2".
>>
>>Is there any way that I can get the same text/string/word that I see
>>on the browser?
>>
>>Thanks,
>>
>>Jay

_______________________________________________
Htmlparser-user mailing list
Htm...@li...
https://lists.sourceforge.net/lists/listinfo/htmlparser-user
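
On question 2 in the message above (locating the second 'BBB' inside a text node, or a word such as the second 'AAAAA' that an inline tag splits across two nodes), one possible approach is to walk the raw text of each text node character by character and remember the page offset at which the current word began. The sketch below is a standalone illustration of that idea and is not htmlparser code. `Segment` and `find` are hypothetical names; each `Segment` stands in for one text node, holding the start offset a call like getStartPosition () would supply and the node's raw, undecoded text. The offsets in `main` are illustrative values typed in by hand. Only whitespace is treated as a word separator, so text in adjacent block-level elements with no whitespace between them would be joined, and entity references are compared as written rather than decoded. It does not address the 15-character discrepancy in question 1.

```java
import java.util.ArrayList;
import java.util.List;

public class WordOffsets {

    /** Hypothetical stand-in for one text node: page offset of its first character plus its raw text. */
    public static final class Segment {
        final int start;
        final String text;

        public Segment(int start, String text) {
            this.start = start;
            this.text = text;
        }
    }

    /**
     * Returns the page offset of every whitespace-delimited word equal to {@code word}.
     * A word may continue from one segment into the next (e.g. across an inline tag).
     */
    public static List<Integer> find(List<Segment> segments, String word) {
        List<Integer> hits = new ArrayList<>();
        StringBuilder current = new StringBuilder(); // word assembled so far
        int wordStart = -1;                          // page offset where it began
        for (Segment s : segments) {
            for (int i = 0; i < s.text.length(); i++) {
                char c = s.text.charAt(i);
                if (Character.isWhitespace(c)) {
                    // Whitespace ends the current word, if there is one.
                    if (current.length() > 0 && current.toString().equals(word)) {
                        hits.add(wordStart);
                    }
                    current.setLength(0);
                } else {
                    if (current.length() == 0) {
                        wordStart = s.start + i; // first character of a new word
                    }
                    current.append(c);
                }
            }
        }
        // Flush the last word when the input does not end in whitespace.
        if (current.length() > 0 && current.toString().equals(word)) {
            hits.add(wordStart);
        }
        return hits;
    }

    public static void main(String[] args) {
        // Illustrative offsets typed in by hand, not values produced by the parser.
        List<Segment> paragraph = List.of(
                new Segment(57, "AAAAA BBBBB AAA"),
                new Segment(90, "AA"),
                new Segment(99, " BBBBB AAAAA"));
        System.out.println(find(paragraph, "AAAAA")); // prints [57, 69, 106]

        List<Segment> single = List.of(new Segment(100, "AAA BBB CCC DDD BBB EEE"));
        System.out.println(find(single, "BBB").get(1)); // prints 116
    }
}
```

In a visitor, the segments could be collected inside visitStringNode by pairing each node's start position with its text, then passed to `find` once the traversal is done. Keeping the word buffer alive across segment boundaries is what lets 'AAA' and 'AA' be reported as a single 'AAAAA' at the offset of its first character.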