[Htmlparser-user] RE: [Htmlparser-user] RE: [Htmlparser-user] Fin...
From: Jay K. <jy...@eq...> - 2006-05-30 20:15:49
Let me describe in more detail the problem with using StringBean as a
NodeVisitor.
Here is my code snippet:
private class TestVisitor extends StringBean {
    @Override
    public void visitStringNode(Text text) {
        System.out.println("text=" + text.getText());
    }
}
TestVisitor visitor = new TestVisitor();
visitor.setCollapse(false);
htmlParser.visitAllNodesWith(visitor);
If I feed it the sample HTML below, the visitStringNode() method does
not see the second 'AAAAA' as one word. Instead, it receives it as two
separate text nodes ('AAA' and 'AA'), which is basically the same problem
that I described in my first email.
Please let me know how to handle this.
Thanks,
Jay
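One way to get source offsets for words that span inline tags is to build the tag-free text while remembering, for every kept character, its position in the original markup. The sketch below illustrates that idea with plain Java only. It does not use HTMLParser, and the class WordOffsets and its find() method are hypothetical names made up for this example. The tag scanner is deliberately naive: it assumes well-formed tags and ignores comments, scripts, entities, and '>' characters inside attribute values. It also assumes the search word consists of letters or digits, so that \b word boundaries behave as expected.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HTMLParser: finds whole-word matches in the
// tag-free text and reports where each match starts in the original markup.
class WordOffsets {

    static List<Integer> find(String html, String word) {
        if (word.isEmpty()) {
            throw new IllegalArgumentException("word must not be empty");
        }
        StringBuilder text = new StringBuilder();       // markup with tags removed
        List<Integer> sourceIndex = new ArrayList<>();  // text position -> html position
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;                 // naive: every '<' opens a tag
            } else if (c == '>' && inTag) {
                inTag = false;                // closes the current tag
            } else if (!inTag) {
                text.append(c);               // keep the character and
                sourceIndex.add(i);           // remember where it came from
            }
        }
        // Whole-word search over the joined text, so 'AAA' + 'AA' reads as 'AAAAA'.
        Matcher m = Pattern.compile("\\b" + Pattern.quote(word) + "\\b").matcher(text);
        List<Integer> offsets = new ArrayList<>();
        while (m.find()) {
            offsets.add(sourceIndex.get(m.start()));
        }
        return offsets;
    }
}
```

Calling WordOffsets.find("<p>AAAAA BBBBB AAA<font color='red'>AA</font> BBBBB AAAAA</p>", "AAAAA") returns [3, 15, 52]. The offsets are relative to the string passed in, so they differ from the whole-file offsets quoted in the thread. With a real parser, the same bookkeeping could be driven by the position each text node reports instead of a hand-written scanner.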
-----Original Message-----
From: htm...@li...
[mailto:htm...@li...] On Behalf Of Jay
Kim
Sent: Tuesday, May 30, 2006 10:45 AM
To: htm...@li...
Subject: [Htmlparser-user] RE: [Htmlparser-user] Finding a whole word
Derrick,
Thank you so much for your quick response and for getting back to me with
the solution.
Now that I'm able to correctly count how many times a word appears in an
HTML file, my next task is to find the offset (start position) of each
occurrence. I'm guessing that I probably have to use NodeVisitor with
StringBean, but I'd like to get some guidelines before I dig into the
APIs.
So, for the following sample HTML:
<HTML>
<head>
<title>Test HTML</title>
</head>
<body>
<p>AAAAA BBBBB AAA<font color='red'>AA</font> BBBBB AAAAA</p>
</body>
</HTML>
If I search for 'AAAAA', I want to get three matches with their starting
positions (offsets), such as,
Match 1 offset = 58
Match 2 offset = 70
Match 3 offset = 108
Could you show me how to achieve this?
Thanks a lot,
Jay
-----Original Message-----
From: htm...@li...
[mailto:htm...@li...] On Behalf Of
Derrick Oswald
Sent: Monday, May 29, 2006 4:45 AM
To: htm...@li...
Subject: Re: [Htmlparser-user] Finding a whole word
Jay
The text you want can be obtained with the StringBean if Collapse is
false.
When collapse is true, there is a bug in the StringBean.
I've logged this as bug #1496863 StringBean collapse() adds extra
whitespace
<http://sourceforge.net/tracker/index.php?func=detail&aid=1496863&group_id=24399&atid=381399>
so you can track it.
Derrick
Jay Kim wrote:
> Hi,
>
> I'm trying to get the word count using htmlparser, but it doesn't seem
> to be able to handle the following example.
>
> Let's say the source html looks like this:
>
> <HTML>
>
> <head>
>
> <title>Test HTML</title>
>
> </head>
>
> <body>
>
> <p>AAAAA BBBBB AAA<font color='red'>AA</font> BBBBB AAAAA</p>
>
> </body>
>
> </HTML>
>
> And, if you load it in a browser, you'll see the word 'AAAAA' three
> times.
>
> But, if you parse this html, it returns the following nodes:
>
> AAAAA BBBBB AAA AA BBBBB AAAAA
>
> So, it breaks down the second 'AAAAA' into two words because of the
> font tag in the middle. And, the word count from the parsed text would
> be "2".
>
> Is there any way that I can get the same text/string/word that I see
> on the browser?
>
> Thanks,
>
> Jay
>
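Once the page text has been extracted without extra whitespace (for example with collapse turned off, as suggested in the reply above), counting whole-word occurrences comes down to a regular-expression search. The sketch below uses only the Java standard library. WordCount and count() are hypothetical names for this example, and it assumes the search word is made of letters or digits so that \b marks its boundaries.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HTMLParser: counts whole-word occurrences
// of 'word' in text that has already been extracted from the page.
class WordCount {

    static int count(String text, String word) {
        if (word.isEmpty()) {
            throw new IllegalArgumentException("word must not be empty");
        }
        // \b on both sides keeps 'AAAAA' from matching inside a longer word.
        Matcher m = Pattern.compile("\\b" + Pattern.quote(word) + "\\b").matcher(text);
        int matches = 0;
        while (m.find()) {
            matches++;
        }
        return matches;
    }
}
```

For "AAAAA BBBBB AAAAA BBBBB AAAAA" this counts 3 occurrences of 'AAAAA'. For "AAAAA BBBBB AAA AA BBBBB AAAAA", where a space has been inserted at the font tag, it counts 2, which is the behaviour described in the original question.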
-------------------------------------------------------
All the advantages of Linux Managed Hosting--Without the Cost and Risk!
Fully trained technicians. The highest number of Red Hat certifications in
the hosting industry. Fanatical Support. Click to learn more
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=107521&bid=248729&dat=121642
_______________________________________________
Htmlparser-user mailing list
Htm...@li...
https://lists.sourceforge.net/lists/listinfo/htmlparser-user