[Htmlparser-user] RE: [Htmlparser-user] RE: [Htmlparser-user] RE:
From: Jay K. <jy...@eq...> - 2006-06-01 02:16:40
Hi Derrick,
Thanks very much for your help. I've tried your sample code, and it
gives me the right text to compare against.
But I have a couple of issues getting the offset of the word I'm searching for.
1. When I try Text.getStartPosition(), the result doesn't match the
character count that I get from the HTML source file (yes, I counted
one by one myself). It's about 15 characters off. For example, the
character count that I got from the parser was 154, as opposed to the 139
that I counted from the file.
The numbers are still off whether I include or exclude newline characters.
Are there some other factors that I'm not aware of?
2. After I find the node that contains the word (string) that I'm
searching for, I need to get the offset of that word. For example,
Node text = AAA BBB CCC DDD BBB EEE
And if the word that I'm searching for is the second 'BBB', is there
any reliable way to get the offset of that word? (I can't just take the
index from that string because the HTML string could be different.)
Please let me know.
Thanks,

Jay

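For the second question, one possible approach is to search the node's raw text and add each match index to the node's start position. The sketch below is plain Java and does not call htmlparser; the class name WordOffsets and the method find are hypothetical. It assumes the string passed in is the raw, undecoded node text (what getText () returns before Translate.decode is applied), so that an index into it lines up with the units reported by getStartPosition ().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordOffsets {
    /**
     * Returns the page offset of every whole-word occurrence of word.
     * rawText is assumed to be the undecoded text of one node and
     * nodeStart the page offset of its first character.
     */
    public static List<Integer> find(String rawText, int nodeStart, String word) {
        List<Integer> offsets = new ArrayList<>();
        // \b anchors the match at word boundaries; quote() keeps the word literal.
        Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b");
        Matcher m = p.matcher(rawText);
        while (m.find()) {
            offsets.add(nodeStart + m.start());
        }
        return offsets;
    }

    public static void main(String[] args) {
        // 100 is a made-up node start position, used only for illustration.
        List<Integer> hits = find("AAA BBB CCC DDD BBB EEE", 100, "BBB");
        System.out.println(hits);        // prints [104, 116]
        System.out.println(hits.get(1)); // offset of the second 'BBB': 116
    }
}
```

If the search has to run on decoded text instead (entities translated, whitespace collapsed), the indices no longer map one-to-one onto the source, and a separate index map would be needed.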
-----Original Message-----
From: htm...@li...
[mailto:htm...@li...] On Behalf Of
Derrick Oswald
Sent: Tuesday, May 30, 2006 3:16 PM
To: htm...@li...
Subject: Re: [Htmlparser-user] RE: [Htmlparser-user] RE:
[Htmlparser-user] Finding a whole word
You probably want to override visitStringNode (Text string) in the
StringBean like you've done, but you'll need to be smarter about it,
for example by keeping track of where you are (in whitespace or not), perhaps by
looking at the last character in the StringBuffer and the first
character in the incoming text (the default behaviour is to just slap
them together - see below). You will also need to parse the incoming text to break
it into words. Each node has a getStartPosition () method that will tell
you where you are in the HTML page in units of characters.
/**
 * Appends the text to the output.
 * @param string The text node.
 */
public void visitStringNode (Text string)
{
    if (!mIsScript && !mIsStyle)
    {
        String text = string.getText ();
        if (!mIsPre)
        {
            text = Translate.decode (text);
            if (getReplaceNonBreakingSpaces ())
                text = text.replace ('\u00a0', ' ');
            if (getCollapse ())
                collapse (mBuffer, text);
            else
                mBuffer.append (text);
        }
        else
            mBuffer.append (text);
    }
}
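
The bookkeeping described above (remember whether the previous chunk ended inside a word, and record where each word starts) can be sketched roughly as follows. This is plain Java with no htmlparser dependency; WordCollector, Word, add and flush are hypothetical names, not part of the library. In an overridden visitStringNode one would presumably call add (string.getText (), string.getStartPosition ()). The sketch assumes the text is undecoded, so one character of text is one character of page offset, and it treats only Character.isWhitespace characters as separators.

```java
import java.util.ArrayList;
import java.util.List;

public class WordCollector {
    /** A word and the page offset of its first character. */
    public static final class Word {
        public final String text;
        public final int offset;
        Word(String text, int offset) { this.text = text; this.offset = offset; }
    }

    private final List<Word> words = new ArrayList<>();
    private final StringBuilder current = new StringBuilder(); // word in progress
    private int currentStart = -1;                             // its page offset

    /** Feed one text chunk and the page offset of its first character. */
    public void add(String text, int startOffset) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                flush(); // whitespace ends the word in progress
            } else {
                if (current.length() == 0) {
                    currentStart = startOffset + i; // first character of a new word
                }
                current.append(c); // a pending word continues across chunks
            }
        }
    }

    /** Closes the pending word; could also be called at block-level tags. */
    public void flush() {
        if (current.length() > 0) {
            words.add(new Word(current.toString(), currentStart));
            current.setLength(0);
        }
    }

    /** Returns all words seen so far, closing any pending word first. */
    public List<Word> words() {
        flush();
        return words;
    }

    public static void main(String[] args) {
        WordCollector c = new WordCollector();
        // Illustrative chunks and offsets for the <p> line of the sample HTML.
        c.add("AAAAA BBBBB AAA", 57);
        c.add("AA", 90);            // text inside the <font> tag
        c.add(" BBBBB AAAAA", 99);
        for (Word w : c.words()) {
            System.out.println(w.text + " @ " + w.offset);
        }
    }
}
```

Because the word in progress is only closed on whitespace, 'AAA' followed by 'AA' from the font tag comes out as a single 'AAAAA' whose offset is that of the first chunk's 'AAA'. A real visitor would probably also need to call flush () when it sees block-level tags, so that text in adjacent paragraphs is not glued together.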
Jay Kim wrote:
>Let me describe the problems of using StringBean as a
>NodeVisitor in more detail.
>Here is my code snippet:
>
>    private class TestVisitor extends StringBean {
>        @Override
>        public void visitStringNode(Text text) {
>            System.out.println("text=" + text.getText());
>        }
>    }
>
>    TestVisitor visitor = new TestVisitor();
>    visitor.setCollapse(false);
>    htmlParser.visitAllNodesWith(visitor);
>
>And if I feed it the sample HTML below, the visitStringNode() method
>does not detect the second 'AAAAA' as one word; instead, it splits it into
>two words ('AAA' and 'AA'), which is basically the same problem that I
>described in the first email.
>Please let me know.
>Thanks,
>
>Jay
>
>-----Original Message-----
>From: htm...@li...
>[mailto:htm...@li...] On Behalf Of Jay
>Kim
>Sent: Tuesday, May 30, 2006 10:45 AM
>To: htm...@li...
>Subject: [Htmlparser-user] RE: [Htmlparser-user] Finding a whole word
>
>Derrick,
>
>Thank you so much for your quick response and for getting back to me with
>the solution.
>Now that I'm able to correctly count the number of words that appear in an
>HTML file, my next task is to find the offset (start position) of
>each word. I'm guessing that I probably have to use NodeVisitor with
>StringBean, but I'd like to get some guidelines before I dig into the
>APIs.
>So, for the following sample HTML:
>
><HTML>
><head>
><title>Test HTML</title>
></head>
><body>
><p>AAAAA BBBBB AAA<font color='red'>AA</font> BBBBB AAAAA</p>
></body>
></HTML>
>
>If I search for 'AAAAA', I want to get three matches with their starting
>positions (offsets), such as,
> Match 1 offset = 58
> Match 2 offset = 70
> Match 3 offset = 108
>
>Could you show me how to achieve this?
>Thanks a lot,
>
>Jay
>
>-----Original Message-----
>From: htm...@li...
>[mailto:htm...@li...] On Behalf Of
>Derrick Oswald
>Sent: Monday, May 29, 2006 4:45 AM
>To: htm...@li...
>Subject: Re: [Htmlparser-user] Finding a whole word
>
>Jay
>The text you want can be obtained with the StringBean if Collapse is
>false.
>
>When collapse is true, there is a bug in the StringBean.
>I've logged this as bug #1496863 StringBean collapse() adds extra
>whitespace
><http://sourceforge.net/tracker/index.php?func=detail&aid=1496863&group_id=24399&atid=381399>
>so you can track it.
>Derrick
>
>Jay Kim wrote:
>
>>Hi,
>>
>>I'm trying to get the word count using htmlparser, but it doesn't seem
>>to be able to handle the following example.
>>
>>Let's say the source html looks like this:
>>
>><HTML>
>><head>
>><title>Test HTML</title>
>></head>
>><body>
>><p>AAAAA BBBBB AAA<font color='red'>AA</font> BBBBB AAAAA</p>
>></body>
>></HTML>
>>
>>And, if you load it in a browser, you'll see the word 'AAAAA' three
>>times.
>>
>>But, if you parse this html, it returns the following nodes:
>>
>>AAAAA BBBBB AAA AA BBBBB AAAAA
>>
>>So, it breaks down the second 'AAAAA' into two words because of the
>>font tag in the middle. And, the word count from the parsed text would
>>be "2".
>>
>>Is there any way that I can get the same text/string/word that I see
>>on the browser?
>>
>>Thanks,
>>
>>Jay
>
>-------------------------------------------------------
>All the advantages of Linux Managed Hosting--Without the Cost and Risk!
>Fully trained technicians. The highest number of Red Hat certifications in
>the hosting industry. Fanatical Support. Click to learn more
>http://sel.as-us.falkag.net/sel?cmd=lnk&kid=107521&bid=248729&dat=121642
>_______________________________________________
>Htmlparser-user mailing list
>Htm...@li...
>https://lists.sourceforge.net/lists/listinfo/htmlparser-user