Thread: Re: [Htmlparser-user] Finding a whole word


Derrick,

Thanks for your comments. I still have to experiment with different
files to see what's going on with the start position.
Assuming that I can get the correct position/offset for the specific
word and store that position information, the next step is to
create an HTML tag at that position. For example,

Original source:

And, let's say the search word is "GGG", and location is identified, and
I need to create the following HTML.

I've tried to achieve this with StringBean by overriding visitTag,
visitStringNode, and so on, but I don't know if that's the best way.
Once you know the word position, you shouldn't have to go through
each node using a Visitor, right?
Also, I want to preserve the original HTML format as much as possible.
Please let me know what would be the best way to generate modified HTML
by inserting some custom tags at the pre-selected locations.
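
One way to insert tags without walking the nodes a second time is to edit the original source string directly at the stored offsets. The sketch below is plain Java and does not use the HTMLParser API; `TagInserter`, `Span`, the sample HTML, and the `<span class="hit">` wrapper are hypothetical names chosen only for illustration. It assumes the offsets are absolute character positions in the exact text the parser read and that the spans do not overlap. Working from the last span backwards keeps the earlier offsets valid, and everything outside the spans is left untouched, which preserves the original formatting.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

final class TagInserter {
    /** Half-open character range [start, end) in the original source (hypothetical helper). */
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Wraps every span in openTag/closeTag. Spans are assumed not to overlap. */
    static String wrap(String source, List<Span> spans, String openTag, String closeTag) {
        List<Span> ordered = new ArrayList<>(spans);
        // Process the highest offset first so insertions never shift a span
        // that has not been handled yet.
        ordered.sort(Comparator.comparingInt((Span s) -> s.start).reversed());
        StringBuilder out = new StringBuilder(source);
        for (Span s : ordered) {
            out.insert(s.end, closeTag);   // close first: it sits after the word
            out.insert(s.start, openTag);  // then open: does not move s.start itself
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<p>AAA <b>GGG</b> BBB</p>"; // made-up sample, not the original page
        int at = html.indexOf("GGG");
        List<Span> hits = Collections.singletonList(new Span(at, at + "GGG".length()));
        System.out.println(wrap(html, hits, "<span class=\"hit\">", "</span>"));
    }
}
```

The trade-off of this approach is that it trusts the offsets completely: if they were computed against decoded or whitespace-normalized text rather than the raw source, the tags will land in the wrong place.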

As always, thank you very much for your kind help,
Jay

-----Original Message-----
From: htm...@li...
[mailto:htm...@li...] On Behalf Of
Derrick Oswald
Sent: Thursday, June 01, 2006 4:56 AM
To: htm...@li...
Subject: Re: [Htmlparser-user] Finding a whole word

Jay,

Your count may be off because the parser may be fetching a different
page from the one you counted.
HTTP servers may change the page based on the user agent.
It's only really reliable from a file, unless you save the contents of
the page the parser is working with (see Page.getText()).
And, yes, \r\n are turned into a single \n in the Text node, but the
node positions don't count this.
The Page class has getRow() and getColumn() so you can compare with the
numbers reported by a text editor, which saves manual counting. Note
that these are zero-based, not one-based like most editors.
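
To make the zero-based convention and the \r\n point concrete, here is a small stand-alone sketch that converts an absolute character offset into a row and column. It is not the library's implementation of getRow() or getColumn(); `RowColumn` and `locate` are hypothetical names, and the sketch assumes the offset counts every raw character, so a \r\n pair takes two positions but ends only one line.

```java
final class RowColumn {
    /**
     * Returns {row, column}, both zero-based, for an absolute character
     * offset into the raw source. A "\r\n" pair ends one line but still
     * occupies two offsets.
     */
    static int[] locate(String source, int offset) {
        if (offset < 0 || offset > source.length()) {
            throw new IllegalArgumentException("offset out of range: " + offset);
        }
        int row = 0;
        int lineStart = 0;
        for (int i = 0; i < offset; i++) {
            char c = source.charAt(i);
            boolean crBeforeLf = c == '\r'
                    && i + 1 < source.length()
                    && source.charAt(i + 1) == '\n';
            // A lone '\r' or any '\n' ends the line; the '\r' of "\r\n" does not.
            if (c == '\n' || (c == '\r' && !crBeforeLf)) {
                row++;
                lineStart = i + 1;
            }
        }
        return new int[] {row, offset - lineStart};
    }

    public static void main(String[] args) {
        String raw = "ab\r\ncd GGG"; // made-up sample text
        int[] rc = locate(raw, raw.indexOf("GGG"));
        // Add 1 to each value to compare with a typical one-based editor display.
        System.out.println("row=" + rc[0] + " column=" + rc[1]);
    }
}
```

Counting the same sample after the line ending has been collapsed to a single \n gives an offset one smaller for every preceding \r\n, which is one way a hand count and a reported position can drift apart.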

Your second problem is really up to you, the programmer, to remember
which nodes the strings came from.
The string offset is only relative to the node position, which is
absolute on the page.
If I were you I would create an index of node position and string
position as you form the text in visitStringNode.
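
The index suggested above could look something like the following sketch. It is plain Java and independent of the parser: `TextIndex`, `add`, and `toPagePosition` are hypothetical names, and the caller is assumed to pass each text node's string together with that node's absolute start position on the page (for example, from a visitor callback). The mapping inside a node is exact only when the node text matches the source character for character; collapsed \r\n pairs or decoded entities would shift it.

```java
import java.util.ArrayList;
import java.util.List;

final class TextIndex {
    private final StringBuilder text = new StringBuilder();
    // Each entry is {offset in the collected text, node start position on the page}.
    private final List<int[]> entries = new ArrayList<>();

    /** Records one text node; call in document order. */
    void add(String nodeText, int nodePagePosition) {
        entries.add(new int[] {text.length(), nodePagePosition});
        text.append(nodeText);
    }

    /** The text collected so far, suitable for searching. */
    String text() {
        return text.toString();
    }

    /** Maps an offset in the collected text back to a page position. */
    int toPagePosition(int textOffset) {
        int[] owner = null;
        for (int[] e : entries) {
            if (e[0] <= textOffset) {
                owner = e;      // last node starting at or before the offset
            } else {
                break;
            }
        }
        if (owner == null || textOffset >= text.length()) {
            throw new IllegalArgumentException("offset not in indexed text: " + textOffset);
        }
        // Node position is absolute on the page; add the offset within the node.
        return owner[1] + (textOffset - owner[0]);
    }

    public static void main(String[] args) {
        TextIndex index = new TextIndex();
        index.add("AAA BBB ", 10);  // made-up node positions
        index.add("CCC BBB", 40);
        int second = index.text().lastIndexOf("BBB");
        System.out.println(index.toPagePosition(second));
    }
}
```

A linear scan is enough for a handful of nodes; for large pages a binary search over the entries would be the natural replacement.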

Derrick

Jay Kim wrote:

>Hi Derrick,
>
>Thanks very much for your help. I've tried your sample code, and it
>gives me the right text that I can compare with.
>But I have a couple of issues getting the offset of the search word.
>
>1. When I try Text.getStartPosition(), it doesn't match the
>character count that I get from the HTML source file - yeah, I counted
>one by one myself. It's about 15 characters off. For example, the
>character count that I got from the parser was 154, as opposed to 139
>that I counted from the file.
>The numbers are still off whether I include or exclude newline
>characters.
>Are there some other factors that I'm not aware of?
>
>2. After I find the node that contains the word (string) that I'm
>searching for, I need to get the offset of that word. For example,
>	Node text = AAA BBB CCC DDD BBB EEE
>And, if the word that I'm searching for is the second 'BBB', is there
>any reliable way to get the offset of that word? (I can't just get the
>index from that string because the HTML string could be different.)
>Please let me know.
>
>Thanks,
>
>Jay
>

_______________________________________________
Htmlparser-user mailing list
Htm...@li...
https://lists.sourceforge.net/lists/listinfo/htmlparser-user