Re: [Htmlparser-user] Finding a whole word
From: Jay K. <jy...@eq...> - 2006-06-02 22:19:36
Derrick,
I ran into another issue while finding the location of a specific
word.
It happened when I tested with a table. For example, here is the source
of a sample HTML page:
<HTML>
<head>
<title>Test HTML </title>
</head>
<body>
<table border=1>
<tr>
<td>AAA</td>
<td>BBB</td>
<td>CCC</td>
</tr>
<tr>
<td>BBB</td>
<td>CCC</td>
<td>DDD</td>
</tr>
<tr>
<td>AAA</td>
<td>BBB</td>
<td>CCC</td>
</tr>
</table>
</body>
</HTML>
And, if I load it in a browser, it'll look like this (with borders):
AAA BBB CCC
BBB CCC DDD
AAA BBB CCC
So, if I select 'BBB' in (row[2], col[1]) in IE and get the word
count, it returns 2 because it counts 'BBB' in (row[1], col[2]) first.
But htmlparser traverses the nodes differently: it seems to
detect 'BBB' in (row[2], col[1]) before it detects the one in
row[1].
Is there any way to configure the parser to visit the first row
first (that is, top-down as displayed)?
Please let me know if anything is not clear to you.
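To make the expected count concrete, here is a minimal sketch in plain Java that does not use htmlparser at all. It assumes the cell texts have already been collected into a list in source order (row by row); the class name WordOrder and the method occurrenceRank are made up for illustration. It only shows what "the second BBB" means when matches are counted top-down, and it says nothing about the order in which the parser actually visits nodes.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordOrder {
    /**
     * Returns the 1-based rank of the first whole-word match found in
     * cells.get(targetCell), counting matches in list order, or -1 if
     * that cell contains no whole-word match.
     */
    static int occurrenceRank(List<String> cells, String word, int targetCell) {
        // \b limits matches to whole words; this assumes the word itself
        // starts and ends with a letter, digit or underscore.
        Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b");
        int count = 0;
        for (int i = 0; i < cells.size(); i++) {
            Matcher m = p.matcher(cells.get(i));
            while (m.find()) {
                count++;
                if (i == targetCell) {
                    return count; // first match inside the target cell
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Cell texts of the sample table, listed row by row as in the source.
        List<String> cells = Arrays.asList(
            "AAA", "BBB", "CCC",
            "BBB", "CCC", "DDD",
            "AAA", "BBB", "CCC");
        // Index 3 is the first cell of the second row.
        System.out.println(occurrenceRank(cells, "BBB", 3)); // prints 2
    }
}
```

If the text nodes are collected in the same order as they appear in the file, the rank computed this way should agree with the count described above.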
Thanks,
Jay
-----Original Message-----
From: htm...@li...
[mailto:htm...@li...] On Behalf Of
Derrick Oswald
Sent: Friday, June 02, 2006 4:53 AM
To: This is the user list of htmlparser
Subject: Re: [Htmlparser-user] Finding a whole word
You'll need to manipulate the children() NodeList of the parent of the
node you want to tag:
NodeList siblings = text_node_with_the_text.getParent().getChildren();
You'll need to change the text of the original node to have only the
text up to the insertion, then add the <a> and </a> nodes and another
text node with the rest of the text.
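The split itself can be sketched with plain strings. This is only an illustration of the three pieces, not htmlparser code: the class name AnchorSplit and the method splitAt are hypothetical, and creating the actual text and tag nodes and putting them into the siblings list is left out.

```java
public class AnchorSplit {
    /** Returns {text before offset, anchor markup, text from offset on}. */
    static String[] splitAt(String text, int offset, String anchorName) {
        if (offset < 0 || offset > text.length()) {
            throw new IllegalArgumentException("offset out of range: " + offset);
        }
        String anchor = "<a name=\"" + anchorName + "\"></a>";
        return new String[] {
            text.substring(0, offset), // stays in the original text node
            anchor,                    // becomes the <a> and </a> nodes
            text.substring(offset)     // goes into the new trailing text node
        };
    }

    public static void main(String[] args) {
        String text = "EEE FFF GGG HHH";
        int offset = text.indexOf("GGG"); // offset relative to this node's text
        String[] parts = splitAt(text, offset, "mytag");
        System.out.println(String.join("", parts));
        // prints: EEE FFF <a name="mytag"></a>GGG HHH
    }
}
```

The offset here is relative to the text of the one node being split, so it has to be derived from the stored page position before calling the method.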
Jay Kim wrote:
>Derrick,
>
>Thanks for your comments. I still have to experiment with different
>files to see what's going on with the start position.
>Assuming that I can get the correct position/offset for the specific
>word, and then store the position information, the next step is to
>create an HTML tag at that position. For example,
>
>Original source:
>
><html>
><head>
><title>test</title>
></head>
><body>
><h1>this is test</h1>
><p>AAA BBB CCC DDD
><p>EEE FFF GGG HHH
>...
></body>
></html>
>
>And, let's say the search word is "GGG" and its location is identified;
>I need to create the following HTML.
>
><html>
><head>
><title>test</title>
></head>
><body>
><h1>this is test</h1>
><p>AAA BBB CCC DDD
><p>EEE FFF <a name="mytag"></a>GGG HHH
>...
></body>
></html>
>
>I've tried StringBean to achieve this by overriding visitTag,
>visitStringNode, etc., but I don't know if it's the best way.
>Because once you know the word position, you don't have to go through
>each node using a Visitor, right?
>Also, I want to preserve the original HTML format as much as possible.
>Please let me know what would be the best way to generate modified HTML
>by inserting some custom tags at the pre-selected locations.
>
>As always, thank you very much for your kind help,
>Jay
>
>-----Original Message-----
>From: htm...@li...
>[mailto:htm...@li...] On Behalf Of
>Derrick Oswald
>Sent: Thursday, June 01, 2006 4:56 AM
>To: htm...@li...
>Subject: Re: [Htmlparser-user] Finding a whole word
>
>Jay,
>
>Your count may be off because the parser may be fetching a different
>page from the one you counted.
>HTTP servers may change the page based on the user agent.
>It's only really reliable from a file, unless you save the contents of
>the page the parser is working with (see Page.getText()).
>And, yes, \r\n are turned into a single \n in the Text node, but the
>node positions don't count this.
>The Page class has getRow() and getColumn() so you can compare with the
>numbers reported by a text editor, which saves manual counting. Note
>that these are zero-based, not one-based like most editors.
>
>Your second problem is really up to you, the programmer, to remember
>which nodes the strings came from.
>The string offset is only relative to the node position, which is
>absolute on the page.
>If I were you I would create an index of node position and string
>position as you form the text in visitStringNode.
>
>Derrick
>
>Jay Kim wrote:
>
>>Hi Derrick,
>>
>>Thanks very much for your help. I've tried your sample code, and it
>>gives me the right text that I can compare with.
>>But I have a couple of issues getting the offset of the word I'm searching for.
>>
>>1. When I try Text.getStartPosition(), it doesn't match the
>>character count that I get from the HTML source file (yes, I counted
>>one by one myself). It's about 15 characters off. For example, the
>>character count that I got from the parser was 154, as opposed to 139
>>that I counted from the file.
>>The numbers are still off even if I include/exclude newline characters.
>>Are there some other factors that I'm not aware of?
>>
>>2. After I find the node that contains the word (string) that I'm
>>searching for, I need to get the offset of that word. For example,
>> Node text = AAA BBB CCC DDD BBB EEE
>>And, if the word that I'm searching for is the second 'BBB', is there
>>any reliable way to get the offset of that word? (I can't just get the
>>index from that string because the HTML string could be different.)
>>Please let me know.
>>
>>Thanks,
>>
>>Jay
>>
>_______________________________________________
>Htmlparser-user mailing list
>Htm...@li...
>https://lists.sourceforge.net/lists/listinfo/htmlparser-user
>