Re: [Htmlparser-user] Finding a whole word
From: Derrick O. <Der...@Ro...> - 2006-06-02 11:53:50
You'll need to manipulate the child NodeList of the parent of the
node you want to tag:
NodeList siblings = text_node_with_the_text.getParent().getChildren();
Change the text of the original node so that it holds only the text up
to the insertion point, then add the <a> and </a> nodes, followed by
another text node holding the rest of the text.
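
The split itself can be sketched without the library. This is only an
illustration under stated assumptions: each String stands in for one
sibling node's source text, and the class AnchorInsertSketch and the
method insertAnchor are hypothetical names, not part of HTMLParser. With
the real parser you would apply the same steps to the Text and Tag nodes
in the siblings NodeList.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in: each String models one sibling node's source text.
final class AnchorInsertSketch {

    // Returns a new sibling list in which the text node at nodeIndex is split
    // at offset and an empty named anchor is placed between the two halves.
    static List<String> insertAnchor(List<String> siblings, int nodeIndex,
                                     int offset, String name) {
        String text = siblings.get(nodeIndex);
        if (offset < 0 || offset > text.length()) {
            throw new IllegalArgumentException("offset outside text node: " + offset);
        }
        // Siblings before the text node are copied unchanged.
        List<String> result = new ArrayList<>(siblings.subList(0, nodeIndex));
        result.add(text.substring(0, offset));   // original node, truncated
        result.add("<a name=\"" + name + "\">"); // new start tag
        result.add("</a>");                      // new end tag
        result.add(text.substring(offset));      // new text node with the rest
        // Siblings after the text node are copied unchanged.
        result.addAll(siblings.subList(nodeIndex + 1, siblings.size()));
        return result;
    }

    public static void main(String[] args) {
        List<String> siblings = new ArrayList<>();
        siblings.add("EEE FFF GGG HHH");
        // Offset of the search word inside this one text node.
        int offset = siblings.get(0).indexOf("GGG");
        System.out.println(String.join("", insertAnchor(siblings, 0, offset, "mytag")));
    }
}
```

Run on the second paragraph of the example page, this prints
EEE FFF <a name="mytag"></a>GGG HHH, and the other siblings are left
untouched, which helps preserve the original formatting.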
Jay Kim wrote:
>Derrick,
>
>Thanks for your comments. I still have to experiment with different
>files to see what's going on with the start position.
>Assuming that I can get the correct position/offset for the specific
>word, and then store the position information, the next step is to
>create a HTML tag at that position. For example,
>
>Original source:
>
><html>
><head>
><title>test</title>
></head>
><body>
><h1>this is test</h1>
><p>AAA BBB CCC DDD
><p>EEE FFF GGG HHH
>...
></body>
></html>
>
>And, let's say the search word is "GGG", and location is identified, and
>I need to create the following HTML.
>
><html>
><head>
><title>test</title>
></head>
><body>
><h1>this is test</h1>
><p>AAA BBB CCC DDD
><p>EEE FFF <a name="mytag"></a>GGG HHH
>...
></body>
></html>
>
>I've tried StringBean to achieve this by overriding visitTag,
>visitStringNode, etc., but I don't know if that's the best way.
>Because once you know the word position, you don't have to go through
>each node using Visitor, right?
>Also, I want to preserve the original HTML format as much as possible.
>Please let me know what would be the best way to generate modified HTML
>by inserting some custom tags at the pre-selected locations.
>
>As always, thank you very much for your kind help,
>
>Jay
>
>
>-----Original Message-----
>From: htm...@li...
>[mailto:htm...@li...] On Behalf Of
>Derrick Oswald
>Sent: Thursday, June 01, 2006 4:56 AM
>To: htm...@li...
>Subject: Re: [Htmlparser-user] Finding a whole word
>
>Jay,
>
>Your count may be off because the parser may be fetching a different
>page from the one you counted.
>HTTP servers may change the page based on the user agent.
>It's only really reliable from a file, unless you save the contents of
>the page the parser is working with (see Page.getText()).
>And, yes, \r\n are turned into a single \n in the Text node, but the
>node positions don't count this.
>The Page class has getRow() and getColumn() so you can compare with the
>numbers reported by a text editor, which saves manual counting. Note
>that these are zero-based, not one-based like most editors.
>
>Your second problem is really up to you, the programmer, to remember
>which nodes the strings came from.
>The string offset is only relative to the node position, which is
>absolute on the page.
>If I were you I would create an index of node position and string
>position as you form the text in visitStringNode.
>
>Derrick
>
>Jay Kim wrote:
>
>
>
>>Hi Derrick,
>>
>>Thanks very much for your help. I've tried your sample code, and it
>>gives me the right text that I can compare with.
>>But I have a couple of issues getting the offset of the search word.
>>
>>1. When I try Text.getStartPosition(), it's not matched with the
>>character count that I get from the HTML source file - yeah, I counted
>>one by one myself. It's like 15 characters off. For example, the
>>character count that I got from the parser was 154, as opposed to 139
>>that I counted from the file.
>>The numbers are still off even if I include/exclude newline characters.
>>Are there some other factors that I'm not aware of?
>>
>>2. After I found the node that contains the word(string) that I'm
>>searching for, I need to get the offset of that word. For example,
>> Node text = AAA BBB CCC DDD BBB EEE
>>And, if the word that I'm searching for is the second 'BBB', is there
>>any reliable way to get the offset of that word? (I can't just get the
>>index from that string because the HTML string could be different).
>>Please let me know.
>>
>>Thanks,
>>
>>Jay
>>
>>
>>
>>
>>
>>
>
>
>
>_______________________________________________
>Htmlparser-user mailing list
>Htm...@li...
>https://lists.sourceforge.net/lists/listinfo/htmlparser-user
>