Re: [Htmlparser-user] Finding a whole word
From: Derrick O. <Der...@Ro...> - 2006-06-02 11:53:50
You'll need to manipulate the children() NodeList of the parent of the node you want to tag:

    NodeList siblings = text_node_with_the_text.getParent().getChildren();

You'll need to change the text of the original node so that it holds only the text up to the insertion point, then add the <a> and </a> nodes and another text node with the rest of the text.

Jay Kim wrote:

>Derrick,
>
>Thanks for your comments. I still have to experiment with different
>files to see what's going on with the start position.
>Assuming that I can get the correct position/offset for the specific
>word, and then store the position information, the next step is to
>create an HTML tag at that position. For example,
>
>Original source:
>
><html>
><head>
><title>test</title>
></head>
><body>
><h1>this is test</h1>
><p>AAA BBB CCC DDD
><p>EEE FFF GGG HHH
>...
></body>
></html>
>
>And, let's say the search word is "GGG", and its location is identified,
>and I need to create the following HTML.
>
><html>
><head>
><title>test</title>
></head>
><body>
><h1>this is test</h1>
><p>AAA BBB CCC DDD
><p>EEE FFF <a name="mytag"></a>GGG HHH
>...
></body>
></html>
>
>I've tried StringBean to achieve this by overriding visitTag,
>visitStringNode, etc., but I don't know if it's the best way.
>Because once you know the word position, you don't have to go through
>each node using Visitor, right?
>Also, I want to preserve the original HTML format as much as possible.
>Please let me know what would be the best way to generate modified HTML
>by inserting some custom tags at the pre-selected locations.
>
>As always, thank you very much for your kind help,
>
>Jay
>
>
>-----Original Message-----
>From: htm...@li...
>[mailto:htm...@li...] On Behalf Of
>Derrick Oswald
>Sent: Thursday, June 01, 2006 4:56 AM
>To: htm...@li...
>Subject: Re: [Htmlparser-user] Finding a whole word
>
>Jay,
>
>Your count may be off because the parser may be fetching a different
>page from the one you counted.
>HTTP servers may change the page based on the user agent.
>It's only really reliable from a file, unless you save the contents of
>the page the parser is working with (see Page.getText()).
>And, yes, \r\n are turned into a single \n in the Text node, but the
>node positions don't count this.
>The Page class has getRow() and getColumn() so you can compare with the
>numbers reported by a text editor, which saves manual counting. Note
>that these are zero-based, not one-based like most editors.
>
>Your second problem is really up to you, the programmer, to remember
>which nodes the strings came from.
>The string offset is only relative to the node position, which is
>absolute on the page.
>If I were you I would create an index of node position and string
>position as you form the text in visitStringNode.
>
>Derrick
>
>Jay Kim wrote:
>
>>Hi Derrick,
>>
>>Thanks very much for your help. I've tried your sample code, and it
>>gives me the right text that I can compare with.
>>But I have a couple of issues getting the offset of the search word.
>>
>>1. When I try Text.getStartPosition(), it doesn't match the
>>character count that I get from the HTML source file - yeah, I counted
>>one by one myself. It's like 15 characters off. For example, the
>>character count that I got from the parser was 154, as opposed to 139
>>that I counted from the file.
>>The numbers are still off even if I include/exclude new line characters.
>>Are there some other factors that I'm not aware of?
>>
>>2. After I find the node that contains the word (string) that I'm
>>searching for, I need to get the offset of that word. For example,
>> Node text = AAA BBB CCC DDD BBB EEE
>>And, if the word that I'm searching for is the second 'BBB', is there
>>any reliable way to get the offset of that word? (I can't just get the
>>index from that string because the HTML string could be different.)
>>Please let me know.
>>
>>Thanks,
>>
>>Jay
>
>_______________________________________________
>Htmlparser-user mailing list
>Htm...@li...
>https://lists.sourceforge.net/lists/listinfo/htmlparser-user
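
For anyone reading this thread later, here is a rough sketch of the splitting step described in the reply at the top of this message: cut the text at the insertion point, then place the <a> node, the </a> node, and a text node holding the remainder after it. It is only an illustration and does not call the HTMLParser library. A plain List<String> is a hypothetical stand-in for the siblings NodeList, each entry being the source text of one node, and the class name and both helper methods (insertAnchor, indexOfWholeWord) are made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: List<String> stands in for the parser's sibling list.
final class AnchorInsertSketch {

    // Returns the offset of the first occurrence of word in text that is not
    // part of a longer run of letters or digits, or -1 if there is none.
    static int indexOfWholeWord(String text, String word) {
        int from = 0;
        while (true) {
            int at = text.indexOf(word, from);
            if (at < 0) {
                return -1;
            }
            int end = at + word.length();
            boolean startOk = at == 0 || !Character.isLetterOrDigit(text.charAt(at - 1));
            boolean endOk = end == text.length() || !Character.isLetterOrDigit(text.charAt(end));
            if (startOk && endOk) {
                return at;
            }
            from = at + 1; // keep scanning past this partial match
        }
    }

    // Splits siblings.get(index) at offset and inserts an empty named anchor
    // there: head text, <a name="...">, </a>, tail text. Returns a new list.
    static List<String> insertAnchor(List<String> siblings, int index, int offset, String name) {
        String text = siblings.get(index);
        if (offset < 0 || offset > text.length()) {
            throw new IllegalArgumentException("offset outside the text node");
        }
        List<String> result = new ArrayList<>(siblings.subList(0, index));
        result.add(text.substring(0, offset));   // original node, truncated
        result.add("<a name=\"" + name + "\">"); // new start tag
        result.add("</a>");                      // new end tag
        result.add(text.substring(offset));      // new text node with the rest
        result.addAll(siblings.subList(index + 1, siblings.size()));
        return result;
    }

    public static void main(String[] args) {
        List<String> siblings = Arrays.asList("EEE FFF GGG HHH");
        int offset = indexOfWholeWord(siblings.get(0), "GGG");
        List<String> updated = insertAnchor(siblings, 0, offset, "mytag");
        System.out.println(String.join("", updated));
    }
}
```

With the real library the same four pieces would be nodes in the list returned by getParent().getChildren() rather than strings. The offset used here is relative to the start of the text node, which matches the advice in the quoted message that string offsets are relative to the node position.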