Re: [Htmlparser-user] Finding a whole word
From: Ian M. <ian...@gm...> - 2006-06-06 01:46:17
NodeTreeWalker lets you choose depth-first or breadth-first iteration, but
looking at the code, off the top of my head, parsing that code should lead
to row 1 being reached first in both situations.

Ian

On 6/2/06, Jay Kim <jy...@eq...> wrote:
>
> Derrick,
>
> I ran into another issue while finding the location of the specific
> word.
> It happened when I tested with a table. For example, here is the source
> of sample HTML:
>
> <HTML>
> <head>
> <title>Test HTML </title>
> </head>
> <body>
> <table border=1>
> <tr>
> <td>AAA</td>
> <td>BBB</td>
> <td>CCC</td>
> </tr>
> <tr>
> <td>BBB</td>
> <td>CCC</td>
> <td>DDD</td>
> </tr>
> <tr>
> <td>AAA</td>
> <td>BBB</td>
> <td>CCC</td>
> </tr>
> </table>
> </body>
> </HTML>
>
> And, if I load it in a browser, it'll look like this (with borders):
>
> AAA BBB CCC
> BBB CCC DDD
> AAA BBB CCC
>
> So, if I select 'BBB' in (row[2], col [1]) on IE, and get the word
> count, it'll return 2 because it counts 'BBB' in (row[1], col[2]) first.
> But the htmlparser traverses nodes differently - it seems like it
> detects 'BBB' in (row[2], col [1]) first before it detects the one in
> row[1].
>
> Is there any way to configure the parser to look into the first row
> first (or, top-down on the view)?
>
> Please let me know if anything is not clear to you.
>
> Thanks,
>
> Jay
>
> -----Original Message-----
> From: htm...@li...
> [mailto:htm...@li...] On Behalf Of
> Derrick Oswald
> Sent: Friday, June 02, 2006 4:53 AM
> To: This is the user list of htmlparser
> Subject: Re: [Htmlparser-user] Finding a whole word
>
> You'll need to manipulate the children() NodeList of the parent of the
> node you want to tag:
>     NodeList siblings =
>         text_node_with_the_text.getParent().getChildren();
> You'll need to change the text of the original node to have only the
> text up to the insertion, then add the <a> and </a> nodes and another
> text node with the rest of the text.
>
> Jay Kim wrote:
>
> >Derrick,
> >
> >Thanks for your comments.
> >I still have to experiment with different
> >files to see what's going on with the start position.
> >Assuming that I can get the correct position/offset for the specific
> >word, and then store the position information, the next step is to
> >create an HTML tag at that position. For example,
> >
> >Original source:
> >
> ><html>
> ><head>
> ><title>test</title>
> ></head>
> ><body>
> ><h1>this is test</h1>
> ><p>AAA BBB CCC DDD
> ><p>EEE FFF GGG HHH
> >...
> ></body>
> ></html>
> >
> >And, let's say the search word is "GGG", and location is identified, and
> >I need to create the following HTML.
> >
> ><html>
> ><head>
> ><title>test</title>
> ></head>
> ><body>
> ><h1>this is test</h1>
> ><p>AAA BBB CCC DDD
> ><p>EEE FFF <a name="mytag"></a>GGG HHH
> >...
> ></body>
> ></html>
> >
> >I've tried StringBean to achieve this by overriding visitTag,
> >visitStringNode, etc., but I don't know if it's the best way.
> >Because once you know the word position, you don't have to go through
> >each node using Visitor, right?
> >Also, I want to preserve the original HTML format as much as possible.
> >Please let me know what would be the best way to generate modified HTML
> >by inserting some custom tags at the pre-selected locations.
> >
> >As always, thank you very much for your kind help,
> >
> >Jay
> >
> >
> >-----Original Message-----
> >From: htm...@li...
> >[mailto:htm...@li...] On Behalf Of
> >Derrick Oswald
> >Sent: Thursday, June 01, 2006 4:56 AM
> >To: htm...@li...
> >Subject: Re: [Htmlparser-user] Finding a whole word
> >
> >Jay,
> >
> >Your count may be off because the parser may be fetching a different
> >page from the one you counted.
> >HTTP servers may change the page based on the user agent.
> >It's only really reliable from a file, unless you save the contents of
> >the page the parser is working with (see Page.getText()).
> >And, yes, \r\n are turned into a single \n in the Text node, but the
> >node positions don't count this.
> >The Page class has getRow() and getColumn() so you can compare with the
> >numbers reported by a text editor, which saves manual counting. Note
> >that these are zero-based, not one-based like most editors.
> >
> >Your second problem is really up to you, the programmer, to remember
> >which nodes the strings came from.
> >The string offset is only relative to the node position, which is
> >absolute on the page.
> >If I were you I would create an index of node position and string
> >position as you form the text in visitStringNode.
> >
> >Derrick
> >
> >Jay Kim wrote:
> >
> >>Hi Derrick,
> >>
> >>Thanks very much for your help. I've tried your sample code, and it
> >>gives me the right text that I can compare with.
> >>But, I have a couple of issues to get the offset of the searching word.
> >>
> >>1. When I try Text.getStartPosition(), it's not matched with the
> >>character count that I get from the HTML source file - yeah, I counted
> >>one by one myself. It's like 15 characters off. For example, the
> >>character count that I got from the parser was 154, as opposed to 139
> >>that I counted from the file.
> >>The numbers are still off even if I include/exclude new line
> >>characters.
> >>Are there some other factors that I'm not aware of?
> >>
> >>2. After I found the node that contains the word(string) that I'm
> >>searching for, I need to get the offset of that word. For example,
> >>  Node text = AAA BBB CCC DDD BBB EEE
> >>And, if the word that I'm searching for is the second 'BBB', is there
> >>any reliable way to get the offset of that word? (I can't just get the
> >>index from that string because HTML string could be different).
> >>Please let me know.
> >>
> >>Thanks,
> >>
> >>Jay
> >>
> >
> >_______________________________________________
> >Htmlparser-user mailing list
> >Htm...@li...
> >https://lists.sourceforge.net/lists/listinfo/htmlparser-user
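
For anyone following the thread later: Derrick's suggestion to keep "an index
of node position and string position" while the text is being collected could
look roughly like the sketch below. It is plain Java with no htmlparser calls,
so it only illustrates the bookkeeping. The class name TextOffsetIndex and its
methods are made up for this example, and the node positions (100 and 140) are
invented values standing in for whatever Text.getStartPosition() would report.
It assumes append() is called once per text node, in visiting order, and it
puts a newline between nodes so that words from neighbouring nodes do not run
together.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: maps offsets in the collected text back to the text node they came from. */
final class TextOffsetIndex {
    /** One text node: its absolute page position and where its text sits in the combined string. */
    private static final class Entry {
        final int nodeStart;  // assumed to come from Text.getStartPosition()
        final int textStart;  // offset of this node's text in the combined string
        final int length;     // length of this node's text

        Entry(int nodeStart, int textStart, int length) {
            this.nodeStart = nodeStart;
            this.textStart = textStart;
            this.length = length;
        }
    }

    private final List<Entry> entries = new ArrayList<>();
    private final StringBuilder combined = new StringBuilder();

    /** Record one text node; call in the order the nodes are visited. */
    void append(int nodeStart, String text) {
        entries.add(new Entry(nodeStart, combined.length(), text.length()));
        combined.append(text).append('\n'); // separator belongs to no node
    }

    /** Offset in the combined text of the n-th (1-based) whole-word match, or -1 if there is none. */
    int findOccurrence(String word, int n) {
        if (word.isEmpty() || n < 1) {
            return -1;
        }
        String all = combined.toString();
        int seen = 0;
        for (int at = all.indexOf(word); at >= 0; at = all.indexOf(word, at + 1)) {
            int end = at + word.length();
            boolean startOk = at == 0 || !Character.isLetterOrDigit(all.charAt(at - 1));
            boolean endOk = end == all.length() || !Character.isLetterOrDigit(all.charAt(end));
            if (startOk && endOk && ++seen == n) {
                return at;
            }
        }
        return -1;
    }

    /** Returns {nodeStart, offsetWithinNode} for a combined-text offset, or null if it hits a separator. */
    int[] locate(int offset) {
        for (Entry e : entries) {
            if (offset >= e.textStart && offset < e.textStart + e.length) {
                return new int[] { e.nodeStart, offset - e.textStart };
            }
        }
        return null;
    }

    public static void main(String[] args) {
        TextOffsetIndex index = new TextOffsetIndex();
        index.append(100, "AAA BBB CCC"); // 100 and 140 are invented node positions
        index.append(140, "DDD BBB EEE");
        int[] hit = index.locate(index.findOccurrence("BBB", 2));
        System.out.println("node at " + hit[0] + ", offset in node " + hit[1]);
    }
}
```

With the two sample nodes above, the second "BBB" resolves to the node at
position 140 with an offset of 4 inside that node. As noted earlier in the
thread, \r\n is collapsed to \n in the Text node, so the in-node offset may
not line up exactly with the raw page offset when the node spans a line break.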