Re: [Htmlparser-user] Finding a whole word
From: Jay K. <jy...@eq...> - 2006-06-02 22:19:36
Derrick,

I ran into another issue while finding the location of the specific word.
It happened when I tested with a table.
For example, here is the source of sample HTML:

<HTML>
<head>
<title>Test HTML </title>
</head>
<body>
<table border=1>
<tr>
<td>AAA</td>
<td>BBB</td>
<td>CCC</td>
</tr>
<tr>
<td>BBB</td>
<td>CCC</td>
<td>DDD</td>
</tr>
<tr>
<td>AAA</td>
<td>BBB</td>
<td>CCC</td>
</tr>
</table>
</body>
</HTML>

And, if I load it in a browser, it'll look like this (with borders):

AAA BBB CCC
BBB CCC DDD
AAA BBB CCC

So, if I select 'BBB' in (row[2], col [1]) on IE, and get the word count,
it'll return 2 because it counts 'BBB' in (row[1], col[2]) first.
But, the htmlparser traverses nodes differently - it seems like it detects
'BBB' in (row[2], col [1]) first before it detects the one in row[1].
Is there any way to configure the parser to look into the first row first
(or, top-down on the view)?
Please let me know if anything is not clear to you.

Thanks,

Jay

-----Original Message-----
From: htm...@li... [mailto:htm...@li...] On Behalf Of Derrick Oswald
Sent: Friday, June 02, 2006 4:53 AM
To: This is the user list of htmlparser
Subject: Re: [Htmlparser-user] Finding a whole word

You'll need to manipulate the children() NodeList of the parent of the
node you want to tag:

NodeList siblings = text_node_with_the_text.getParent().getChildren();

You'll need to change the text of the original node to have only the
text up to the insertion, then add the <a> and </a> nodes and another
text node with the rest of the text.

Jay Kim wrote:

>Derrick,
>
>Thanks for your comments. I still have to experiment with different
>files to see what's going on with the start position.
>Assuming that I can get the correct position/offset for the specific
>word, and then store the position information, the next step is to
>create a HTML tag at that position.
>For example,
>
>Original source:
>
><html>
><head>
><title>test</title>
></head>
><body>
><h1>this is test</h1>
><p>AAA BBB CCC DDD
><p>EEE FFF GGG HHH
>...
></body>
></html>
>
>And, let's say the search word is "GGG", and location is identified, and
>I need to create the following HTML.
>
><html>
><head>
><title>test</title>
></head>
><body>
><h1>this is test</h1>
><p>AAA BBB CCC DDD
><p>EEE FFF <a name="mytag"></a>GGG HHH
>...
></body>
></html>
>
>I've tried StringBean to achieve this by overriding visitTag,
>visitStringNode, and etc., but I don't know if it's the best way.
>Because once you know the word position, you don't have to go through
>each node using Visitor, right?
>Also, I want to preserve the original HTML format as much as possible.
>Please let me know what would be the best way to generate modified HTML
>by inserting some custom tags at the pre-selected locations.
>
>As always, thank you very much for your kind help,
>
>Jay
>
>-----Original Message-----
>From: htm...@li...
>[mailto:htm...@li...] On Behalf Of
>Derrick Oswald
>Sent: Thursday, June 01, 2006 4:56 AM
>To: htm...@li...
>Subject: Re: [Htmlparser-user] Finding a whole word
>
>Jay,
>
>Your count may be off because the parser may be fetching a different
>page from the one you counted.
>HTTP servers may change the page based on the user agent.
>It's only really reliable from a file, unless you save the contents of
>the page the parser is working with (see Page.getText()).
>And, yes, \r\n are turned into a single \n in the Text node, but the
>node positions don't count this.
>The Page class has getRow() and getColumn() so you can compare with the
>numbers reported by a text editor, which saves manual counting. Note
>that these are zero-based, not one-based like most editors.
>
>Your second problem is really up to you, the programmer, to remember
>which nodes the strings came from.
>The string offset is only relative to the node position, which is
>absolute on the page.
>If I were you I would create an index of node position and string
>position as you form the text in visitStringNode.
>
>Derrick
>
>Jay Kim wrote:
>
>>Hi Derrick,
>>
>>Thanks very much for your help. I've tried your sample code, and it
>>gives me the right text that I can compare with.
>>But, I have a couple of issues getting the offset of the search word.
>>
>>1. When I try Text.getStartPosition(), it doesn't match the
>>character count that I get from the HTML source file - yeah, I counted
>>one by one myself. It's like 15 characters off. For example, the
>>character count that I got from the parser was 154, as opposed to 139
>>that I counted from the file.
>>The numbers are still off even if I include/exclude new line
>>characters.
>>Are there some other factors that I'm not aware of?
>>
>>2. After I found the node that contains the word (string) that I'm
>>searching for, I need to get the offset of that word. For example,
>>  Node text = AAA BBB CCC DDD BBB EEE
>>And, if the word that I'm searching for is the second 'BBB', is there
>>any reliable way to get the offset of that word? (I can't just get the
>>index from that string because HTML string could be different).
>>Please let me know.
>>
>>Thanks,
>>
>>Jay

_______________________________________________
Htmlparser-user mailing list
Htm...@li...
https://lists.sourceforge.net/lists/listinfo/htmlparser-user
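
Editor's sketch, not from the thread: the index Derrick describes (node position plus string position, recorded as the text is collected) might look roughly like the following. It uses only the Java standard library. WordLocator, Hit, add, find and insertAnchor are hypothetical names, not htmlparser API. It assumes the caller passes each text node's absolute start position and its text in document order, and that the node text is character-for-character identical to the page source (no collapsed \r\n, no decoded entities); otherwise the computed page offset will drift.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of htmlparser): indexes text nodes by page position. */
final class WordLocator {
    /** Where a match was found: which node, offset inside it, absolute page offset. */
    static final class Hit {
        final int nodeIndex;
        final int offsetInNode;
        final int pageOffset;

        Hit(int nodeIndex, int offsetInNode, int pageOffset) {
            this.nodeIndex = nodeIndex;
            this.offsetInNode = offsetInNode;
            this.pageOffset = pageOffset;
        }
    }

    private final List<Integer> nodeStarts = new ArrayList<>();
    private final List<String> nodeTexts = new ArrayList<>();

    /** Record one text node; call in document order, e.g. while visiting text nodes. */
    void add(int nodeStart, String text) {
        nodeStarts.add(nodeStart);
        nodeTexts.add(text);
    }

    /** Returns the n-th (1-based) whole-word occurrence in document order, or null. */
    Hit find(String word, int n) {
        // \b on both sides so "GG" does not match inside "GGG".
        Pattern pattern = Pattern.compile("\\b" + Pattern.quote(word) + "\\b");
        int seen = 0;
        for (int i = 0; i < nodeTexts.size(); i++) {
            Matcher matcher = pattern.matcher(nodeTexts.get(i));
            while (matcher.find()) {
                seen++;
                if (seen == n) {
                    // Assumption: node text equals the source text at nodeStart.
                    return new Hit(i, matcher.start(), nodeStarts.get(i) + matcher.start());
                }
            }
        }
        return null;
    }

    /** Inserts an empty named anchor into the raw source just before the match. */
    static String insertAnchor(String source, Hit hit, String name) {
        String anchor = "<a name=\"" + name + "\"></a>";
        return source.substring(0, hit.pageOffset) + anchor + source.substring(hit.pageOffset);
    }

    public static void main(String[] args) {
        String source = "<p>AAA BBB CCC DDD\n<p>EEE FFF GGG HHH\n";
        WordLocator locator = new WordLocator();
        locator.add(3, "AAA BBB CCC DDD\n");   // text after the first <p>
        locator.add(22, "EEE FFF GGG HHH\n");  // text after the second <p>
        Hit hit = locator.find("GGG", 1);
        System.out.println(insertAnchor(source, hit, "mytag"));
    }
}
```

With the two-paragraph sample from Jay's earlier message, the second line of the result reads `<p>EEE FFF <a name="mytag"></a>GGG HHH`, and the rest of the source is untouched. Editing the raw source string is a swapped-in alternative to the NodeList manipulation Derrick describes; it is shown here only because it keeps the original formatting byte-for-byte outside the insertion point.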