Thread: Re: [Htmlparser-user] Finding a whole word
From: Jay K. <jy...@eq...> - 2006-06-01 23:39:04
Derrick,

Thanks for your comments. I still have to experiment with different files to see what's going on with the start position.
Assuming that I can get the correct position/offset for the specific word, and then store the position information, the next step is to create an HTML tag at that position. For example,

Original source:

<html>
<head>
<title>test</title>
</head>
<body>
<h1>this is test</h1>
<p>AAA BBB CCC DDD
<p>EEE FFF GGG HHH
...
</body>
</html>

And, let's say the search word is "GGG", the location is identified, and I need to create the following HTML.

<html>
<head>
<title>test</title>
</head>
<body>
<h1>this is test</h1>
<p>AAA BBB CCC DDD
<p>EEE FFF <a name="mytag"></a>GGG HHH
...
</body>
</html>

I've tried StringBean to achieve this by overriding visitTag, visitStringNode, etc., but I don't know if it's the best way.
Because once you know the word position, you don't have to go through each node using Visitor, right?
Also, I want to preserve the original HTML format as much as possible.
Please let me know what would be the best way to generate modified HTML by inserting some custom tags at the pre-selected locations.

As always, thank you very much for your kind help,

Jay

-----Original Message-----
From: htm...@li... [mailto:htm...@li...] On Behalf Of Derrick Oswald
Sent: Thursday, June 01, 2006 4:56 AM
To: htm...@li...
Subject: Re: [Htmlparser-user] Finding a whole word

Jay,

Your count may be off because the parser may be fetching a different page from the one you counted.
HTTP servers may change the page based on the user agent.
It's only really reliable from a file, unless you save the contents of the page the parser is working with (see Page.getText()).
And, yes, \r\n are turned into a single \n in the Text node, but the node positions don't count this.
The Page class has getRow() and getColumn() so you can compare with the numbers reported by a text editor, which saves manual counting. Note that these are zero-based, not one-based like most editors.

Your second problem is really up to you, the programmer, to remember which nodes the strings came from.
The string offset is only relative to the node position, which is absolute on the page.
If I were you I would create an index of node position and string position as you form the text in visitStringNode.

Derrick

Jay Kim wrote:

>Hi Derrick,
>
>Thanks very much for your help. I've tried your sample code, and it
>gives me the right text that I can compare with.
>But, I have a couple of issues getting the offset of the search word.
>
>1. When I try Text.getStartPosition(), it doesn't match the
>character count that I get from the HTML source file - yeah, I counted
>one by one myself. It's like 15 characters off. For example, the
>character count that I got from the parser was 154, as opposed to 139
>that I counted from the file.
>The numbers are still off even if I include/exclude new line characters.
>Are there some other factors that I'm not aware of?
>
>2. After I find the node that contains the word (string) that I'm
>searching for, I need to get the offset of that word. For example,
>    Node text = AAA BBB CCC DDD BBB EEE
>And, if the word that I'm searching for is the second 'BBB', is there
>any reliable way to get the offset of that word? (I can't just get the
>index from that string because the HTML string could be different.)
>Please let me know.
>
>Thanks,
>
>Jay

_______________________________________________
Htmlparser-user mailing list
Htm...@li...
https://lists.sourceforge.net/lists/listinfo/htmlparser-user
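Derrick's suggestion of keeping an index of node position and string position can be sketched roughly as below. This is only an illustration in plain Java: TextOffsetIndex, add() and pageOffsetOf() are hypothetical names, not part of htmlparser. It assumes you call add() once per text node, in document order, with the node's absolute start position and its text (for example from inside visitStringNode), and that offsets inside a node match the page character for character, which does not hold where \r\n was collapsed to \n.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not an htmlparser class): maps offsets in collected text back to page offsets. */
class TextOffsetIndex {

    /** One text node: its absolute page position and where its text sits in the buffer. */
    private static final class Segment {
        final int pageStart;   // absolute node position on the page
        final int bufferStart; // offset of this node's text in the collected buffer
        final int length;      // length of this node's text

        Segment(int pageStart, int bufferStart, int length) {
            this.pageStart = pageStart;
            this.bufferStart = bufferStart;
            this.length = length;
        }
    }

    private final StringBuilder buffer = new StringBuilder();
    private final List<Segment> segments = new ArrayList<>();

    /** Record one text node; call in document order. */
    public void add(int pageStart, String text) {
        segments.add(new Segment(pageStart, buffer.length(), text.length()));
        // A separator keeps words in adjacent nodes from fusing into one word.
        buffer.append(text).append(' ');
    }

    /** Page offset of the n-th (1-based) whole-word match, or -1 if there are fewer matches. */
    public int pageOffsetOf(String word, int n) {
        Matcher m = Pattern.compile("\\b" + Pattern.quote(word) + "\\b").matcher(buffer);
        int seen = 0;
        while (m.find()) {
            if (++seen == n) {
                return toPageOffset(m.start());
            }
        }
        return -1;
    }

    /** Translate a buffer offset into a page offset using the recorded segments. */
    private int toPageOffset(int bufferOffset) {
        for (Segment s : segments) {
            if (bufferOffset >= s.bufferStart && bufferOffset < s.bufferStart + s.length) {
                return s.pageStart + (bufferOffset - s.bufferStart);
            }
        }
        return -1; // offset fell on a separator, which should not happen for a word match
    }
}
```

With the node text from the question, "AAA BBB CCC DDD BBB EEE", registered at a made-up page position of 100, asking for the second whole-word "BBB" gives 116: the node position plus the word's offset of 16 inside the node.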
From: Jay K. <jy...@eq...> - 2006-06-02 22:19:36
Derrick,

I ran into another issue while finding the location of the specific word.
It happened when I tested with a table. For example, here is the source of the sample HTML:

<HTML>
<head>
<title>Test HTML </title>
</head>
<body>
<table border=1>
<tr>
<td>AAA</td>
<td>BBB</td>
<td>CCC</td>
</tr>
<tr>
<td>BBB</td>
<td>CCC</td>
<td>DDD</td>
</tr>
<tr>
<td>AAA</td>
<td>BBB</td>
<td>CCC</td>
</tr>
</table>
</body>
</HTML>

And, if I load it in a browser, it'll look like this (with borders):

AAA BBB CCC
BBB CCC DDD
AAA BBB CCC

So, if I select 'BBB' in (row[2], col[1]) in IE, and get the word count, it'll return 2 because it counts 'BBB' in (row[1], col[2]) first.
But htmlparser traverses the nodes differently - it seems like it detects 'BBB' in (row[2], col[1]) first, before it detects the one in row[1].

Is there any way to configure the parser to look into the first row first (or, top-down on the view)?

Please let me know if anything is not clear to you.

Thanks,

Jay
From: Ian M. <ian...@gm...> - 2006-06-06 01:46:17
NodeTreeWalker lets you choose depth-first or breadth-first iteration, but looking at the code, off the top of my head, parsing that HTML should lead to row 1 being reached first in both cases.

Ian
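Ian's point about depth-first versus breadth-first iteration can be checked on a toy model of the sample table. The sketch below does not use htmlparser or NodeTreeWalker; SimpleNode and TraversalOrder are hypothetical stand-ins, and the tree is built by hand, without the whitespace text nodes a real parse would contain. It only shows that when all cell text sits at the same depth, both orders list the cells of row 1 before those of row 2.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Hypothetical stand-in for a parsed node; not the htmlparser Node interface. */
class SimpleNode {
    final String label;              // tag name, or the text itself for text nodes
    final boolean text;              // true for text nodes
    final List<SimpleNode> children; // empty for text nodes

    private SimpleNode(String label, boolean text, SimpleNode... children) {
        this.label = label;
        this.text = text;
        this.children = Arrays.asList(children);
    }

    static SimpleNode tag(String name, SimpleNode... children) {
        return new SimpleNode(name, false, children);
    }

    static SimpleNode text(String value) {
        return new SimpleNode(value, true);
    }
}

class TraversalOrder {

    /** Text nodes in depth-first (document) order. */
    public static List<String> depthFirstText(SimpleNode root) {
        List<String> out = new ArrayList<>();
        collect(root, out);
        return out;
    }

    private static void collect(SimpleNode node, List<String> out) {
        if (node.text) {
            out.add(node.label);
        }
        for (SimpleNode child : node.children) {
            collect(child, out);
        }
    }

    /** Text nodes in breadth-first (level by level) order. */
    public static List<String> breadthFirstText(SimpleNode root) {
        List<String> out = new ArrayList<>();
        Deque<SimpleNode> queue = new ArrayDeque<>();
        queue.add(root);
        while (!queue.isEmpty()) {
            SimpleNode node = queue.remove();
            if (node.text) {
                out.add(node.label);
            }
            queue.addAll(node.children);
        }
        return out;
    }

    /** One table row whose cells each hold a single text node. */
    private static SimpleNode row(String a, String b, String c) {
        return SimpleNode.tag("tr",
                SimpleNode.tag("td", SimpleNode.text(a)),
                SimpleNode.tag("td", SimpleNode.text(b)),
                SimpleNode.tag("td", SimpleNode.text(c)));
    }

    /** The 3x3 table from the question, built by hand. */
    public static SimpleNode sampleTable() {
        return SimpleNode.tag("table",
                row("AAA", "BBB", "CCC"),
                row("BBB", "CCC", "DDD"),
                row("AAA", "BBB", "CCC"));
    }
}
```

In this model the 'BBB' in row 1 is the second text node and the 'BBB' in row 2 is the fourth, in either order. The two orders would only diverge if some text sat at a different depth, for instance inside extra nested markup in one cell.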
From: Derrick O. <Der...@Ro...> - 2006-06-02 11:53:50
You'll need to manipulate the children() NodeList of the parent of the node you want to tag:

    NodeList siblings = text_node_with_the_text.getParent().getChildren();

You'll need to change the text of the original node to have only the text up to the insertion, then add the <a> and </a> nodes and another text node with the rest of the text.
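The split Derrick describes (truncate the original text node, then add the <a> and </a> nodes and a new text node holding the remainder) can be sketched on a simplified model. In the sketch below each sibling is just the serialized string of one node; AnchorInserter and insertAnchor are hypothetical names, and the real htmlparser calls for editing a NodeList are deliberately not shown because they are not given in this thread.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model: every list element is the serialized form of one sibling node. */
class AnchorInserter {

    /**
     * Splits the text node at siblings[index] at the given offset and puts an
     * empty named anchor between the two halves. Returns a new list; the input
     * list is left unchanged.
     */
    public static List<String> insertAnchor(List<String> siblings, int index, int offset, String name) {
        String text = siblings.get(index);
        if (offset < 0 || offset > text.length()) {
            throw new IllegalArgumentException("offset outside text node: " + offset);
        }
        List<String> result = new ArrayList<>(siblings.subList(0, index));
        result.add(text.substring(0, offset));     // original node, cut at the insertion point
        result.add("<a name=\"" + name + "\">");   // new opening tag node
        result.add("</a>");                        // new closing tag node
        result.add(text.substring(offset));        // new text node with the rest of the text
        result.addAll(siblings.subList(index + 1, siblings.size()));
        return result;
    }
}
```

For the example in the thread, splitting "EEE FFF GGG HHH" at offset 8 (the start of "GGG") with the name "mytag" yields four nodes that concatenate to EEE FFF <a name="mytag"></a>GGG HHH. Everything outside the split node is copied over untouched, which is what keeps the rest of the original markup as it was.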