Re: [Htmlparser-user] Finding a whole word

NodeTreeWalker lets you choose depth-first or breadth-first iteration,
but looking at the code, off the top of my head, parsing that HTML
should lead to row 1 being reached first in both situations.

Ian
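
A minimal sketch of why the traversal mode should not matter for this table. It uses a small stand-in tree, not the htmlparser classes; the names N, row, sampleTable, depthFirst and breadthFirst are made up for illustration. Every text node in the sample sits at the same depth (table, tr, td, text), so both orders visit the cell text row by row. The two orders can differ when text nodes sit at different depths.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Stand-in for a parsed node tree. This is NOT the htmlparser API.
class TraversalOrderSketch {
    static final class N {
        final String label;                 // tag name, or the text itself for a leaf
        final List<N> kids = new ArrayList<>();
        N(String label, N... children) {
            this.label = label;
            for (N c : children) kids.add(c);
        }
        boolean isText() { return kids.isEmpty(); }
    }

    static N row(String a, String b, String c) {
        return new N("tr", new N("td", new N(a)), new N("td", new N(b)), new N("td", new N(c)));
    }

    // The three-row table from the message quoted below.
    static N sampleTable() {
        return new N("table", row("AAA", "BBB", "CCC"), row("BBB", "CCC", "DDD"), row("AAA", "BBB", "CCC"));
    }

    // Pre-order (depth-first) walk, collecting text leaves in visit order.
    static List<String> depthFirst(N root) {
        List<String> out = new ArrayList<>();
        collect(root, out);
        return out;
    }

    private static void collect(N node, List<String> out) {
        if (node.isText()) out.add(node.label);
        for (N k : node.kids) collect(k, out);
    }

    // Level-by-level (breadth-first) walk, collecting text leaves in visit order.
    static List<String> breadthFirst(N root) {
        List<String> out = new ArrayList<>();
        Deque<N> queue = new ArrayDeque<>();
        queue.add(root);
        while (!queue.isEmpty()) {
            N node = queue.remove();
            if (node.isText()) out.add(node.label);
            queue.addAll(node.kids);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(depthFirst(sampleTable()));
        System.out.println(breadthFirst(sampleTable()));
    }
}
```

If the real parser reports the second row first, the cause is more likely to be in how the matches are collected than in the traversal mode, but that is a guess without seeing the code.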

On 6/2/06, Jay Kim <jy...@eq...> wrote:
>
> Derrick,
>
> I ran into another issue while finding the location of the specific
> word.
> It happened when I tested with a table. For example, here is the source
> of sample HTML:
>
> <HTML>
> <head>
> <title>Test HTML </title>
> </head>
> <body>
> <table border=1>
> <tr>
>         <td>AAA</td>
>         <td>BBB</td>
>         <td>CCC</td>
> </tr>
> <tr>
>         <td>BBB</td>
>         <td>CCC</td>
>         <td>DDD</td>
> </tr>
> <tr>
>         <td>AAA</td>
>         <td>BBB</td>
>         <td>CCC</td>
> </tr>
> </table>
> </body>
> </HTML>
>
> And, if I load it in a browser, it'll look like this (with borders):
>
> AAA BBB CCC
> BBB CCC DDD
> AAA BBB CCC
>
> So, if I select 'BBB' in (row[2], col[1]) in IE and get the word
> count, it returns 2 because it counts 'BBB' in (row[1], col[2]) first.
> But htmlparser traverses the nodes differently: it seems to
> detect 'BBB' in (row[2], col[1]) first, before it detects the one in
> row[1].
>
> Is there any way to configure the parser to look into the first row
> first (or, top-down on the view)?
>
> Please let me know if anything is not clear to you.
>
> Thanks,
>
> Jay
>
> -----Original Message-----
> From: htm...@li...
> [mailto:htm...@li...] On Behalf Of
> Derrick Oswald
> Sent: Friday, June 02, 2006 4:53 AM
> To: This is the user list of htmlparser
> Subject: Re: [Htmlparser-user] Finding a whole word
>
> You'll need to manipulate the children() NodeList of the parent of the
> node you want to tag:
>     NodeList siblings =
> text_node_with_the_text.getParent().getChildren();
> You'll need to change the text of the original node to have only the
> text up to the insertion, then add the <a> and </a> nodes and another
> text node with the rest of the text.
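
A minimal sketch of the split described above, under loud assumptions: the sibling list is modelled as a plain List<String> rather than a htmlparser NodeList, and insertAnchor and render are hypothetical helper names. It only illustrates the order of operations: truncate the original text, then add the opening tag, the closing tag, and a new text piece holding the remainder.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Models the parent's children as strings. This is NOT the htmlparser API.
class AnchorInsertSketch {
    // Splits siblings.get(index) at offset and inserts <a name="..."></a> in between.
    static void insertAnchor(List<String> siblings, int index, int offset, String name) {
        String text = siblings.get(index);
        if (offset < 0 || offset > text.length()) {
            throw new IllegalArgumentException("offset outside text: " + offset);
        }
        siblings.set(index, text.substring(0, offset));         // text up to the insertion point
        siblings.add(index + 1, "<a name=\"" + name + "\">");   // opening tag
        siblings.add(index + 2, "</a>");                        // closing tag
        siblings.add(index + 3, text.substring(offset));        // rest of the text
    }

    // Concatenates the siblings back into markup.
    static String render(List<String> siblings) {
        return String.join("", siblings);
    }

    public static void main(String[] args) {
        List<String> siblings = new ArrayList<>(Arrays.asList("EEE FFF GGG HHH"));
        insertAnchor(siblings, 0, siblings.get(0).indexOf("GGG"), "mytag");
        System.out.println(render(siblings));
    }
}
```

Because only the one text piece is touched, the rest of the markup is left as it was, which fits the goal of preserving the original HTML.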
>
> Jay Kim wrote:
>
> >Derrick,
> >
> >Thanks for your comments. I still have to experiment with different
> >files to see what's going on with the start position.
> >Assuming that I can get the correct position/offset for the specific
> >word, and then store the position information, the next step is to
> >create a HTML tag at that position. For example,
> >
> >Original source:
> >
> ><html>
> ><head>
> ><title>test</title>
> ></head>
> ><body>
> ><h1>this is test</h1>
> ><p>AAA BBB CCC DDD
> ><p>EEE FFF GGG HHH
> >...
> ></body>
> ></html>
> >
> >And, let's say the search word is "GGG", and its location is identified,
> >and I need to create the following HTML.
> >
> ><html>
> ><head>
> ><title>test</title>
> ></head>
> ><body>
> ><h1>this is test</h1>
> ><p>AAA BBB CCC DDD
> ><p>EEE FFF <a name="mytag"></a>GGG HHH
> >...
> ></body>
> ></html>
> >
> >I've tried StringBean to achieve this by overriding visitTag,
> >visitStringNode, etc., but I don't know if it's the best way.
> >Because once you know the word position, you don't have to go through
> >each node using Visitor, right?
> >Also, I want to preserve the original HTML format as much as possible.
> >Please let me know what would be the best way to generate modified HTML
> >by inserting some custom tags at the pre-selected locations.
> >
> >As always, thank you very much for your kind help,
> >
> >Jay
> >
> >
> >-----Original Message-----
> >From: htm...@li...
> >[mailto:htm...@li...] On Behalf Of
> >Derrick Oswald
> >Sent: Thursday, June 01, 2006 4:56 AM
> >To: htm...@li...
> >Subject: Re: [Htmlparser-user] Finding a whole word
> >
> >Jay,
> >
> >Your count may be off because the parser may be fetching a different
> >page from the one you counted.
> >HTTP servers may change the page based on the user agent.
> >It's only really reliable from a file, unless you save the contents of
> >the page the parser is working with (see Page.getText()).
> >And, yes, \r\n are turned into a single \n in the Text node, but the
> >node positions don't count this.
> >The Page class has getRow() and getColumn() so you can compare with the
> >numbers reported by a text editor, which saves manual counting. Note
> >that these are zero-based, not one-based like most editors.
> >
> >Your second problem is really up to you, the programmer, to remember
> >which nodes the strings came from.
> >The string offset is only relative to the node position, which is
> >absolute on the page.
> >If I were you I would create an index of node position and string
> >position as you form the text in visitStringNode.
> >
> >Derrick
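
A minimal sketch of the index suggested above, assuming the caller feeds it each text node's absolute start position and text in the order the nodes are visited. OffsetIndexSketch, addTextNode and toPagePosition are hypothetical names, not part of htmlparser. One caveat that follows from the \r\n note above: if a node's text contains collapsed line breaks, the position computed for anything after the first break in that node can drift from the true page position.

```java
import java.util.ArrayList;
import java.util.List;

// Maps an offset in the concatenated text back to a position on the page.
class OffsetIndexSketch {
    private final StringBuilder combined = new StringBuilder();
    // Each entry: { offset in combined text, node start position on the page }.
    private final List<int[]> entries = new ArrayList<>();

    // Call once per text node, in visit order (e.g. from a visitor callback).
    void addTextNode(int nodeStartOnPage, String text) {
        entries.add(new int[] { combined.length(), nodeStartOnPage });
        combined.append(text);
    }

    String text() {
        return combined.toString();
    }

    // Finds the node that owns combinedOffset and adds the offset within that node.
    int toPagePosition(int combinedOffset) {
        if (combinedOffset < 0 || combinedOffset >= combined.length()) {
            throw new IndexOutOfBoundsException("offset " + combinedOffset);
        }
        int[] owner = null;
        for (int[] e : entries) {
            if (e[0] <= combinedOffset) owner = e; else break;
        }
        return owner[1] + (combinedOffset - owner[0]);
    }

    public static void main(String[] args) {
        OffsetIndexSketch index = new OffsetIndexSketch();
        index.addTextNode(40, "AAA BBB ");   // made-up start positions
        index.addTextNode(90, "CCC DDD");
        System.out.println(index.toPagePosition(index.text().indexOf("DDD")));
    }
}
```

A linear scan keeps the sketch short; with many nodes a binary search over the entries would be the obvious replacement.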
> >
> >Jay Kim wrote:
> >
> >
> >
> >>Hi Derrick,
> >>
> >>Thanks very much for your help. I've tried your sample code, and it
> >>gives me the right text that I can compare with.
> >>But, I have a couple of issues getting the offset of the search word.
> >>
> >>1. When I try Text.getStartPosition(), it doesn't match the
> >>character count that I get from the HTML source file - yeah, I counted
> >>one by one myself. It's about 15 characters off. For example, the
> >>character count that I got from the parser was 154, as opposed to 139
> >>that I counted from the file.
> >>The numbers are still off even if I include/exclude newline
> >>characters.
> >>Are there some other factors that I'm not aware of?
> >>
> >>2. After I found the node that contains the word(string) that I'm
> >>searching for, I need to get the offset of that word. For example,
> >>      Node text = AAA BBB CCC DDD BBB EEE
> >>And, if the word that I'm searching for is the second 'BBB', is there
> >>any reliable way to get the offset of that word? (I can't just get the
> >>index from that string because the HTML string could be different.)
> >>Please let me know.
> >>
> >>Thanks,
> >>
> >>Jay
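
A minimal sketch for question 2 above: finding the offset of the n-th whole-word occurrence inside one node's text with java.util.regex. The helper name nthWholeWord is made up, and the \b word-boundary test assumes the search word starts and ends with a letter, digit or underscore. The offset is relative to the node text, so it still has to be combined with the node's own start position.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WholeWordSketch {
    // Returns the offset of the n-th (1-based) whole-word match of word in text, or -1.
    static int nthWholeWord(String text, String word, int n) {
        if (n < 1) throw new IllegalArgumentException("n must be >= 1");
        // Pattern.quote keeps any regex metacharacters in the word literal.
        Matcher m = Pattern.compile("\\b" + Pattern.quote(word) + "\\b").matcher(text);
        int seen = 0;
        while (m.find()) {
            if (++seen == n) return m.start();
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(nthWholeWord("AAA BBB CCC DDD BBB EEE", "BBB", 2));
    }
}
```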
> >
> >_______________________________________________
> >Htmlparser-user mailing list
> >Htm...@li...
> >https://lists.sourceforge.net/lists/listinfo/htmlparser-user