Re: [Htmlparser-user] Finding a whole word
From: Derrick O. <Der...@Ro...> - 2006-06-01 11:56:39
Jay,

Your count may be off because the parser may be fetching a different page from the one you counted. HTTP servers may change the page based on the user agent. The count is only really reliable when parsing from a file, unless you save the contents of the page the parser is working with (see Page.getText()).

And, yes, each \r\n is turned into a single \n in the Text node, but the node positions don't take this into account. The Page class has getRow() and getColumn(), so you can compare with the numbers reported by a text editor, which saves manual counting. Note that these are zero-based, not one-based like most editors.

Your second problem is really up to you: as the programmer, you need to remember which nodes the strings came from. The string offset is only relative to the node position, which is absolute on the page. If I were you, I would create an index of node position and string position as you form the text in visitStringNode.

Derrick

Jay Kim wrote:

>Hi Derrick,
>
>Thanks very much for your help. I've tried your sample code, and it
>gives me the right text that I can compare with.
>But I have a couple of issues getting the offset of the word I'm searching for.
>
>1. When I try Text.getStartPosition(), it doesn't match the
>character count that I get from the HTML source file - yeah, I counted
>one by one myself. It's about 15 characters off. For example, the
>character count that I got from the parser was 154, as opposed to 139
>that I counted from the file.
>The numbers are still off whether I include or exclude newline characters.
>Are there some other factors that I'm not aware of?
>
>2. After I find the node that contains the word (string) that I'm
>searching for, I need to get the offset of that word. For example,
> Node text = AAA BBB CCC DDD BBB EEE
>And, if the word that I'm searching for is the second 'BBB', is there
>any reliable way to get the offset of that word? (I can't just get the
>index from that string because the HTML string could be different.)
>Please let me know.
>
>Thanks,
>
>Jay
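
For illustration, the index of node position and string position suggested in the reply could look roughly like the sketch below. It is plain Java and does not call the parser at all: OffsetIndex, add() and toPagePosition() are hypothetical names, and the node start of 139 is just the figure from the question. The assumption is that add() would be called from visitStringNode with the node's start position and its text. Because \r\n is collapsed to \n inside a Text node, the mapped position may drift by one for each collapsed line break that precedes the match within the same node.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: maps offsets in the accumulated text back to page positions. */
final class OffsetIndex {
    /** One entry per text node that was appended to the buffer. */
    private static final class Entry {
        final int nodeStart;    // absolute position of the node on the page
        final int bufferStart;  // where this node's text begins in the buffer
        final int length;       // length of this node's text

        Entry(int nodeStart, int bufferStart, int length) {
            this.nodeStart = nodeStart;
            this.bufferStart = bufferStart;
            this.length = length;
        }
    }

    private final StringBuilder buffer = new StringBuilder();
    private final List<Entry> entries = new ArrayList<>();

    /** Record a node's text; intended to be called once per text node visited. */
    void add(int nodeStart, String text) {
        entries.add(new Entry(nodeStart, buffer.length(), text.length()));
        buffer.append(text);
    }

    /** The text collected so far, in visiting order. */
    String text() {
        return buffer.toString();
    }

    /** Translate a buffer offset into a page position, or -1 if out of range. */
    int toPagePosition(int bufferOffset) {
        for (Entry e : entries) {
            if (bufferOffset >= e.bufferStart && bufferOffset < e.bufferStart + e.length) {
                return e.nodeStart + (bufferOffset - e.bufferStart);
            }
        }
        return -1;
    }
}
```

To find the second 'BBB', search the collected text with String.indexOf(String, int), starting just past the first hit, and pass the resulting offset to toPagePosition().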