Thread: [Htmlparser-user] Newbie Problem with HasChildFilter
From: Roger V. <rog...@go...> - 2009-07-06 08:33:47
Hi

I'm probably doing something stupid here, but I can't get the HasChildFilter to work properly. I am trying to get all the <a> tags that occur inside the <body> tag so I can re-write them. I don't want the javascript generated tags that occur inside the <head> tag. My test case is below.

    String testHtml = "<html><head><script><a href=JAVASCRIPT:openProc('\" + parent.contents.procUID[i] + \"','main')>"
        + "</script><body><table><tr><td>Cell Content</td></tr><tr><td>"
        + "<a target=\"main\" href=\"findXml.jsp?XMLFile=G455051\">Control Mechanism</a></td></tr></table></body></html>";

    Parser parser = new Parser(testHtml);
    NodeList originalPage = parser.parse(null);
    NodeFilter filter = new AndFilter(new TagNameFilter("body"),
        new HasChildFilter(new TagNameFilter("a"), true));
    NodeList extract = originalPage.extractAllNodesThatMatch(filter, true);

This fails to find any of the <a> tags - extract.size() is zero. Can someone point out what I'm doing wrong please.

Regards
From: Derrick O. <der...@gm...> - 2009-07-06 10:01:40
I think the TagNameFilter is case sensitive, so it should be:

    NodeFilter filter = new AndFilter(new TagNameFilter("BODY"),
        new HasChildFilter(new TagNameFilter("A"), true));

But the filter you've constructed would find the BODY tag:

    keep: tag named BODY and has a child named A

If you want the A tags inside the BODY tag, it would be:

    NodeFilter filter = new AndFilter(new TagNameFilter("A"),
        new HasParentFilter(new TagNameFilter("BODY"), true));
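To make the distinction in the reply above concrete, here is a minimal sketch of the two selection rules on a toy tree. It does not use HTMLParser at all: `ToyTag`, `FilterSketch`, `withDescendant` and `withAncestor` are hypothetical names invented for illustration, and the tree is a hand-built stand-in for the test page, not actual parser output. The first rule keeps the BODY node because it contains an A; the second keeps the A node because it sits under a BODY.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy stand-in for a parsed tag. NOT an HTMLParser class. */
final class ToyTag {
    final String name;
    final List<ToyTag> children = new ArrayList<>();
    ToyTag parent;

    ToyTag(String name, ToyTag... kids) {
        this.name = name;
        for (ToyTag kid : kids) {
            kid.parent = this;
            children.add(kid);
        }
    }

    /** True if any tag below this one (at any depth) has the given name. */
    boolean hasDescendant(String wanted) {
        for (ToyTag child : children) {
            if (child.name.equals(wanted) || child.hasDescendant(wanted)) {
                return true;
            }
        }
        return false;
    }

    /** True if any tag above this one (at any depth) has the given name. */
    boolean hasAncestor(String wanted) {
        for (ToyTag p = parent; p != null; p = p.parent) {
            if (p.name.equals(wanted)) {
                return true;
            }
        }
        return false;
    }
}

final class FilterSketch {
    /** "Tag named self AND has a descendant named child": selects the container. */
    static List<String> withDescendant(ToyTag node, String self, String child) {
        List<String> out = new ArrayList<>();
        if (node.name.equals(self) && node.hasDescendant(child)) {
            out.add(node.name);
        }
        for (ToyTag c : node.children) {
            out.addAll(withDescendant(c, self, child));
        }
        return out;
    }

    /** "Tag named self AND has an ancestor named ancestor": selects the inner tag. */
    static List<String> withAncestor(ToyTag node, String self, String ancestor) {
        List<String> out = new ArrayList<>();
        if (node.name.equals(self) && node.hasAncestor(ancestor)) {
            out.add(node.name);
        }
        for (ToyTag c : node.children) {
            out.addAll(withAncestor(c, self, ancestor));
        }
        return out;
    }

    /** Hand-built approximation of the test page's structure (an assumption). */
    static ToyTag sample() {
        return new ToyTag("HTML",
            new ToyTag("HEAD", new ToyTag("SCRIPT")),
            new ToyTag("BODY",
                new ToyTag("TABLE",
                    new ToyTag("TR", new ToyTag("TD", new ToyTag("A"))))));
    }

    public static void main(String[] args) {
        ToyTag page = sample();
        System.out.println(withDescendant(page, "BODY", "A")); // prints [BODY]
        System.out.println(withAncestor(page, "A", "BODY"));   // prints [A]
    }
}
```

The same AND-of-two-conditions shape is what the AndFilter expressions in the thread build; only the direction of the second condition (child versus parent) changes which node is returned.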
From: Roger V. <rog...@go...> - 2009-07-07 12:11:48
> If you want the A tags inside the BODY tag it would be:
>     NodeFilter filter = new AndFilter(new TagNameFilter("A"),
>         new HasParentFilter(new TagNameFilter("BODY"), true));

Thanks Derrick, that worked perfectly. I've now got another problem that I think might be a bug. With the test case below (I'm not making this up - I've actually got to work with this sort of stuff!)

    String testHtml = "<html><head><script><a href=JAVASCRIPT:openProc('\" + parent.contents.procUID[i] + \"','main')>"
        + "</script><body><table><tr><td><img src=/666.jpg\"></td></tr><tr><td>"
        + "document.write(\"<a href=JAVASCRIPT:openProc('\" + parent.contents.procUID[i] + \"','main')>\" + parent.contents.procDisplay[i] + \"</a>\"</a></td></tr></table></body></html>";

    Parser parser = new Parser(testHtml);
    NodeList originalPage = parser.parse(null);
    NodeFilter filter = new AndFilter(new TagNameFilter("a"),
        new HasParentFilter(new TagNameFilter("body"), true));
    NodeList extract = originalPage.extractAllNodesThatMatch(filter, true);

This picks up the second JAVASCRIPT LinkTag - the one outside the <head> tag, but inside the document.write(). When I try to evaluate LinkTag.getLinkTag() against this, HtmlParser is reporting the text as JAVASCRIPT:openProc('" which is not correct. Any ideas?

Regards
From: Derrick O. <der...@gm...> - 2009-07-07 15:12:36
My gut reaction, without even looking into it in detail because it is a javascript problem, is to tell you to set

    org.htmlparser.scanners.ScriptScanner.STRICT = false

and try it again.
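For readers wondering what kind of behaviour such a flag could change, here is a toy sketch of strict versus quote-aware detection of the end of script text. It is an assumption about the general idea, not the library's implementation: `ScriptEndSketch` and `findEnd` are hypothetical names, and the sketch ignores comments and other details a real scanner would handle. In strict mode the first `</` ends the script even inside a string literal; in non-strict mode a `</` inside balanced quotes is skipped.

```java
final class ScriptEndSketch {
    /**
     * Returns the index of the "</" that ends the script text, or -1 if none.
     * strict = true:  the first "</" wins, whether or not it is inside quotes.
     * strict = false: a "</" inside balanced single or double quotes is skipped.
     */
    static int findEnd(String text, boolean strict) {
        char quote = 0; // 0 means "not currently inside a quoted string"
        for (int i = 0; i + 1 < text.length(); i++) {
            char c = text.charAt(i);
            if (!strict) {
                if (quote != 0) {
                    if (c == '\\') {
                        i++;           // skip the escaped character as well
                    } else if (c == quote) {
                        quote = 0;     // closing quote found
                    }
                    continue;          // nothing inside quotes can end the script
                }
                if (c == '"' || c == '\'') {
                    quote = c;         // opening quote found
                    continue;
                }
            }
            if (c == '<' && text.charAt(i + 1) == '/') {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String script = "document.write(\"</a>\");</script>";
        // Strict: cut at the "</" inside the string literal.
        System.out.println(script.substring(0, findEnd(script, true)));
        // Non-strict: cut at the real closing tag.
        System.out.println(script.substring(0, findEnd(script, false)));
    }
}
```

With the actual library, the suggestion in the reply amounts to assigning `org.htmlparser.scanners.ScriptScanner.STRICT = false` before calling `parser.parse(null)`; whether that resolves the reported link text is not confirmed in this thread.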