Thread: [Htmlparser-user] Newbie Problem with HasChildFilter

[Htmlparser-user] Newbie Problem with HasChildFilter

From: Roger V. <rog...@go...> - 2009-07-06 08:33:47

Hi

I'm probably doing something stupid here, but I can't get the
HasChildFilter to work properly. I am trying to get all the <a> tags
that occur inside the <body> tag so I can rewrite them. I don't want
the JavaScript-generated tags that occur inside the <head> tag. My
test case is below.

String testHtml = "<html><head><script><a href=JAVASCRIPT:openProc('\" + parent.contents.procUID[i] + \"','main')>"
        + "</script><body><table><tr><td>Cell Content</td></tr><tr><td>"
        + "<a target=\"main\" href=\"findXml.jsp?XMLFile=G455051\">Control Mechanism</a></td></tr></table></body></html>";

Parser parser = new Parser(testHtml);
NodeList originalPage = parser.parse(null);
NodeFilter filter = new AndFilter(new TagNameFilter("body"),
	        new HasChildFilter(new TagNameFilter("a"),true));
NodeList extract = originalPage.extractAllNodesThatMatch(filter, true);

This fails to find any of the <a> tags - extract.size() is zero. Can
someone point out what I'm doing wrong, please?

Regards

Re: [Htmlparser-user] Newbie Problem with HasChildFilter

From: Derrick O. <der...@gm...> - 2009-07-06 10:01:40

I think the TagNameFilter is case sensitive so it should be:
NodeFilter filter = new AndFilter(new TagNameFilter("BODY"),
               new HasChildFilter(new TagNameFilter("A"),true));

But, the filter you've constructed would find the BODY tag:
keep: tag named BODY and has a child named A

If you want the A tags inside the BODY tag it would be:
NodeFilter filter = new AndFilter(new TagNameFilter("A"),
               new HasParentFilter(new TagNameFilter("BODY"),true));
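
To make the difference between the two filter combinations concrete, here is a minimal sketch in plain Java. It does not use HTMLParser at all: Node, hasDescendant, hasAncestor, matches, named and sampleTree are hypothetical stand-ins, assumed to behave roughly like a recursive HasChildFilter and HasParentFilter. It only illustrates which node each combination keeps.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Toy model only: these classes are NOT the HTMLParser API.
class FilterSketch {
    /** A minimal element node with a tag name, a parent link and children. */
    static final class Node {
        final String name;
        Node parent;
        final List<Node> children = new ArrayList<>();

        Node(String name, Node... kids) {
            this.name = name;
            for (Node kid : kids) {
                kid.parent = this;
                children.add(kid);
            }
        }
    }

    /** True if any descendant of n satisfies p (a recursive "has child" test). */
    static boolean hasDescendant(Node n, Predicate<Node> p) {
        for (Node child : n.children) {
            if (p.test(child) || hasDescendant(child, p)) return true;
        }
        return false;
    }

    /** True if any ancestor of n satisfies p (a recursive "has parent" test). */
    static boolean hasAncestor(Node n, Predicate<Node> p) {
        for (Node a = n.parent; a != null; a = a.parent) {
            if (p.test(a)) return true;
        }
        return false;
    }

    /** Collects, in document order, the names of all nodes the filter keeps. */
    static List<String> matches(Node root, Predicate<Node> filter) {
        List<String> out = new ArrayList<>();
        collect(root, filter, out);
        return out;
    }

    private static void collect(Node n, Predicate<Node> filter, List<String> out) {
        if (filter.test(n)) out.add(n.name);
        for (Node child : n.children) collect(child, filter, out);
    }

    /** Filter that keeps nodes with exactly this tag name. */
    static Predicate<Node> named(String name) {
        return n -> n.name.equals(name);
    }

    /** A tree shaped roughly like the test page: the link sits deep inside BODY. */
    static Node sampleTree() {
        return new Node("HTML",
            new Node("HEAD", new Node("SCRIPT")),
            new Node("BODY",
                new Node("TABLE", new Node("TR", new Node("TD", new Node("A"))))));
    }

    public static void main(String[] args) {
        Node root = sampleTree();
        // "tag named BODY and has a child named A" keeps the BODY node itself.
        Predicate<Node> bodyWithA = named("BODY").and(n -> hasDescendant(n, named("A")));
        // "tag named A and has a parent named BODY" keeps the A nodes.
        Predicate<Node> aInsideBody = named("A").and(n -> hasAncestor(n, named("BODY")));
        System.out.println(matches(root, bodyWithA));   // prints [BODY]
        System.out.println(matches(root, aInsideBody)); // prints [A]
    }
}
```

In this toy tree the first combination keeps only BODY and the second keeps only A, which matches the "keep:" description above.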


Re: [Htmlparser-user] Newbie Problem with HasChildFilter

From: Roger V. <rog...@go...> - 2009-07-07 12:11:48

>
> If you want the A tags insid the BODY tag it would be:
> NodeFilter filter = new AndFilter(new TagNameFilter("A"),
>                new HasParentFilter(new TagNameFilter("BODY"),true));
>

Thanks Derrick, that worked perfectly. I've now got another problem that
I think might be a bug. With the test case
(I'm not making this up - I've actually got to work with this sort of stuff!)

String testHtml = "<html><head><script><a href=JAVASCRIPT:openProc('\" + parent.contents.procUID[i] + \"','main')>"
        + "</script><body><table><tr><td><img src=/666.jpg\"></td></tr><tr><td>"
        + "document.write(\"<a href=JAVASCRIPT:openProc('\" + parent.contents.procUID[i] + \"','main')>\" + parent.contents.procDisplay[i] + \"</a>\"</a></td></tr></table></body></html>";

Parser parser = new Parser(testHtml);
NodeList originalPage = parser.parse(null);
NodeFilter filter = new AndFilter(new TagNameFilter("a"),
               new HasParentFilter(new TagNameFilter("body"),true));
NodeList extract = originalPage.extractAllNodesThatMatch(filter, true);

This picks up the second JAVASCRIPT LinkTag - the one outside the
<head> tag, but inside the document.write(). When I try to evaluate
LinkTag.getLinkTag() against this, HtmlParser is reporting the text as
JAVASCRIPT:openProc('" which is not correct. Any ideas?

Regards
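
One possible reading of the value reported above, offered as an assumption rather than a confirmed diagnosis: the href in that tag is not quoted, and HTML parsers commonly end an unquoted attribute value at the first whitespace or '>'. The sketch below is a toy scanner in plain Java, not HTMLParser code; hrefOf is a hypothetical helper written only for this illustration.

```java
// Toy scanner only: NOT HTMLParser code. It mimics the common rule that an
// unquoted attribute value ends at the first whitespace or '>'.
class UnquotedAttributeSketch {
    /** Returns the value following "href=" in tag, or null if there is no href. */
    static String hrefOf(String tag) {
        int start = tag.indexOf("href=");
        if (start < 0) return null;
        start += "href=".length();
        if (start >= tag.length()) return "";
        char first = tag.charAt(start);
        if (first == '"' || first == '\'') {
            // Quoted value: runs to the matching quote (or to the end if unterminated).
            int end = tag.indexOf(first, start + 1);
            return end < 0 ? tag.substring(start + 1) : tag.substring(start + 1, end);
        }
        // Unquoted value: runs to the first whitespace or '>'.
        int end = start;
        while (end < tag.length()
                && !Character.isWhitespace(tag.charAt(end))
                && tag.charAt(end) != '>') {
            end++;
        }
        return tag.substring(start, end);
    }

    public static void main(String[] args) {
        // The <a> tag inside document.write(), taken from the test case.
        String tag = "<a href=JAVASCRIPT:openProc('\" + parent.contents.procUID[i] + \"','main')>";
        System.out.println(hrefOf(tag)); // prints JAVASCRIPT:openProc('"
    }
}
```

Under that assumption the value stops at the space after the first double quote, which is the same text the thread reports. A quoted href such as the one in the first test case would be read in full.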

Re: [Htmlparser-user] Newbie Problem with HasChildFilter

From: Derrick O. <der...@gm...> - 2009-07-07 15:12:36

My gut reaction, without even looking into it in detail because it is a
JavaScript problem, is to tell you to set
org.htmlparser.scanners.ScriptScanner.STRICT = false and try it again.
