Thread: [Htmlparser-user] Excluding some tags
From: Manish K. <ma...@we...> - 2010-11-16 07:48:10
This is indeed a newbie question. I could not find a workaround to exclude some tags (<script> in my case) while parsing.

I tried using the NotFilter as shown below, but it didn't work: I still got all the <script> tags in my NodeList.

> NotFilter noScriptFilter = new NotFilter();
> noScriptFilter.setPredicate(new NodeFilter() {
>     public boolean accept(Node currNode) {
>         if (currNode instanceof TagNode) {
>             if (((TagNode) currNode).getRawTagName().equalsIgnoreCase("script")) {
>                 return true;
>             }
>         }
>         return false;
>     }
> });
> NodeList allNodes = this.parser.parse(noScriptFilter);

I would appreciate it if someone could guide me through this.

Thanks,
Manish
From: Derrick O. <der...@gm...> - 2010-11-16 18:09:36
Although the filter is correct, the tag enclosing the <script> tag is accepted, and with it its child tags, including the <script> tag.

Maybe a way to do it is to override the ScriptTag class with MyScriptTag so that it prints nothing in the toHtml() call. Add the overridden class to the PrototypicalNodeFactory as described here: http://htmlparser.sourceforge.net/faq.html#composite, and then get all tags and print the whole thing with System.out.println(this.parser.parse(null).toHtml());

_______________________________________________
Htmlparser-user mailing list
Htm...@li...
https://lists.sourceforge.net/lists/listinfo/htmlparser-user
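The point about an accepted enclosing tag bringing its children along can be sketched with a toy tree. This is only an illustration under stated assumptions: the Tag class, collect() and render() below are hypothetical stand-ins written for this note, not the HTML Parser API. The sketch collects every node that passes a "not script" filter, shows that serializing an accepted ancestor still emits the <script> child, and then shows the alternative of emitting nothing for <script> while serializing, which is the idea behind the MyScriptTag suggestion.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

/** Toy model for illustration; all names are hypothetical, NOT the HTML Parser API. */
final class SubtreeFilterSketch {

    /** A tag with a name and child tags. */
    static final class Tag {
        final String name;
        final List<Tag> children;

        Tag(String name, Tag... kids) {
            this.name = name;
            this.children = Arrays.asList(kids);
        }

        /** Serializes this tag with its whole subtree. */
        String render() {
            return render(tag -> true);
        }

        /** Serializes this tag, emitting nothing for any subtree the filter rejects. */
        String render(Predicate<Tag> keep) {
            if (!keep.test(this)) {
                return "";
            }
            StringBuilder out = new StringBuilder("<").append(name).append(">");
            for (Tag child : children) {
                out.append(child.render(keep));
            }
            return out.append("</").append(name).append(">").toString();
        }
    }

    /** Adds every node accepted by the filter; accepted nodes keep all their children. */
    static void collect(Tag node, Predicate<Tag> keep, List<Tag> out) {
        if (keep.test(node)) {
            out.add(node);
        }
        for (Tag child : node.children) {
            collect(child, keep, out);
        }
    }

    public static void main(String[] args) {
        Predicate<Tag> notScript = tag -> !tag.name.equalsIgnoreCase("script");
        Tag html = new Tag("html", new Tag("body", new Tag("script"), new Tag("a")));

        List<Tag> hits = new ArrayList<>();
        collect(html, notScript, hits);

        // <html> passed the filter, so serializing it still includes <script>.
        System.out.println(hits.get(0).render());
        // Filtering while serializing drops the <script> subtree instead.
        System.out.println(html.render(notScript));
    }
}
```

In this model the first line printed still contains the <script> element and the second does not, even though the same filter was used in both cases.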
From: Manish K. <ma...@we...> - 2010-11-17 06:44:40
Thanks for the reply, Derrick. So, here's the real problem: I do want to retain the script tag. At the same time, I want to override all the links in the page. The parser doesn't play nice. Consider the HTML below:

> <script>
> document.write("<a href='/jslink'>JS Link</a>")
> </script>
> <a href="/somelink">Some link</a>

To me, the string literal inside the script tag above is not a link at all. However, when I try to fetch all the <a> tags using the parser, it gives me both of the above. Is there a way to not get the <a>s which are not in the <script> tag?

Thanks,
Manish
From: Manish K. <ma...@we...> - 2010-11-17 06:46:34
Sorry, let me modify my question; ignore the previous one. Is there a way to get only the <a>s which are not in the <script> tag?

Thanks,
Manish
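What is being asked for here, links that have no <script> ancestor, can be sketched on a toy tree. The Tag class, hasAncestor() and linksOutsideScript() are hypothetical names made up for this sketch and are not part of the HTML Parser library; the sketch also assumes the parser has already produced an <a> node underneath the <script> node, as described in the question.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy tree for illustration; these types are hypothetical, NOT the HTML Parser API. */
final class LinksOutsideScriptSketch {

    static final class Tag {
        final String name;
        final String href;                       // null when the tag has no href
        Tag parent;                              // null for the root
        final List<Tag> children = new ArrayList<>();

        Tag(String name, String href) {
            this.name = name;
            this.href = href;
        }

        /** Attaches a child and returns this tag so calls can be chained. */
        Tag add(Tag child) {
            child.parent = this;
            children.add(child);
            return this;
        }

        /** Walks up the parent chain looking for a tag with the given name. */
        boolean hasAncestor(String ancestorName) {
            for (Tag t = parent; t != null; t = t.parent) {
                if (t.name.equalsIgnoreCase(ancestorName)) {
                    return true;
                }
            }
            return false;
        }
    }

    /** Returns the href of every <a> that is not nested inside a <script>. */
    static List<String> linksOutsideScript(Tag root) {
        List<String> hrefs = new ArrayList<>();
        collect(root, hrefs);
        return hrefs;
    }

    private static void collect(Tag node, List<String> hrefs) {
        boolean isLink = node.name.equalsIgnoreCase("a") && node.href != null;
        if (isLink && !node.hasAncestor("script")) {
            hrefs.add(node.href);
        }
        for (Tag child : node.children) {
            collect(child, hrefs);
        }
    }

    public static void main(String[] args) {
        // Mirrors the sample page: one link written by script, one real link.
        Tag body = new Tag("body", null)
                .add(new Tag("script", null).add(new Tag("a", "/jslink")))
                .add(new Tag("a", "/somelink"));

        System.out.println(linksOutsideScript(body)); // prints [/somelink]
    }
}
```

The ancestor check is recursive on purpose: a link nested several levels below a <script> is excluded as well, not only a direct child.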
From: Derrick O. <der...@gm...> - 2010-11-17 17:35:31
That's not valid HTML. You'll want to turn strict script scanning off then.
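The remark about validity follows from HTML 4.01, where the content of a <script> element ends at the first "</" followed by a letter, so the "</a>" inside the document.write() string already terminates the script. The sketch below contrasts that strict rule with a simplified lenient rule that only stops at "</script". Both methods are hypothetical helpers written for this note, not the library's scanner, and the lenient version ignores quoting, which a real lenient scanner may take into account. In the library itself the switch is, as far as I recall, the static ScriptScanner.STRICT flag; treat that name as an assumption and check the Javadoc for your version.

```java
import java.util.Locale;

/** Illustration of where script content ends; hypothetical helpers, NOT the library's scanner. */
final class ScriptEndSketch {

    /** Strict HTML 4.01 rule: content ends at the first "</" followed by an ASCII letter. */
    static int strictEnd(String html, int from) {
        for (int i = from; i + 2 < html.length(); i++) {
            if (html.charAt(i) == '<' && html.charAt(i + 1) == '/'
                    && isAsciiLetter(html.charAt(i + 2))) {
                return i;
            }
        }
        return -1; // no terminator found
    }

    /** Simplified lenient rule: content ends only at "</script" (case-insensitive). */
    static int lenientEnd(String html, int from) {
        return html.toLowerCase(Locale.ROOT).indexOf("</script", from);
    }

    private static boolean isAsciiLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    public static void main(String[] args) {
        String page = "<script>document.write(\"<a href='/jslink'>JS Link</a>\")</script>";
        int start = "<script>".length();

        // Strict: the script is cut off at "</a>", in the middle of the string literal.
        System.out.println(page.substring(start, strictEnd(page, start)));
        // Lenient: the whole document.write(...) call stays inside the script.
        System.out.println(page.substring(start, lenientEnd(page, start)));
    }
}
```

Under the strict rule the text after "</a>" is no longer script content, which is why the sample page misbehaves; escaping the sequence in the page source (for example "<\/a>") is the usual way to make such a script valid.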