htmlparser-user Mailing List for HTML Parser (Page 12)
Brought to you by:
derrickoswald
From: mani k. <maj...@gm...> - 2011-01-13 06:38:44
|
Hi, I am new to HTML Parser. I am trying to use the parser to grab all the images referenced in a page. Can you please help me with how to do that? -- ALWAYS KEEP SMILING FOR U EVER, G.MANIKANDAN |
From: Derrick O. <der...@gm...> - 2010-12-15 18:44:01
|
If you want these tags to contain children, you have to do more work and make them into composite tags: see http://htmlparser.sourceforge.net/faq.html#composite

On Wed, Dec 15, 2010 at 11:58 AM, Francesco Fontana <fra...@gm...> wrote:
> Thank you for the answer, I've tried, but nothing happens..
> The program finds all the right Nodes (with upper and with lower
> case), but everyone has node.getChildren()==NULL... |
From: Francesco F. <fra...@gm...> - 2010-12-15 10:58:40
|
Thank you for the answer. I've tried it, but nothing happens. The program finds all the right nodes (with both upper and lower case names), but every one of them has node.getChildren() == null...

> Try upper case tag names.
> NodeFilter singleFieldFilter=new TagNameFilter("FIELD");
> NodeFilter multipleFieldFilter=new TagNameFilter("FIELD_LIST");

Thanks a lot, any more suggestions?

Francesco |
From: Derrick O. <der...@gm...> - 2010-12-14 17:55:06
|
Try upper case tag names.

    NodeFilter singleFieldFilter = new TagNameFilter("FIELD");
    NodeFilter multipleFieldFilter = new TagNameFilter("FIELD_LIST");

On Tue, Dec 14, 2010 at 11:18 AM, Francesco Fontana <fra...@gm...> wrote:
> I'm trying to parse an xml file, but I receive the message "WARNING: URL
> [filename] does not contain text"...
> When I put a watch on a filter, every node found has id and attributes, but
> the children still null... |
From: Francesco F. <fra...@gm...> - 2010-12-14 10:18:58
|
Hi,

I'm trying to parse an XML file, but I receive the message "WARNING: URL [filename] does not contain text"... When I put a watch on a filter, every node found has an id and attributes, but the children are still null...

The code is really simple, and the XML is well formed and has a DTD. Does anyone know what I'm doing wrong?

Thank you very much,
Francesco

------------------------- java code --------------------------
public FilterSet setBaseValues(String siteXMLName) throws ParserException {
    NodeFilter singleFieldFilter = new TagNameFilter("field");
    NodeFilter multipleFieldFilter = new TagNameFilter("field_list");
    NodeList singleFieldList = new NodeList();
    NodeList multipleFieldList = new NodeList();

    Parser parser = new Parser("./" + siteXMLName);
    for (NodeIterator e = parser.elements(); e.hasMoreNodes();) {
        Node node = e.nextNode();
        node.collectInto(singleFieldList, singleFieldFilter);
        node.collectInto(multipleFieldList, multipleFieldFilter);
    }
}
------------------------------------------------------------------

------------------------- try.xml content ----------------------
<?xml version="1.0"?>
<!DOCTYPE site SYSTEM "hsh.dtd">
<site>
    <field id="image" type="attribute">
        <caption>a caption</caption>
        <attribute>src</attribute>
    </field>
    <field id="description" type="text">
        <caption>text text</caption>
    </field>
    <field_list list="parent_name" type="text">
        <caption>another caption</caption>
        <names>
            <name id="field1">Field 1</name>
            <name id="field2">Field 2</name>
        </names>
    </field_list>
</site>
---------------------------------------------------------------- |
From: Derrick O. <der...@gm...> - 2010-11-17 17:35:31
|
That's not valid HTML. You'll want to turn strict script scanning off then.

On Wed, Nov 17, 2010 at 7:44 AM, Manish Kashyap <ma...@we...> wrote:
> <script>
> document.write("<a href='/jslink'>JS Link</a>")
> </script>
> <a href="/somelink">Some link</a>
>
> To me the string literal inside script tag above is not a link at all.
> However, when I try to fetch all the <a> using the parser it would give me
> both of the above. |
From: Manish K. <ma...@we...> - 2010-11-17 06:46:34
|
Sorry, let me correct my question; please ignore the previous one. Is there a way to get the <a>s which are not in the <script> tag?

Thanks,
Manish |
From: Manish K. <ma...@we...> - 2010-11-17 06:44:40
|
Thanks for the reply, Derrick. So, here's the real problem: I do want to retain the script tag. At the same time, I want to override all the links in the page. The parser doesn't play nice. Consider the following HTML:

    <script>
    document.write("<a href='/jslink'>JS Link</a>")
    </script>
    <a href="/somelink">Some link</a>

To me, the string literal inside the script tag above is not a link at all. However, when I try to fetch all the <a> tags using the parser, it gives me both of the above. Is there a way to not get the <a>s which are not in the <script> tag?

Thanks
Manish

On Tue, Nov 16, 2010 at 11:39 PM, Derrick Oswald <der...@gm...> wrote:
> Maybe a way to do it is to override the ScriptTag class with MyScriptTag so
> that it prints nothing in the toHtml () call. |
From: Derrick O. <der...@gm...> - 2010-11-16 18:09:36
|
Although the filter is correct, the tag enclosing the <script> tag is accepted, and with it its child tags, including the <script> tag. Maybe a way to do it is to override the ScriptTag class with MyScriptTag so that it prints nothing in the toHtml () call. Add the overridden class to the PrototypicalNodeFactory as described here: http://htmlparser.sourceforge.net/faq.html#composite, and then get all tags and print the whole thing with System.out.println (this.parser.parse(null).toHtml ());

On Tue, Nov 16, 2010 at 8:19 AM, Manish Kashyap <ma...@we...> wrote:
> I tried using the NotFilter as underneath, but it didn't work as I got all
> the <script> tags in my NodeList |
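The effect described in this reply can be illustrated with a toy tree, without the library. This is only a sketch: the `FilterDemo` and `Node` classes below are hypothetical stand-ins, not HTML Parser types, and the recursive `collect` method is an assumption about how a filter is applied to nested nodes. It shows that a filter can reject the script node itself while an accepted ancestor still carries it as a descendant.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Toy model only: hypothetical classes, not HTML Parser types.
class FilterDemo {

    static final class Node {
        final String name;
        final List<Node> children = new ArrayList<>();

        Node(String name, Node... kids) {
            this.name = name;
            for (Node k : kids) {
                children.add(k);
            }
        }

        // Serializes this node together with all of its descendants.
        String toHtml() {
            StringBuilder sb = new StringBuilder("<" + name + ">");
            for (Node c : children) {
                sb.append(c.toHtml());
            }
            return sb.append("</").append(name).append(">").toString();
        }
    }

    // Collects every node the filter accepts; accepted nodes keep their children.
    static void collect(Node node, Predicate<Node> filter, List<Node> out) {
        if (filter.test(node)) {
            out.add(node);
        }
        for (Node c : node.children) {
            collect(c, filter, out);
        }
    }

    public static void main(String[] args) {
        Node page = new Node("html",
                new Node("body", new Node("script"), new Node("a")));
        List<Node> accepted = new ArrayList<>();
        collect(page, n -> !n.name.equalsIgnoreCase("script"), accepted);

        // The script node is not in the list itself (html, body and a are)...
        System.out.println(accepted.size());
        // ...but the accepted html node still serializes its script descendant.
        System.out.println(accepted.get(0).toHtml());
    }
}
```

In this model the only way to lose the script content from the output is to change how the script node serializes itself, which is the direction the reply suggests with an overridden toHtml ().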
From: Manish K. <ma...@we...> - 2010-11-16 07:48:10
|
This indeed is a newbie question. I could not find a workaround to exclude some tags (<script> in my case) while parsing.

I tried using the NotFilter as below, but it didn't work, as I got all the <script> tags in my NodeList:

    NotFilter noScriptFilter = new NotFilter();
    noScriptFilter.setPredicate(new NodeFilter() {
        public boolean accept(Node currNode) {
            if (currNode instanceof TagNode) {
                if (((TagNode) currNode).getRawTagName().equalsIgnoreCase("script")) {
                    return true;
                }
            }
            return false;
        }
    });
    NodeList allNodes = this.parser.parse(noScriptFilter);

I would appreciate it if someone could guide me through this.

Thanks
Manish |
From: Derrick O. <der...@gm...> - 2010-10-13 05:33:41
|
If you set the document base href on the page (see how BaseHrefTag handles it in doSemanticAction, basically page.setBaseUrl (base)), then the links you get back can be 'canonized', as you call it, by using the page's getAbsoluteURL (String link, boolean strict) method.

On Tue, Oct 12, 2010 at 10:50 PM, Santiago Basulto <san...@gm...> wrote:
> The idea is to extract all the links from a String (that contains an
> HTML page already read from an URLConnection). Is there anyway to
> "Canonize" them? I mean, if the href says "/food/fruits/2" convert it
> to "http://www.foodsite.com/home/fruits/2"? |
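The resolution step itself can be sketched with the JDK alone. The following is not HTML Parser's implementation of getAbsoluteURL; it swaps in java.net.URI to show what resolving an href against a base URL produces. The class name `LinkResolver` and the base URL are assumptions made for illustration.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of HTML Parser: resolves hrefs against a
// base URL using java.net.URI from the JDK.
class LinkResolver {

    // Resolves one href against the base; a malformed href is returned unchanged.
    static String resolve(String base, String href) {
        try {
            return URI.create(base).resolve(href.trim()).toString();
        } catch (IllegalArgumentException e) {
            return href;
        }
    }

    // Resolves every href in the list, preserving order.
    static List<String> resolveAll(String base, List<String> hrefs) {
        List<String> out = new ArrayList<>();
        for (String href : hrefs) {
            out.add(resolve(base, href));
        }
        return out;
    }

    public static void main(String[] args) {
        String base = "http://www.foodsite.com/home/"; // assumed base URL
        // A leading slash replaces the whole path of the base.
        System.out.println(resolve(base, "/food/fruits/2"));
        // A path without a leading slash is appended to the base directory.
        System.out.println(resolve(base, "fruits/2"));
        // An absolute link is left as it is.
        System.out.println(resolve(base, "http://example.org/x"));
    }
}
```

Note that under standard URL resolution an href of "/food/fruits/2" keeps its own path, so it resolves to http://www.foodsite.com/food/fruits/2; only a relative href such as "fruits/2" ends up under /home/.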
From: Stanislav O. <orl...@gm...> - 2010-10-12 21:13:38
|
Hi,

You may try to use filters (org.htmlparser.filters). This way you'll get all link tags from the page:

    Parser parser = parserMain.getParser(parseURL);
    try {
        NodeList links = parser.parse(new TagNameFilter("a"));
        for (SimpleNodeIterator sni = links.elements(); sni.hasMoreNodes();) {
            Node node = sni.nextNode();
            if (node instanceof LinkTag) {
                LinkTag lt = (LinkTag) node;
                // link text - lt.getLinkText()
                // link href - lt.getLink()
            }
        }
    } catch (ParserException ex) {
        logger.error(null, ex);
    }

On Tue, 2010-10-12 at 17:50 -0300, Santiago Basulto wrote:
> But, reading a little bit i found other classes that may help, but
> don't know how to use them. Can anyone help me out? |
From: Santiago B. <san...@gm...> - 2010-10-12 20:50:40
|
Hello people.

I'm starting with HTMLParser. It seems a great library. I've been doing some benchmarking and it runs really fast.

Now I'm trying to improve my code a little bit.

In my software, I use something like this to extract all links:

    public class LinkVisitor extends NodeVisitor {
        private Set<String> links = new HashSet<String>(100);

        public LinkVisitor() {
        }

        public void visitTag(Tag tag) {
            String name = tag.getTagName();
            if ("a".equalsIgnoreCase(name)) {
                String hrefValue = tag.getAttribute("href");
                links.add(hrefValue);
            }
        }

        public Set<String> getLinks() {
            return this.links; // was this.urls, which is not declared
        }
    }

But, reading a little bit, I found other classes that may help, and I don't know how to use them. Can anyone help me out?

The idea is to extract all the links from a String (that contains an HTML page already read from a URLConnection). Is there any way to "canonize" them? I mean, if the href says "/food/fruits/2", convert it to "http://www.foodsite.com/home/fruits/2"?

Thanks a lot!

--
Santiago Basulto.- |
From: Derrick O. <der...@gm...> - 2010-08-08 17:13:56
|
The same constructor for the Parser that takes a string [Parser (String resource) and Parser (String resource, ParserFeedback feedback)] checks for a string that starts with an angle bracket ('<'), and if so it assumes it is HTML; otherwise it is assumed to be some sort of URL. If you already have a parser, you can use setResource and pass it the HTML, since this is the same mechanism the constructor uses...

    /**
     * Set the html, a url, or a file.
     * @param resource The resource to use.
     * @exception IllegalArgumentException if <code>resource</code> is <code>null</code>.
     * @exception ParserException if a problem occurs in connecting.
     */
    public void setResource (String resource)

On Sun, Aug 1, 2010 at 8:57 AM, Mohammad Waqar <waq...@gm...> wrote:
> how can i parse an HTML string stored in a variable?
>
> Vakar |
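The check described in this reply can be sketched in a few lines of plain Java. This is an illustration of the idea only, not the library's source: the class name `ResourceKind` and the method `looksLikeHtml` are hypothetical, and skipping leading whitespace before looking for '<' is an assumption.

```java
// Hypothetical illustration of the "leading angle bracket means HTML" check;
// not taken from HTML Parser's source.
class ResourceKind {

    // Returns true when the first non-whitespace character is '<'.
    // Skipping leading whitespace is an assumption of this sketch.
    static boolean looksLikeHtml(String resource) {
        if (resource == null) {
            throw new IllegalArgumentException("resource must not be null");
        }
        for (int i = 0; i < resource.length(); i++) {
            char ch = resource.charAt(i);
            if (!Character.isWhitespace(ch)) {
                return ch == '<';
            }
        }
        return false; // empty or all whitespace: not treated as HTML
    }

    public static void main(String[] args) {
        // A markup string would be treated as HTML content.
        System.out.println(looksLikeHtml("<html><body>Hello</body></html>"));
        // Anything else would be treated as a URL or file name.
        System.out.println(looksLikeHtml("http://htmlparser.sourceforge.net/"));
    }
}
```

One practical consequence of such a check is that an HTML fragment held in a variable should begin with a tag; a string that starts with plain text would be taken for a URL or file name.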
From: Mohammad W. <waq...@gm...> - 2010-08-01 06:57:36
|
How can I parse an HTML string stored in a variable?

Vakar |
From: Johann H. <h.h...@ic...> - 2010-07-28 12:47:26
|
Hello community,

I am writing a website parser with htmlparser and I think it's a great library.

My problem is that the website I'm parsing shows me a captcha after a certain number of crawls. As a workaround I wrote a redial routine that reconnects my router to get a new IP. That works quite well, but my JVM seems to cache DNS. I read this post http://forum.vis.ethz.ch/showthread.php?t=13457 and applied everything suggested there, but I still can't continue parsing after a reconnect: I get a ConnectionTimeoutException from htmlparser. It seems there might still be some kind of cache.

Could anybody tell me how I can get a new instance of Parser to connect after a reconnect?

Thank you.
Hans. |
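For reference, the JVM-level settings usually meant by "disabling the DNS cache" can be sketched as follows. The property names networkaddress.cache.ttl, networkaddress.cache.negative.ttl and http.keepAlive are standard JDK properties; the class and method names are hypothetical, and whether these settings resolve the timeout described above is not known. The sketch assumes the library's connections go through java.net, and the calls should run before the first host lookup or connection is made.

```java
import java.security.Security;

// Sketch of JVM-wide network settings; NetworkCacheSettings is a hypothetical name.
class NetworkCacheSettings {

    /** Ask the JVM not to cache successful or failed host name lookups. */
    static void disableDnsCaching() {
        Security.setProperty("networkaddress.cache.ttl", "0");
        Security.setProperty("networkaddress.cache.negative.ttl", "0");
    }

    /**
     * Ask java.net's HTTP client not to reuse connections. A pooled connection
     * opened before the router reconnect may no longer be usable afterwards.
     */
    static void disableHttpKeepAlive() {
        System.setProperty("http.keepAlive", "false");
    }

    public static void main(String[] args) {
        disableDnsCaching();
        disableHttpKeepAlive();
        System.out.println("networkaddress.cache.ttl = "
                + Security.getProperty("networkaddress.cache.ttl"));
        System.out.println("http.keepAlive = " + System.getProperty("http.keepAlive"));
    }
}
```

Since the reconnect changes the client's own address rather than the server's, reused keep-alive connections are worth ruling out alongside the DNS cache.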
From: Derrick O. <der...@gm...> - 2010-07-08 04:38:36
|
Did you set STRICT to false?
http://htmlparser.sourceforge.net/javadoc/org/htmlparser/scanners/ScriptScanner.html

On Wed, Jul 7, 2010 at 9:48 PM, Niket Arora <nik...@ex...> wrote:
> I m parsing a page
> http://www.healthline.com/search?q1=how+to+improve+prostate+blood+levels using
> htmlparser api and I m getting content inside a script tag in some other tag
> and reason for this is html tags are present in a string inside javascript
> tags and are not escaped …. so htmlparser api is closing on those tags.
>
> ================================================================
>
> <div id="myHealthlineHeader">
> <script>
> if(isLoggedIn()) {
> document.write("<a href=\"/action/LogOutServlet\">Sign Off</a>&nbsp;|&nbsp;<a rel=\"nofollow\" href=\"/myhealthline/account_overview.jsp\">My Healthline</a>&nbsp;|&nbsp;Welcome, <strong>" + getNickname() + "</strong>");
> document.getElementById("myHealthlineHeader").className = "hl_state_top_signed_in";
> } else {
> document.write("<div style=\"float:right;text-align:right;padding:0 5px 0 0;\">&nbsp;|&nbsp;<a class=\"underlineless\" rel=\"nofollow\" href=\"/yourfeedback.jsp?url=%2Fsearch%3Fq1%3Dhow%2Bto%2Bimprove%2Bprostate%2Bblood%2Blevels\">Feedback</a></div>");
> document.write("<div style=\"float:right\"><a class=\"underlineless\" rel=\"nofollow\" href=\"/signin.jsp\">Sign in</a>&nbsp;|&nbsp;<a class=\"underlineless\" rel=\"nofollow\" href=\"/registration.jsp\">Join Now</a> </div>")
> document.getElementById("myHealthlineHeader").className = "hl_state_top";
> }
> </script>
> </div>
>
> ================================================================
>
> Is there anyway to fix this issue?
>
> Regards
> Niket |
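To see why unescaped tags inside a JavaScript string can end a script early, the two ways of locating the end of script content can be compared in a small stdlib-only sketch. The strict rule follows HTML 4.01 appendix B.3.2 (script data ends at the first "</" followed by a letter); the quote-aware rule skips string literals and looks for "</script". The class and method names are hypothetical, and whether this mirrors what ScriptScanner does for each value of STRICT is an assumption, so treat it as an illustration rather than the library's algorithm.

```java
// Stdlib-only illustration; ScriptEndDemo is a hypothetical name, not htmlparser API.
class ScriptEndDemo {

    private static boolean isAsciiLetter(char ch) {
        return (ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z');
    }

    /** Strict rule: content ends at the first "</" followed by an ASCII letter. */
    static int strictEnd(String s) {
        for (int i = 0; i + 2 < s.length(); i++) {
            if (s.charAt(i) == '<' && s.charAt(i + 1) == '/'
                    && isAsciiLetter(s.charAt(i + 2))) {
                return i;
            }
        }
        return -1;
    }

    /**
     * Quote-aware rule: ignore anything inside '...' or "..." literals and stop
     * at the first "</script" outside them. Comments and regex literals are
     * not handled in this sketch.
     */
    static int quoteAwareEnd(String s) {
        char quote = 0; // 0 means "not inside a string literal"
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (quote != 0) {
                if (ch == '\\') {
                    i++; // skip the escaped character
                } else if (ch == quote) {
                    quote = 0;
                }
            } else if (ch == '"' || ch == '\'') {
                quote = ch;
            } else if (s.regionMatches(true, i, "</script", 0, 8)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String js = "document.write(\"<b>hi</b>\");</script>";
        System.out.println("strict end:      " + strictEnd(js));
        System.out.println("quote-aware end: " + quoteAwareEnd(js));
    }
}
```

Under the strict rule the script in this example is cut at the </b> inside the string, which matches the symptom reported in the quoted message.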
From: Niket A. <nik...@ex...> - 2010-07-07 20:07:05
|
I'm parsing the page http://www.healthline.com/search?q1=how+to+improve+prostate+blood+levels using the htmlparser API, and content from inside a script tag ends up inside other tags. The reason is that HTML tags appear unescaped in a string inside the JavaScript, so the htmlparser API closes the script on those tags.

================================================================

<div id="myHealthlineHeader">
<script>
if(isLoggedIn()) {
    document.write("<a href=\"/action/LogOutServlet\">Sign Off</a>&nbsp;|&nbsp;<a rel=\"nofollow\" href=\"/myhealthline/account_overview.jsp\">My Healthline</a>&nbsp;|&nbsp;Welcome, <strong>" + getNickname() + "</strong>");
    document.getElementById("myHealthlineHeader").className = "hl_state_top_signed_in";
} else {
    document.write("<div style=\"float:right;text-align:right;padding:0 5px 0 0;\">&nbsp;|&nbsp;<a class=\"underlineless\" rel=\"nofollow\" href=\"/yourfeedback.jsp?url=%2Fsearch%3Fq1%3Dhow%2Bto%2Bimprove%2Bprostate%2Bblood%2Blevels\">Feedback</a></div>");
    document.write("<div style=\"float:right\"><a class=\"underlineless\" rel=\"nofollow\" href=\"/signin.jsp\">Sign in</a>&nbsp;|&nbsp;<a class=\"underlineless\" rel=\"nofollow\" href=\"/registration.jsp\">Join Now</a> </div>")
    document.getElementById("myHealthlineHeader").className = "hl_state_top";
}
</script>
</div>

================================================================

Is there any way to fix this issue?

Regards
Niket |
From: Oliver S. <oli...@gm...> - 2010-07-05 16:31:02
|
Hi,

I need to read arbitrary HTML (HTML 4 transitional, XHTML 1.0 strict, ...), extract the body as a fragment, and output it again as another standard (XHTML).

Reading the file is simple enough:

Parser p = new Parser(resource);
NodeFilter f = new NodeClassFilter(BodyTag.class);
NodeList listOfBodies = p.extractAllNodesThatMatch(f);
Node firstBody = listOfBodies.elementAt(0);
NodeList bodyChildren = firstBody.getChildren();
System.out.println(bodyChildren.toHtml());

From this, how can I output either valid HTML 4.0 code or valid XHTML 1.0 code?

Best regards
Oliver |