htmlparser-user Mailing List for HTML Parser (Page 32)
From: Martin S. <mst...@gm...> - 2007-01-11 12:35:43
|
Hi,

While I'm not a developer of HTMLParser, I think it is very unlikely that somewhere in the code a ParserException is printed to standard out. I use StringBean rather intensively in my application, but did not have this problem. The only thing I can think of is that your logger is printing the exception to standard out or something (you are passing the exception to the error method). Does the problem occur when you comment out the logger.error("Could not parse url", e); ?

-- Martin

2007/1/11, Øystein Lervik Larsen <oys...@me...>:
> Martin Sturm wrote:
> > Hi,
>
> Good morning, thanks for your reply!
>
> > Are you sure you are running your own program?
>
> Not quite sure what you mean there...
>
> > StringExtractor is only a sample application for the StringBean, so it
> > would be better to use the StringBean directly instead of StringExtractor.
>
> I tried that too and the same thing happened, see below.
>
> > I did not experience your problem (I use StringBean rather intensively)
> > and it would mean that the ParserException is caught somewhere in the
> > HTMLParser codebase and printed directly to standard out. This is not
> > very likely I think.
>
> That is exactly what I'm suspecting. Though unlikely - is it possible?
>
> > Maybe you can try to use StringBean directly instead of StringExtractor:
> >
> >     String content = "";
> >     try {
> >         StringBean sb;
> >
> >         sb = new StringBean ();
> >         sb.setLinks (false);
> >         sb.setURL (url);
> >         content = sb.getStrings();
> >         System.out.println(content);
> >     } catch (ParserException e) {
> >         logger.error("Could not parse url", e);
> >         System.out.println("test");
> >     }
>
> I modified my code according to your example but removed the try/catch
> and the exception was still written to the console the same way as it
> used to. When I'm not catching the exception I guess it gets caught some
> place else?
>
> The exception occurs when the web server responds with http error code
> 401 Unauthorized.
>
> I'm developing a web application using Tomcat and Spring framework if
> that's relevant.
>
> -Øystein
From: Martin S. <mst...@gm...> - 2007-01-11 12:22:18
|
Hi,

I did a quick test, and I think you forgot to add the end tag to the toCreate object. You should add:

    toCreate.setEndTag(new Span());

before your final println statement. The following code snippet works for me:

    TagNode toCreate = new Span();
    toCreate.setAttribute("key", "Test", '"');
    NodeList nl = new NodeList();
    nl.add(new TextNode("Test2"));
    toCreate.setChildren(nl);
    toCreate.setEndTag(new Span());
    System.out.println(toCreate.toHtml());

Result:

    <SPAN key="Test">Test2<SPAN>

Hope this will help you.

-- Martin

2007/1/10, Joel <jo...@ha...>:
> I want to wrap a text string with a span tag. I've tried the following, but
> I'm running into a problem: the tag's children aren't being displayed.
>
>     //New <span key="x">some text here</span>
>     TagNode toCreate = new Span();
>     toCreate.setAttribute("key", getKey(str), '"');
>     NodeList nl = new NodeList();
>     nl.add(new TextNode(str));
>     toCreate.setChildren(nl);
>     System.out.println(toCreate.toHtml());
>
> This ends up showing <span key="x"> without the text node and end tag,
> what am I doing wrong?
>
> Joel
From: <oys...@me...> - 2007-01-11 09:19:40
|
Martin Sturm wrote:
> Hi,

Good morning, thanks for your reply!

> Are you sure you are running your own program?

Not quite sure what you mean there...

> StringExtractor is only a sample application for the StringBean, so it
> would be better to use the StringBean directly instead of StringExtractor.

I tried that too and the same thing happened, see below.

> I did not experience your problem (I use StringBean rather intensively)
> and it would mean that the ParserException is caught somewhere in the
> HTMLParser codebase and printed directly to standard out. This is not
> very likely I think.

That is exactly what I'm suspecting. Though unlikely - is it possible?

> Maybe you can try to use StringBean directly instead of StringExtractor:
>
>     String content = "";
>     try {
>         StringBean sb;
>
>         sb = new StringBean ();
>         sb.setLinks (false);
>         sb.setURL (url);
>         content = sb.getStrings();
>         System.out.println(content);
>     } catch (ParserException e) {
>         logger.error("Could not parse url", e);
>         System.out.println("test");
>     }

I modified my code according to your example but removed the try/catch and the exception was still written to the console the same way as it used to. When I'm not catching the exception I guess it gets caught some place else?

The exception occurs when the web server responds with http error code 401 Unauthorized.

I'm developing a web application using Tomcat and Spring framework if that's relevant.

-Øystein
From: Joel <jo...@ha...> - 2007-01-10 20:02:49
|
I want to wrap a text string with a span tag. I've tried the following, but I'm running into a problem: the tag's children aren't being displayed.

    //New <span key="x">some text here</span>
    TagNode toCreate = new Span();
    toCreate.setAttribute("key", getKey(str), '"');
    NodeList nl = new NodeList();
    nl.add(new TextNode(str));
    toCreate.setChildren(nl);
    System.out.println(toCreate.toHtml());

This ends up showing <span key="x"> without the text node and end tag, what am I doing wrong?

Joel
From: sebb <se...@gm...> - 2007-01-10 18:13:01
|
FYI: I've now tested the parser using ScriptScanner.STRICT=false and that solved the "problem". Thanks again.

On 08/01/07, sebb <se...@gm...> wrote:
> Sorry, my bad - I've now read the document referenced in the scanner
> source, and I see that "</" acts as the terminator unless suitably
> hidden.
>
> S.
>
> On 08/01/07, sebb <se...@gm...> wrote:
> > Thanks for the quick reply. I'll give it a try.
> >
> > However, I'm not sure why the script example is bad.
> >
> > It is not enclosed in "<!--" and "// -->", but AIUI those are only
> > needed as a work-round for older browsers that did not understand the
> > <script> tag.
> >
> > On 08/01/07, Derrick Oswald <Der...@ro...> wrote:
> > > For parsing bad script like this you probably want to set the static
> > > boolean value org.htmlparser.scanners.ScriptScanner.STRICT to false. See
> > > the explanation in the ScriptScanner.java file.
> > >
> > > sebb wrote:
> > > > [sample script and parser.cmd output trimmed - quoted in full in the 2007-01-08 00:40:53 message below]
> > > >
> > > > It looks like the closing tag is being recognised - though the opening
> > > > tag is not.
> > > >
> > > > Is this a bug, or have I misunderstood something?
From: Martin S. <mst...@gm...> - 2007-01-10 16:31:33
|
Hi,

Are you sure you are running your own program?

StringExtractor is only a sample application for the StringBean, so it would be better to use the StringBean directly instead of StringExtractor. I did not experience your problem (I use StringBean rather intensively) and it would mean that the ParserException is caught somewhere in the HTMLParser codebase and printed directly to standard out. This is not very likely I think. However, the main() method of StringExtractor does catch ParserException and print it to standard out.

Maybe you can try to use StringBean directly instead of StringExtractor:

    String content = "";
    try {
        StringBean sb;

        sb = new StringBean ();
        sb.setLinks (false);
        sb.setURL (url);
        content = sb.getStrings();
        System.out.println(content);
    } catch (ParserException e) {
        logger.error("Could not parse url", e);
        System.out.println("test");
    }

-- Martin Sturm

2007/1/10, Øystein Lervik Larsen <oys...@me...>:
> Hi list! :-]
>
> I seem to have a problem catching the ParserException when using
> StringExtractor.
>
> My console says:
> org.htmlparser.util.ParserException: Exception getting input stream from
> (and so on...)
>
> ...but my logger (log4j) does not print and the string "test" is not
> written to the console.
>
>     String content = "";
>     try {
>         StringExtractor se = new StringExtractor(url);
>         content = se.extractStrings(false);
>         System.out.println(content);
>     }
>     catch(ParserException e){
>         logger.error("Could not parse url",e);
>         System.out.println("test");
>     }
>
> Could the exception be handled some place else that I'm not aware of?
> The string "content" is sometimes empty due to a 401 error.
>
> Thanks in advance for any reply!
>
> Best regards,
> Øystein
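A related sketch: since the failing pages answer 401, one way to avoid the exception altogether is to check the HTTP status with plain java.net before handing the URL to StringBean. The class name and the HEAD-request probe here are illustrative, not part of HTML Parser.

    import java.net.HttpURLConnection;
    import java.net.URL;

    import org.htmlparser.beans.StringBean;

    public class GuardedExtract {

        // Returns the page text, or null if the server did not answer 200 OK.
        static String extractIfOk(String url) throws Exception {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setRequestMethod("HEAD");          // cheap probe; use GET if the server rejects HEAD
            int status = conn.getResponseCode();    // e.g. 401 Unauthorized
            conn.disconnect();
            if (status != HttpURLConnection.HTTP_OK) {
                return null;                        // skip pages we cannot fetch
            }
            StringBean sb = new StringBean();
            sb.setLinks(false);
            sb.setURL(url);
            return sb.getStrings();
        }
    }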
From: <oys...@me...> - 2007-01-10 15:22:02
|
Hi list! :-]

I seem to have a problem catching the ParserException when using StringExtractor.

My console says:
org.htmlparser.util.ParserException: Exception getting input stream from (and so on...)

...but my logger (log4j) does not print and the string "test" is not written to the console.

    String content = "";
    try {
        StringExtractor se = new StringExtractor(url);
        content = se.extractStrings(false);
        System.out.println(content);
    }
    catch(ParserException e){
        logger.error("Could not parse url",e);
        System.out.println("test");
    }

Could the exception be handled some place else that I'm not aware of? The string "content" is sometimes empty due to a 401 error.

Thanks in advance for any reply!

Best regards,
Øystein
From: Martin S. <mst...@gm...> - 2007-01-10 15:12:12
|
Hello,

I'm using HTMLParser for extracting text from a HTML page in order to index it using a full text search engine. During the testing phase, I discovered that some web pages are not parsed correctly by HTMLParser. One of these webpages is for example http://www.microsoft.com. I think the problem is that according to the HTTP headers, the encoding is in UTF-8, but in HTML META tags this is changed to UTF-16. This can be handled by catching the EncodingChangeException, but this doesn't prevent the textual content of the site from being interpreted incorrectly.

A concrete example to see the problem:

    StringBean sb = new StringBean();
    sb.setURL("http://www.microsoft.com");
    System.out.println(sb.getStrings());

The output of the above code snippet is a run of garbled multi-byte characters - not really what I was expecting.

Am I missing something, or is this a bug in the HTMLParser?

Martin Sturm
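A common idiom for the encoding switch itself is to catch EncodingChangeException, reset the parser, and parse again - a minimal sketch, assuming the page is read with the Parser class rather than StringBean (whether this also cures the garbled text depends on the page and its headers):

    import org.htmlparser.Parser;
    import org.htmlparser.util.EncodingChangeException;
    import org.htmlparser.util.NodeList;
    import org.htmlparser.util.ParserException;

    public class EncodingRetry {

        public static void main(String[] args) throws ParserException {
            Parser parser = new Parser("http://www.microsoft.com");
            NodeList nodes;
            try {
                nodes = parser.parse(null);     // first pass uses the HTTP-header encoding
            } catch (EncodingChangeException e) {
                parser.reset();                 // a META tag announced a different charset
                nodes = parser.parse(null);     // second pass with the new encoding
            }
            System.out.println(nodes.size() + " top-level nodes parsed");
        }
    }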
From: sebb <se...@gm...> - 2007-01-08 14:58:28
|
Sorry, my bad - I've now read the document referenced in the scanner source, and I see that "</" acts as the terminator unless suitably hidden.

S.

On 08/01/07, sebb <se...@gm...> wrote:
> Thanks for the quick reply. I'll give it a try.
>
> However, I'm not sure why the script example is bad.
>
> It is not enclosed in "<!--" and "// -->", but AIUI those are only
> needed as a work-round for older browsers that did not understand the
> <script> tag.
>
> On 08/01/07, Derrick Oswald <Der...@ro...> wrote:
> > For parsing bad script like this you probably want to set the static
> > boolean value org.htmlparser.scanners.ScriptScanner.STRICT to false. See
> > the explanation in the ScriptScanner.java file.
> >
> > sebb wrote:
> > > [sample script and parser.cmd output trimmed - quoted in full in the 2007-01-08 00:40:53 message below]
> > >
> > > It looks like the closing tag is being recognised - though the opening
> > > tag is not.
> > >
> > > Is this a bug, or have I misunderstood something?
From: sebb <se...@gm...> - 2007-01-08 11:21:09
|
Thanks for the quick reply. I'll give it a try.

However, I'm not sure why the script example is bad.

It is not enclosed in "<!--" and "// -->", but AIUI those are only needed as a work-round for older browsers that did not understand the <script> tag.

On 08/01/07, Derrick Oswald <Der...@ro...> wrote:
> For parsing bad script like this you probably want to set the static
> boolean value org.htmlparser.scanners.ScriptScanner.STRICT to false. See
> the explanation in the ScriptScanner.java file.
>
> sebb wrote:
> > The sample script:
> >
> > <HTML>
> >   <body>
> >   <script>
> >  fred = "<img src='a.gif'></img>"
> >   </script>
> >   </body>
> > </HTML>
> >
> > generates the following output from parser.cmd:
> >
> > Tag (0[0,0],6[0,6]): HTML
> > Txt (6[0,6],10[1,2]): \n
> > Tag (10[1,2],16[1,8]): body
> > Txt (16[1,8],20[2,2]): \n
> > Tag (20[2,2],28[2,10]): script
> > Txt (28[2,10],57[3,27]): \n fred = "<img src='a.gif'>
> > End (57[3,27],57[3,27]): /script
> > End (57[3,27],63[3,33]): /img
> > Txt (63[3,33],68[4,2]): "\n
> > End (68[4,2],77[4,11]): /script
> > Txt (77[4,11],81[5,2]): \n
> > End (81[5,2],88[5,9]): /body
> > Txt (88[5,9],90[6,0]): \n
> > End (90[6,0],97[6,7]): /HTML
> > Txt (97[6,7],101[8,0]): \n\n
> >
> > It looks like the closing tag is being recognised - though the opening
> > tag is not.
> >
> > Is this a bug, or have I misunderstood something?
From: Derrick O. <Der...@Ro...> - 2007-01-08 01:53:29
|
For parsing bad script like this you probably want to set the static boolean value org.htmlparser.scanners.ScriptScanner.STRICT to false. See the explanation in the ScriptScanner.java file.

sebb wrote:
> The sample script:
>
> <HTML>
>   <body>
>   <script>
>  fred = "<img src='a.gif'></img>"
>   </script>
>   </body>
> </HTML>
>
> generates the following output from parser.cmd:
>
> Tag (0[0,0],6[0,6]): HTML
> Txt (6[0,6],10[1,2]): \n
> Tag (10[1,2],16[1,8]): body
> Txt (16[1,8],20[2,2]): \n
> Tag (20[2,2],28[2,10]): script
> Txt (28[2,10],57[3,27]): \n fred = "<img src='a.gif'>
> End (57[3,27],57[3,27]): /script
> End (57[3,27],63[3,33]): /img
> Txt (63[3,33],68[4,2]): "\n
> End (68[4,2],77[4,11]): /script
> Txt (77[4,11],81[5,2]): \n
> End (81[5,2],88[5,9]): /body
> Txt (88[5,9],90[6,0]): \n
> End (90[6,0],97[6,7]): /HTML
> Txt (97[6,7],101[8,0]): \n\n
>
> It looks like the closing tag is being recognised - though the opening
> tag is not.
>
> Is this a bug, or have I misunderstood something?
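A minimal sketch of that setting in context; note the field is static, so it affects every parser in the JVM, and the inline HTML below just reproduces the sample from this thread:

    import org.htmlparser.Parser;
    import org.htmlparser.scanners.ScriptScanner;
    import org.htmlparser.util.NodeList;
    import org.htmlparser.util.ParserException;

    public class LenientScriptDemo {

        public static void main(String[] args) throws ParserException {
            // Allow "</" inside script bodies instead of treating it as the end of the script.
            ScriptScanner.STRICT = false;

            String html = "<HTML><body><script>fred = \"<img src='a.gif'></img>\"</script></body></HTML>";
            Parser parser = Parser.createParser(html, "UTF-8");

            NodeList nodes = parser.parse(null);   // no filter: keep every node
            for (int i = 0; i < nodes.size(); i++) {
                System.out.println(nodes.elementAt(i).toHtml());
            }
        }
    }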
From: sebb <se...@gm...> - 2007-01-08 00:40:53
|
The sample script:

    <HTML>
      <body>
      <script>
     fred = "<img src='a.gif'></img>"
      </script>
      </body>
    </HTML>

generates the following output from parser.cmd:

    Tag (0[0,0],6[0,6]): HTML
    Txt (6[0,6],10[1,2]): \n
    Tag (10[1,2],16[1,8]): body
    Txt (16[1,8],20[2,2]): \n
    Tag (20[2,2],28[2,10]): script
    Txt (28[2,10],57[3,27]): \n fred = "<img src='a.gif'>
    End (57[3,27],57[3,27]): /script
    End (57[3,27],63[3,33]): /img
    Txt (63[3,33],68[4,2]): "\n
    End (68[4,2],77[4,11]): /script
    Txt (77[4,11],81[5,2]): \n
    End (81[5,2],88[5,9]): /body
    Txt (88[5,9],90[6,0]): \n
    End (90[6,0],97[6,7]): /HTML
    Txt (97[6,7],101[8,0]): \n\n

It looks like the closing tag is being recognised - though the opening tag is not.

Is this a bug, or have I misunderstood something?
From: Jeffrey B. <jb...@cs...> - 2007-01-07 21:24:14
|
> > I'd like to be able to run StringBean on a given Node and have it give
> > me all the text from that Node and from its descendants on down the
> > DOM tree. I can do something similar on the whole page by using a
> > Parser to first get a NodeList of all of the nodes in the tree and
> > then run the following:
> >
> >     StringExtractor sb = new StringExtractor();
> >     all_nodes.visitAllNodesWith(sb);
> >
> > Is there a way to either get all of the descendants of a given Node or
> > to otherwise get just the text from all the descendants of a given
> > Node? Worst case, I can write my own recursive function that will
> > gather up all the Nodes and their children and their descendants - I'm
> > just thinking that there is probably an existing way to do this.
> >
> > Thanks!
> > Jeff
>
> If the Node is a CompositeTag, you can (I think) use CompositeTag.accept(sb).

This worked great - thanks for pointing it out! I'm sure that accept somehow makes sense as an intuitive name, but it was lost on me.

-Jeff

> Ian
>
> On 1/7/07, Jeffrey Bigham <jb...@cs...> wrote:
> > Hi,
From: Ian M. <ian...@gm...> - 2007-01-07 12:45:35
|
For a recursive function that walks the HTMLParser DOM-like structure, you can use the NodeTreeWalker class and just keep the text nodes. That's pretty simple.

If the Node is a CompositeTag, you can (I think) use CompositeTag.accept(sb).

Ian

On 1/7/07, Jeffrey Bigham <jb...@cs...> wrote:
> Hi,
>
> I'd like to be able to run StringBean on a given Node and have it give
> me all the text from that Node and from its descendants on down the
> DOM tree. I can do something similar on the whole page by using a
> Parser to first get a NodeList of all of the nodes in the tree and
> then run the following:
>
>     StringExtractor sb = new StringExtractor();
>     all_nodes.visitAllNodesWith(sb);
>
> Is there a way to either get all of the descendants of a given Node or
> to otherwise get just the text from all the descendants of a given
> Node? Worst case, I can write my own recursive function that will
> gather up all the Nodes and their children and their descendants - I'm
> just thinking that there is probably an existing way to do this.
>
> Thanks!
> Jeff
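For the recursive fallback Jeff mentions, a sketch that walks getChildren() directly and keeps only the text nodes; the class and method names are illustrative, and the visitor route (accept with a text-collecting visitor) achieves the same thing in less code:

    import org.htmlparser.Node;
    import org.htmlparser.nodes.TextNode;
    import org.htmlparser.util.NodeList;

    public class DescendantText {

        // Appends the text of a node and all of its descendants, depth first.
        static void collectText(Node node, StringBuffer out) {
            if (node instanceof TextNode) {
                out.append(((TextNode) node).getText());
            }
            NodeList children = node.getChildren();
            if (children != null) {
                for (int i = 0; i < children.size(); i++) {
                    collectText(children.elementAt(i), out);
                }
            }
        }

        // Usage: StringBuffer buffer = new StringBuffer();
        //        collectText(someNode, buffer);
        //        String text = buffer.toString();
    }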
From: Jeffrey B. <jb...@cs...> - 2007-01-07 07:33:39
|
Hi,

I'd like to be able to run StringBean on a given Node and have it give me all the text from that Node and from its descendants on down the DOM tree. I can do something similar on the whole page by using a Parser to first get a NodeList of all of the nodes in the tree and then run the following:

    StringExtractor sb = new StringExtractor();
    all_nodes.visitAllNodesWith(sb);

Is there a way to either get all of the descendants of a given Node or to otherwise get just the text from all the descendants of a given Node? Worst case, I can write my own recursive function that will gather up all the Nodes and their children and their descendants - I'm just thinking that there is probably an existing way to do this.

Thanks!
Jeff
From: Ian M. <ian...@gm...> - 2007-01-05 11:47:57
|
The XML you see in your browser isn't actually XML - it's HTML-encoded XML. Therefore it's actually text. So:

- Parse the document in HTML Parser, look for that div, then look for the text nodes within the div.
- You now have the XML as HTML-encoded text, and you have to convert it into XML. You can convert it in a number of ways, but the easiest would be to just replace the strings &lt; and &gt; with < and >.
- You'll now have XML, use an XML parser.

HTMLParser might be able to handle it - what you could do is register the various XML tags in there as CompositeTags in the PrototypicalNodeFactory to make it easier to deal with.

Ian

On 1/4/07, Jay Bhavsar <kin...@gm...> wrote:
> Hey guys,
> I have looked through all the examples and javadocs but I am still
> unsuccessful. Here is what I am trying to do.
>
> I would like to follow a link like the following.
>
> http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=protein&qty=1&c_start=1&list_uids=43366978&uids=&dopt=xml&dispmax=5&sendto=&from=begin&to=end&extrafeatpresent=1&ef_CDD=8&ef_MGC=16&ef_HPRD=32&ef_STS=64&ef_tRNA=128&ef_microRNA=256
>
> This displays a XML format of a report. I want to parse the XML
> section out of the web page and then parse 3-4 tags from the XML
> section. The text in XML format is in between the <div
> class='recordbody'> ... </div> tag. I was using
>
>     NodeList divs = list.extractAllNodesThatMatch (new TagNameFilter ("TITLE"));
>
>     NodeIterator i = divs.elements();
>
>     while (i.hasMoreNodes()){
>         System.out.println("has more nodes");
>         processMyNodes(i.nextNode());
>     }
>
> based on the example from the javadocs. But anything other than
> HTML in TagNameFilter returns nothing in divs.elements(). (It never
> prints "has more nodes")
>
> Can anyone help extract the XML part from this web page? Or is there a
> way I can directly extract what I need from this site without saving
> the XML part first and then using a SAX XMLParser to extract it?
>
> Note everything between <div class="recordbody">...</div> is text, it
> is not an xml document. If you view the source you will see what I mean.
> (Sorry, I don't mean to insult anyone's intelligence, but I just want
> to be thorough about my problem.)
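A sketch of the first two steps, assuming the markup of interest sits in <div class='recordbody'> as on Jay's page and that org.htmlparser.util.Translate handles the entity decoding; the class name and the shortened URL are placeholders:

    import org.htmlparser.Parser;
    import org.htmlparser.filters.AndFilter;
    import org.htmlparser.filters.HasAttributeFilter;
    import org.htmlparser.filters.TagNameFilter;
    import org.htmlparser.util.NodeList;
    import org.htmlparser.util.ParserException;
    import org.htmlparser.util.Translate;

    public class RecordBodyExtractor {

        public static void main(String[] args) throws ParserException {
            Parser parser = new Parser("http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?...");  // full query string omitted

            // <div class="recordbody"> ... </div>
            NodeList divs = parser.extractAllNodesThatMatch(
                new AndFilter(new TagNameFilter("div"),
                              new HasAttributeFilter("class", "recordbody")));

            if (divs.size() > 0) {
                String encoded = divs.elementAt(0).toPlainTextString();  // still &lt;tag&gt; text
                String xml = Translate.decode(encoded);                  // now real XML markup
                System.out.println(xml);
                // hand "xml" to a SAX or DOM parser from here
            }
        }
    }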
From: Jay B. <kin...@gm...> - 2007-01-04 23:46:52
|
Hey guys,
I have looked through all the examples and javadocs but I am still unsuccessful. Here is what I am trying to do.

I would like to follow a link like the following.

http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=protein&qty=1&c_start=1&list_uids=43366978&uids=&dopt=xml&dispmax=5&sendto=&from=begin&to=end&extrafeatpresent=1&ef_CDD=8&ef_MGC=16&ef_HPRD=32&ef_STS=64&ef_tRNA=128&ef_microRNA=256

This displays a XML format of a report. I want to parse the XML section out of the web page and then parse 3-4 tags from the XML section. The text in XML format is in between the <div class='recordbody'> ... </div> tag. I was using

    NodeList divs = list.extractAllNodesThatMatch (new TagNameFilter ("TITLE"));

    NodeIterator i = divs.elements();

    while (i.hasMoreNodes()){
        System.out.println("has more nodes");
        processMyNodes(i.nextNode());
    }

based on the example from the javadocs. But anything other than HTML in TagNameFilter returns nothing in divs.elements(). (It never prints "has more nodes")

Can anyone help extract the XML part from this web page? Or is there a way I can directly extract what I need from this site without saving the XML part first and then using a SAX XMLParser to extract it?

Note everything between <div class="recordbody">...</div> is text, it is not an xml document. If you view the source you will see what I mean. (Sorry, I don't mean to insult anyone's intelligence, but I just want to be thorough about my problem.)
From: walterwkh <wal...@ho...> - 2006-12-13 16:49:36
|
How about this:

    NodeFilter filter = new NodeClassFilter(ParagraphTag.class);
    list = parser.extractAllNodesThatMatch(filter);

-----Original Message-----
From: htm...@li... [mailto:htm...@li...] On Behalf Of Derrick Oswald
Sent: Thursday, December 14, 2006 12:02 AM
To: htmlparser user list
Subject: Re: [Htmlparser-user] extracting text from paragraph tag<p>

Did you try the StringBean?

----- Original Message ----
From: Khan Khurram Ali <khu...@ya...>
To: htm...@li...
Sent: Wednesday, December 13, 2006 7:11:03 AM
Subject: [Htmlparser-user] extracting text from paragraph tag<p>

hi there !
I am new to using htmlparser, and I need to extract pure textual content of the webpage like paragraphs in the webpage. I tried to use the paragraph tag class but could not get it to work.
Is there any easy way? And one thing: the API documentation is extremely complicated and I cannot understand it.
Please do help me out with this..... I just need to extract the pure text of the web page that is inside the <p> tag.

Thanking you for your anticipation in advance.

Khan, Khurram
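Expanded into a runnable sketch (the URL is a placeholder; NodeClassFilter and ParagraphTag are the standard org.htmlparser classes):

    import org.htmlparser.Parser;
    import org.htmlparser.filters.NodeClassFilter;
    import org.htmlparser.tags.ParagraphTag;
    import org.htmlparser.util.NodeList;
    import org.htmlparser.util.ParserException;

    public class ParagraphText {

        public static void main(String[] args) throws ParserException {
            Parser parser = new Parser("http://example.com/");  // placeholder URL

            // Keep only <p> elements.
            NodeList paragraphs = parser.extractAllNodesThatMatch(
                new NodeClassFilter(ParagraphTag.class));

            // Print the text of each paragraph with the markup stripped.
            for (int i = 0; i < paragraphs.size(); i++) {
                System.out.println(paragraphs.elementAt(i).toPlainTextString());
            }
        }
    }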
From: Derrick O. <der...@ro...> - 2006-12-13 16:02:34
|
Did you try the StringBean?

----- Original Message ----
From: Khan Khurram Ali <khu...@ya...>
To: htm...@li...
Sent: Wednesday, December 13, 2006 7:11:03 AM
Subject: [Htmlparser-user] extracting text from paragraph tag<p>

hi there !
I am new to using htmlparser, and I need to extract pure textual content of the webpage like paragraphs in the webpage. I tried to use the paragraph tag class but could not get it to work.
Is there any easy way? And one thing: the API documentation is extremely complicated and I cannot understand it.
Please do help me out with this..... I just need to extract the pure text of the web page that is inside the <p> tag.

Thanking you for your anticipation in advance.

Khan, Khurram
From: Khan K. A. <khu...@ya...> - 2006-12-13 12:11:13
|
hi there !

I am new to using htmlparser, and I need to extract pure textual content of the webpage like paragraphs in the webpage. I tried to use the paragraph tag class but could not get it to work.

Is there any easy way? And one thing: the API documentation is extremely complicated and I cannot understand it.

Please do help me out with this..... I just need to extract the pure text of the web page that is inside the <p> tag.

Thanking you for your anticipation in advance.

Khan, Khurram
From: <ger...@gm...> - 2006-11-20 10:46:14
|
Hello everybody,

I'm trying to move my project under Maven 2.0 but I can't find any POM file for HTML Parser. Should I build it myself?.... Surely one already exists, doesn't it?

Best regards
G.D.
From: Eugeny N D. <bo...@re...> - 2006-11-08 16:50:36
|
> Hi there, I found this page: http://www.katzenfinch.com/
>
> This page contains several links, but HtmlParser does not follow them - in
> general, after parsing, it has only head and meta tags available - no body
> tag with links, tables etc.
>
> Looks like a CDATA item inside JavaScript breaks things?
> Could somebody please advise?

I tried to use this code:

    import java.io.InputStream;
    import java.util.LinkedList;

    import org.apache.log4j.Logger;
    import org.xml.sax.Attributes;
    import org.xml.sax.ErrorHandler;
    import org.xml.sax.InputSource;
    import org.xml.sax.SAXException;
    import org.xml.sax.SAXParseException;
    import org.xml.sax.helpers.DefaultHandler;

    public class SAXHTMLParser extends DefaultHandler {

        private static Logger log = Logger.getLogger(SAXHTMLParser.class);

        public LinkedList parseDocument(InputStream document, String encoding) {
            try {
                org.xml.sax.XMLReader reader = org.xml.sax.helpers.XMLReaderFactory
                        .createXMLReader("org.htmlparser.sax.XMLReader");
                reader.setContentHandler(this);
                reader.setErrorHandler(new MyErrorHandler());
                reader.parse(new InputSource(document));
            } catch (Exception e) {
                log.error(e, e);
            }
            return new LinkedList();
        }

        /**
         * @see org.xml.sax.helpers.DefaultHandler#startElement(java.lang.String, java.lang.String, java.lang.String, org.xml.sax.Attributes)
         */
        public void startElement(String uri, String localName, String qName, Attributes attrs) throws SAXException {
            // if ("img".equalsIgnoreCase(qName) || "a".equalsIgnoreCase(qName)
            //         || "frame".equalsIgnoreCase(qName)
            //         || "title".equalsIgnoreCase(qName)
            //         || "base".equalsIgnoreCase(qName))
            log.debug(localName);
        }

        class MyErrorHandler implements ErrorHandler {

            /**
             * @see org.xml.sax.ErrorHandler#error(org.xml.sax.SAXParseException)
             */
            public void error(SAXParseException arg0) throws SAXException {
                log.error(arg0);
            }

            /**
             * @see org.xml.sax.ErrorHandler#fatalError(org.xml.sax.SAXParseException)
             */
            public void fatalError(SAXParseException arg0) throws SAXException {
                log.error(arg0);
            }

            /**
             * @see org.xml.sax.ErrorHandler#warning(org.xml.sax.SAXParseException)
             */
            public void warning(SAXParseException arg0) throws SAXException {
                log.error(arg0);
            }
        }
    }

and the results were:

    [main] DEBUG SAXHTMLParser - !DOCTYPE
    [main] DEBUG SAXHTMLParser - HTML
    [main] DEBUG SAXHTMLParser - HEAD
    [main] DEBUG SAXHTMLParser - TITLE
    [main] DEBUG SAXHTMLParser - META
    [main] DEBUG SAXHTMLParser - META
    [main] DEBUG SAXHTMLParser - STYLE
    [main] DEBUG SAXHTMLParser - SCRIPT

but if I switch to another SAX parser for HTML:

    org.xml.sax.XMLReader reader = org.xml.sax.helpers.XMLReaderFactory
            .createXMLReader("org.ccil.cowan.tagsoup.Parser");
    reader.setContentHandler(this);
    reader.setErrorHandler(new MyErrorHandler());
    reader.parse(new InputSource(document));

I see this:

    [main] DEBUG .SAXHTMLParser - html
    [main] DEBUG .SAXHTMLParser - head
    [main] DEBUG .SAXHTMLParser - title
    [main] DEBUG .SAXHTMLParser - meta
    [main] DEBUG .SAXHTMLParser - meta
    [main] DEBUG .SAXHTMLParser - style
    [main] DEBUG .SAXHTMLParser - script
    [main] DEBUG .SAXHTMLParser - body
    [main] DEBUG .SAXHTMLParser - div
    [main] DEBUG .SAXHTMLParser - table
    [main] DEBUG .SAXHTMLParser - tr
    [main] DEBUG .SAXHTMLParser - td
    ... (and so on through the rest of the body: td, tr, p, strong, br, span, img, div, a, noscript and script elements) ...

So it looks like the implementation of the SAX parser in HtmlParser is a bit buggy? Is it possible to provide a custom SAX parser for the HTMLParser library somehow?

-- Eugene N Dzhurinsky
From: Eugeny N D. <bo...@re...> - 2006-11-08 14:28:19
|
Hi there, I found this page: http://www.katzenfinch.com/

This page contains several links, but HtmlParser does not follow them - in general, after parsing, it has only head and meta tags available - no body tag with links, tables etc.

Looks like a CDATA item inside JavaScript breaks things? Could somebody please advise?

-- Eugene N Dzhurinsky
From: Derrick O. <Der...@Ro...> - 2006-11-06 13:10:16
|
Collect everything in a NodeList using parse(null), i.e. no filter. Then filter the NodeList each time using NodeList.extractAllNodesThatMatch().

Dave wrote:
> Hi,
>
>     Parser parser = new Parser();
>     parser.setResource("http://web-site");
>
>     ...
>     NodeList nodes = parser.extractAllNodesThatMatch(filter);
>
>     NodeList nodes1 = parser.extractAllNodesThatMatch(filter);
>
> The first call is correct, having the right node list,
> but the second call with the same filter returned null.
>
> I need to use the same parser multiple times without re-parsing the
> same page. parser.reset() will re-parse the same page. What should I do?
>
> Thanks for help.
>
> david
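A sketch of that parse-once, filter-many pattern; the URL and the two example filters are placeholders:

    import org.htmlparser.Parser;
    import org.htmlparser.filters.TagNameFilter;
    import org.htmlparser.util.NodeList;
    import org.htmlparser.util.ParserException;

    public class ParseOnceFilterMany {

        public static void main(String[] args) throws ParserException {
            Parser parser = new Parser("http://web-site");   // placeholder URL

            // Parse the page a single time; a null filter keeps every node.
            NodeList everything = parser.parse(null);

            // Filter the in-memory NodeList as often as needed - no re-parsing.
            NodeList links = everything.extractAllNodesThatMatch(new TagNameFilter("a"), true);
            NodeList tables = everything.extractAllNodesThatMatch(new TagNameFilter("table"), true);

            System.out.println(links.size() + " links, " + tables.size() + " tables");
        }
    }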