From: Ahmed A. <asa...@ya...> - 2014-01-02 14:21:35
|
Hi David,

InputStream is = page.getWebResponse().getContentAsStream();

Then save it to an OutputStream. You can use commons-io:

OutputStream out = new FileOutputStream("hello.pdf");
IOUtils.copy(is, out);
out.close();

Yours,
Ahmed

_______________________________________________
Htmlunit-user mailing list
Htm...@li...
https://lists.sourceforge.net/lists/listinfo/htmlunit-user |
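The save step in the reply above can also be sketched with the JDK alone. This is only a minimal sketch under stated assumptions: the class name `StreamSaver` and the method `save` are hypothetical, and the `InputStream` is assumed to come from `page.getWebResponse().getContentAsStream()` as in the reply. It copies raw bytes, because a `Writer` decodes the bytes as characters and may alter a binary file such as a PDF.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public final class StreamSaver {

    private StreamSaver() {
    }

    /**
     * Copies every byte of the stream to the target file, replacing an
     * existing file, and returns the number of bytes written.
     * The stream is closed afterwards.
     */
    public static long save(InputStream in, Path target) throws IOException {
        try (InputStream source = in) {
            return Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for page.getWebResponse().getContentAsStream() (assumption).
        byte[] sample = {'%', 'P', 'D', 'F', (byte) 0x80, (byte) 0xFF};
        Path target = Files.createTempFile("hello", ".pdf");
        long written = save(new ByteArrayInputStream(sample), target);
        System.out.println(written + " bytes written to " + target);
    }
}
```

With HtmlUnit on the classpath, the call would presumably look like `StreamSaver.save(page.getWebResponse().getContentAsStream(), Paths.get("hello.pdf"))`.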
From: David M. G. <mic...@gm...> - 2014-01-02 12:57:06
|
Hi Ahmed,

Thanks for your reply. This solves half of the problem. The immediate refresh handler redirects me automatically to the page which sends me the header "application/force-download". The question is how to emulate the browser behavior so that I get the PDF automatically.

Here is the code:

package test;

import java.io.IOException;
import java.net.MalformedURLException;

import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.ImmediateRefreshHandler;
import com.gargoylesoftware.htmlunit.Page;
import com.gargoylesoftware.htmlunit.WebClient;

public class ForceDownload {

    public static void main(String[] args) throws FailingHttpStatusCodeException, MalformedURLException, IOException {
        WebClient client = new WebClient();
        client.setRefreshHandler(new ImmediateRefreshHandler());
        final String downloadUrl = "http://archivoespañoldearte.revistas.csic.es/index.php/aea/article/download/552/549";
        final Page page = client.getPage(downloadUrl);
        System.out.println(page.getWebResponse().getContentType());
    }
}

I get the output:
application/force-download

How can I get to the PDF?

Thanks,
David |
From: David M. G. <mic...@gm...> - 2014-01-02 12:47:34
|
Hi all,

For the following URL:
http://archivoespañoldearte.revistas.csic.es/index.php/aea/article/download/552/549

When it is requested in IE8, a PDF file is automatically downloaded. As I understand it, HtmlUnit should emulate the browser behavior.

When running the following code:

package test;

import java.io.IOException;
import java.net.MalformedURLException;

import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.Page;
import com.gargoylesoftware.htmlunit.WebClient;

public class ForceDownload {

    public static void main(String[] args) throws FailingHttpStatusCodeException, MalformedURLException, IOException {
        WebClient client = new WebClient();
        final String downloadUrl = "http://archivoespañoldearte.revistas.csic.es/index.php/aea/article/download/552/549";
        final Page page = client.getPage(downloadUrl);
        System.out.println(page.getWebResponse().getContentType());
    }
}

I get as output:
application/force-download

Is there a way to tell HtmlUnit to download the PDF automatically, like in IE8?

Thanks,
David |
From: Ahmed A. <asa...@ya...> - 2013-12-26 15:17:32
|
Hi David,

The page refreshes in 2 seconds and forwards to the PDF location.

You can try:

WebClient webClient = new WebClient();
webClient.setRefreshHandler(new ImmediateRefreshHandler());
Page page = webClient.getPage("http://xn--archivoespaoldearte-53b.revistas.csic.es/index.php/aea/article/viewDownloadInterstitial/552/549");

Ahmed |
From: gaurab.pradhan <gau...@gm...> - 2013-12-26 15:04:00
|
Hi Ahmed,

Thanks for the reply. Yes, I have edited my log4j.properties, but I am still getting this error message.

org.apache.http.level = INFO
com.gargoylesoftware.htmlunit.level = WARNING
com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter.level = OFF
com.gargoylesoftware.htmlunit.javascript.host.ActiveXObject.level = OFF
com.gargoylesoftware.htmlunit.javascript.host.html.HTMLDocument.level = OFF
com.gargoylesoftware.htmlunit.html.HtmlScript.level = OFF
com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl.level = OFF
com.gargoylesoftware.htmlunit.WebClient.level = OFF
# define the root logger
log4j.rootLogger = INFO, console, file

With this, only a few of the warnings are turned off, but

ERROR com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter - runtimeError:

is still giving me problems.

--
View this message in context: http://htmlunit.10904.n7.nabble.com/How-to-turnoff-Log-message-tp32585p32589.html
Sent from the HtmlUnit - General mailing list archive at Nabble.com. |
From: Ahmed A. <asa...@ya...> - 2013-12-26 14:04:54
|
Hi,

How about:
- org.apache.log4j.Logger.getLogger()
- Edit log4j.properties

Ahmed |
From: gaurab.pradhan <gau...@gm...> - 2013-12-26 06:42:59
|
Hello everyone,

Can anyone tell me how to turn off the following error log?

ERROR com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter - runtimeError:

I am using:

LogFactory.getFactory().setAttribute("org.apache.commons.logging.Log", "org.apache.commons.logging.impl.NoOpLog");
java.util.logging.Logger.getLogger("com.gargoylesoftware.htmlunit").setLevel(Level.OFF);
java.util.logging.Logger.getLogger("org.apache.commons.httpclient").setLevel(Level.OFF);

Still, I am getting the error log.

Thanks

--
View this message in context: http://htmlunit.10904.n7.nabble.com/How-to-turnoff-Log-message-tp32585.html
Sent from the HtmlUnit - General mailing list archive at Nabble.com. |
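One possible reason the `java.util.logging` calls above have no lasting effect, offered as an assumption rather than a diagnosis: `java.util.logging` keeps its loggers only weakly reachable, so a level set on a logger that nothing else references can be lost once that logger is garbage collected. The sketch below holds strong references; the class name `HtmlUnitLogSilencer` is hypothetical. It only helps when commons-logging is routing to `java.util.logging`. The `ERROR` prefix in the message (where `java.util.logging` would print `SEVERE`) suggests log4j may be in use, in which case the level would have to be set in the log4j configuration instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

public final class HtmlUnitLogSilencer {

    // Strong references: java.util.logging itself only holds loggers weakly,
    // so without these the OFF level could disappear after a GC cycle.
    private static final List<Logger> SILENCED = new ArrayList<>();

    private HtmlUnitLogSilencer() {
    }

    /** Sets each named logger to OFF and keeps it referenced. */
    public static synchronized void silence(String... loggerNames) {
        for (String name : loggerNames) {
            Logger logger = Logger.getLogger(name);
            logger.setLevel(Level.OFF);
            SILENCED.add(logger);
        }
    }

    public static void main(String[] args) {
        silence("com.gargoylesoftware.htmlunit", "org.apache.commons.httpclient");
        // Child loggers inherit the effective level of the silenced parent.
        Logger child = Logger.getLogger(
                "com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter");
        System.out.println("SEVERE loggable: " + child.isLoggable(Level.SEVERE));
    }
}
```

Calling `silence(...)` once at startup, before the first `WebClient` is created, is the intended usage in this sketch.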
From: Ahmed A. <asa...@ya...> - 2013-12-23 10:04:07
|
Hi David,

There is no such document.

But it would be much better if you first isolated a minimal test case (e.g. very few lines of HTML/JavaScript without a JS library) that shows the specific method/attribute which is not properly handled by HtmlUnit. Making a test case that can be included in the SVN test suite would be helpful.

Once you have a case, others can quickly help you with where to dig into the code.

Generally, JavaScript-related code is in the com.gargoylesoftware.htmlunit.javascript.host package, and HTML-related code is in the com.gargoylesoftware.htmlunit.html package.

Ahmed |
From: David M. G. <mic...@gm...> - 2013-12-23 09:50:52
|
Hi all,

Lately I have been using HtmlUnit extensively. Sometimes I need help with certain issues and can't find it on Google. Normally when asking on this mailing list I get answers very quickly, but I want to start solving the problems on my own.

I tried to understand the HtmlUnit code through debugging, but did not succeed because there are a lot of callback handlers involved.

Is there a document describing the general design concepts of HtmlUnit's components? I did not find one on the HtmlUnit website.

Thanks,
David |
From: David M. G. <mic...@gm...> - 2013-12-23 09:25:03
|
Hi all,

I have the following URL:
http://xn--archivoespaoldearte-53b.revistas.csic.es/index.php/aea/article/view/552/549

In Firefox or IE8 the page refreshes and a PDF is downloaded. With HtmlUnit I tried the program below. When I request the top page, it returns an HTML page (a frameset) and not the PDF. Even when I request the download URL directly, the PDF is not downloaded.

package test;

import java.io.IOException;
import java.net.MalformedURLException;

import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.Page;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class ForceDownload {

    public static void main(String[] args) throws FailingHttpStatusCodeException, MalformedURLException, IOException {
        WebClient client = new WebClient();

        // First attempt: load the article view page.
        System.out.println("get to top page");
        final String topUrl = "http://xn--archivoespaoldearte-53b.revistas.csic.es/index.php/aea/article/view/552/549";
        final Page topPage = client.getPage(topUrl);
        if (topPage.isHtmlPage()) {
            System.out.println("topPage is htmlPage");
            System.out.println("source of top page is " + ((HtmlPage) topPage).asXml());
        }

        // Second attempt: request the download URL directly.
        System.out.println("get to download page directly");
        final String downloadUrl = "http://xn--archivoespaoldearte-53b.revistas.csic.es/index.php/aea/article/download/552/549";
        final Page page = client.getPage(downloadUrl);
        System.out.println(page.getWebResponse().getContentType());
    }
}

This is the output of the program:

get to top page
topPage is htmlPage
source of top page is <?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
    <title> Vasallo Toranzo </title>
    <link rel="stylesheet" href="http://xn--archivoespaoldearte-53b.revistas.csic.es/styles/common.css" type="text/css"/>
    <link rel="stylesheet" href="http://xn--archivoespaoldearte-53b.revistas.csic.es/styles/articleView.css" type="text/css"/>
    <link rel="icon" href="http://xn--archivoespaoldearte-53b.revistas.csic.es/favicon.ico" type="image/x-icon"/>
    <script type="text/javascript" src="http://xn--archivoespaoldearte-53b.revistas.csic.es/js/general.js"> </script>
    <!-- Add javascript required for font sizer -->
    <script type="text/javascript" src="http://xn--archivoespaoldearte-53b.revistas.csic.es/js/sizer.js"> </script>
    <!-- Add stylesheets for the font sizer -->
    <link rel="alternate stylesheet" title="Pequeña" href="http://xn--archivoespaoldearte-53b.revistas.csic.es/styles/fontSmall.css" type="text/css" disabled="disabled"/>
    <link rel="stylesheet" title="Mediana" href="http://xn--archivoespaoldearte-53b.revistas.csic.es/styles/fontMedium.css" type="text/css"/>
    <link rel="alternate stylesheet" title="Grande" href="http://xn--archivoespaoldearte-53b.revistas.csic.es/styles/fontLarge.css" type="text/css" disabled="disabled"/>
  </head>
  <frameset cols="220,*" style="border: 0;">
    <!-- cols="*,180"-->
    <frame src="http://xn--archivoespaoldearte-53b.revistas.csic.es/index.php/aea/article/viewRST/552/549" noresize="noresize" frameborder="0" scrolling="auto"/>
    <frame src="http://xn--archivoespaoldearte-53b.revistas.csic.es/index.php/aea/article/viewDownloadInterstitial/552/549" frameborder="0"/>
    <noframes>
      <body>
        <table width="100%">
          <tr>
            <td align="center"> Esta página usa marcos. <a href="http://xn--archivoespaoldearte-53b.revistas.csic.es/index.php/aea/article/viewDownloadInterstitial/552/549">Haga click aquí</a> para ir a la versión sin marcos. </td>
          </tr>
        </table>
      </body>
    </noframes>
  </frameset>
</html>
get to download page directly
application/force-download

How can I solve this? How can I tell HtmlUnit to download the file directly?

Thanks,
David |
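A possible direction, sketched under assumptions rather than confirmed for this site: the direct download URL already answers with `application/force-download`, so the body of that response is presumably the PDF itself and can be read from `page.getWebResponse().getContentAsStream()` and written to disk. The helper below uses only the JDK. The names `DownloadSaver`, `looksLikeDownload` and `save` are hypothetical, the content-type rule is a guess at what counts as "not a web page", and the HtmlUnit calls appear only in a comment.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Locale;

// Hypothetical helper, not part of HtmlUnit.
final class DownloadSaver {

    private DownloadSaver() {
    }

    /** Returns true when the content type does not look like an HTML/XHTML page. */
    static boolean looksLikeDownload(String contentType) {
        if (contentType == null) {
            return false;
        }
        String type = contentType.toLowerCase(Locale.ROOT);
        return !(type.startsWith("text/html") || type.startsWith("application/xhtml+xml"));
    }

    /** Copies the response body to the target file and returns the number of bytes written. */
    static long save(InputStream body, Path target) {
        try (InputStream in = body) {
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Assumed usage with HtmlUnit (untested against this site):
    //   Page page = client.getPage(downloadUrl);
    //   WebResponse response = page.getWebResponse();
    //   if (looksLikeDownload(response.getContentType())) {
    //       save(response.getContentAsStream(), Paths.get("article.pdf"));
    //   }
}
```

The check on the content type keeps the frameset page from the first attempt from being saved as if it were the article.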
From: David M. G. <mic...@gm...> - 2013-12-22 14:02:37
|
Hi all,

I wanted to see how much memory HtmlUnit uses, so I ran the following program:

String url = "http://www.redalyc.org/articulo.oa?id=231018652001";
WebClient client = new WebClient();
HtmlPage page = client.getPage(url);
List<HtmlAnchor> anchors = page.getAnchors();
for (HtmlAnchor anchor : anchors) {
    System.out.println(anchor.getTextContent());
}

When running it under jvisualvm I saw a peak of about 150 MB. This led me to the question of what the recommended heap size is when running HtmlUnit. This matters when running many HtmlUnit processes in parallel.

Thanks,
David |
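No recommended heap size is stated in this thread, so measuring per workload is probably the safer route. A minimal sketch of doing that from inside the JVM with `Runtime`, plus the arithmetic for parallel processes: `HeapProbe` and its method names are hypothetical, and the 150 MB figure is only the peak reported above for this one page, so other pages may need considerably more.

```java
// Hypothetical helper for rough heap measurements; JDK only.
final class HeapProbe {

    private static final long MB = 1024L * 1024L;

    private HeapProbe() {
    }

    /** Heap currently in use by this JVM, in megabytes. */
    static long usedHeapMb() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / MB;
    }

    /** Upper heap limit of this JVM (the -Xmx value, if one was set), in megabytes. */
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / MB;
    }

    /** How many processes with the given peak fit into a memory budget (rounded down). */
    static long processesThatFit(long budgetMb, long peakPerProcessMb) {
        if (peakPerProcessMb <= 0) {
            throw new IllegalArgumentException("peakPerProcessMb must be positive");
        }
        return budgetMb / peakPerProcessMb;
    }
}
```

For example, `HeapProbe.processesThatFit(4096, 150)` gives 27, which ignores any memory the JVM needs outside the heap. Printing `usedHeapMb()` before and after `client.getPage(url)` gives a second data point next to the jvisualvm graph.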
From: Gaurab P. <gau...@gm...> - 2013-12-20 04:12:17
|
Hello all,

I am new to HtmlUnit. I am using HtmlUnit 2.13 to crawl a web site. I can usually handle JavaScript-driven pages easily with HtmlUnit, but for the past couple of days I have been having a problem. I am currently trying to crawl some information from a web page that is built with AJAX, and I do not get the desired page after clicking an anchor tag. On the web page the link is embedded in the following way:

<a id="bidSearchForm:linksBidSearchResults:0:j_idt108" href="#" class="ui-commandlink" onclick="PrimeFaces.ab({source:'bidSearchForm:linksBidSearchResults:0:j_idt108'});return false;">

There are multiple links of this type, and I need to visit each of them. Here is my code:

WebClient client = new WebClient(BrowserVersion.CHROME); // I have tried with FIREFOX_17 as well
LogFactory.getFactory().setAttribute("org.apache.commons.logging.Log", "org.apache.commons.logging.impl.NoOpLog");
java.util.logging.Logger.getLogger("com.gargoylesoftware.htmlunit").setLevel(Level.OFF);
java.util.logging.Logger.getLogger("org.apache.commons.httpclient").setLevel(Level.OFF);
client.setAjaxController(new NicelyResynchronizingAjaxController());
client.waitForBackgroundJavaScript(10000);
client.waitForBackgroundJavaScriptStartingBefore(10000);
client.getOptions().setCssEnabled(true);
client.setCssErrorHandler(new SilentCssErrorHandler());
client.getOptions().setThrowExceptionOnFailingStatusCode(false);
client.getOptions().setThrowExceptionOnScriptError(false);
client.getOptions().setRedirectEnabled(true);
client.getOptions().setAppletEnabled(false);
client.getOptions().setJavaScriptEnabled(true);
client.getOptions().setPopupBlockerEnabled(true);
client.getOptions().setTimeout(5000);
client.getOptions().setPrintContentOnFailingStatusCode(false);
client.getOptions().setUseInsecureSSL(true);

HtmlPage homePage = client.getPage("SomeURL");
synchronized (homePage) {
    homePage.wait(5000);
}

List<HtmlAnchor> anchors = new ArrayList<HtmlAnchor>();
anchors = homePage.getAnchors();
HtmlPage tempPage = null;
for (int i = 0; i < anchors.size(); i++) {
    tempPage = anchors.get(i).click();
    String secondLink = homePage.getUrl().toString();
    System.out.println("LINK TO GET DOCUMENTS = " + secondLink);
    client.waitForBackgroundJavaScript(10000);
    client.waitForBackgroundJavaScriptStartingBefore(10000);
}

--
Best Regards,
Gaurab Pradhan |
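One detail in the loop above may be worth isolating: `getAnchors()` returns every anchor on the page, so the loop also clicks links that have nothing to do with the search results. A small sketch of narrowing the list by id first, using only the JDK. The id pattern is inferred from the single anchor quoted above, which is an assumption: the row index is taken to be the only varying part, and the generated `j_idt...` suffix may differ between deployments. `ResultLinkFilter` and its methods are hypothetical names; with HtmlUnit the ids would come from `anchor.getId()`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: selects the ids that look like search-result links.
final class ResultLinkFilter {

    // Assumed shape: bidSearchForm:linksBidSearchResults:<row>:j_idt<number>
    private static final Pattern RESULT_LINK_ID =
            Pattern.compile("bidSearchForm:linksBidSearchResults:\\d+:j_idt\\d+");

    private ResultLinkFilter() {
    }

    /** True when the id matches the assumed result-link pattern. */
    static boolean isResultLink(String id) {
        return id != null && RESULT_LINK_ID.matcher(id).matches();
    }

    /** Keeps only the ids that match, preserving their order. */
    static List<String> resultLinkIds(List<String> allIds) {
        List<String> matches = new ArrayList<String>();
        for (String id : allIds) {
            if (isResultLink(id)) {
                matches.add(id);
            }
        }
        return matches;
    }
}
```

Since the anchors use `href="#"` and `return false`, a click most likely updates the current page in place, so the page URL printed in the loop would not be expected to change either way.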
From: gaurab.pradhan <gau...@gm...> - 2013-12-20 04:06:24
|
Hello all,

I am new to HtmlUnit. I am using HtmlUnit 2.13 to crawl a web site. I can usually handle JavaScript-driven pages easily with HtmlUnit, but for the past couple of days I have been having a problem. I am currently trying to crawl some information from a web page that is built with AJAX, and I do not get the desired page after clicking an anchor tag. On the web page the link is embedded in the following way:

<a id="bidSearchForm:linksBidSearchResults:0:j_idt108" href="#" class="ui-commandlink" onclick="PrimeFaces.ab({source:'bidSearchForm:linksBidSearchResults:0:j_idt108'});return false;">

There are multiple links of this type, and I need to visit each of them. Here is my code:

WebClient client = new WebClient(BrowserVersion.CHROME); // I have tried with FIREFOX_17 as well
LogFactory.getFactory().setAttribute("org.apache.commons.logging.Log", "org.apache.commons.logging.impl.NoOpLog");
java.util.logging.Logger.getLogger("com.gargoylesoftware.htmlunit").setLevel(Level.OFF);
java.util.logging.Logger.getLogger("org.apache.commons.httpclient").setLevel(Level.OFF);
client.setAjaxController(new NicelyResynchronizingAjaxController());
client.waitForBackgroundJavaScript(10000);
client.waitForBackgroundJavaScriptStartingBefore(10000);
client.getOptions().setCssEnabled(true);
client.setCssErrorHandler(new SilentCssErrorHandler());
client.getOptions().setThrowExceptionOnFailingStatusCode(false);
client.getOptions().setThrowExceptionOnScriptError(false);
client.getOptions().setRedirectEnabled(true);
client.getOptions().setAppletEnabled(false);
client.getOptions().setJavaScriptEnabled(true);
client.getOptions().setPopupBlockerEnabled(true);
client.getOptions().setTimeout(5000);
client.getOptions().setPrintContentOnFailingStatusCode(false);
client.getOptions().setUseInsecureSSL(true);

HtmlPage homePage = client.getPage("SomeURL");
synchronized (homePage) {
    homePage.wait(5000);
}

List<HtmlAnchor> anchors = new ArrayList<HtmlAnchor>();
anchors = homePage.getAnchors();
HtmlPage tempPage = null;
for (int i = 0; i < anchors.size(); i++) {
    tempPage = anchors.get(i).click();
    String secondLink = homePage.getUrl().toString();
    System.out.println("LINK TO GET DOCUMENTS = " + secondLink);
    client.waitForBackgroundJavaScript(10000);
    client.waitForBackgroundJavaScriptStartingBefore(10000);
}

Please help me solve this problem.

Thanks,
Gaurab Pradhan

--
View this message in context: http://htmlunit.10904.n7.nabble.com/Problem-to-click-Anchor-Tag-Embedded-with-AJax-and-Javascript-tp32518.html
Sent from the HtmlUnit - General mailing list archive at Nabble.com. |
From: Tomas G. <t_g...@ho...> - 2013-12-19 13:40:16
|
Hi,

I want to run the onclick JavaScript function in the HTML shown below with Java and HtmlUnit, and get the information that the script produces.

<td width="13"><a class="img_attachfile16" alt="show files" title="show files" rel="Hide files" href="#" onclick="return (DocumentFiles.ShowHideFiles(12, 345234, 0, this))"></a></td>

When you click the link with the mouse, ShowHideFiles shows, in the table cell, the names of the files that can be downloaded and their URLs. I want to get this information with Java and HtmlUnit, but I can't get it to work. This is what I've tried:

webClient = new WebClient(BrowserVersion.FIREFOX_17);
// I don't know if the JavaScript function uses AJAX, but I use this because of tips from the internet.
webClient.setAjaxController(new NicelyResynchronizingAjaxController());
HtmlPage page = webClient.getPage(url);
List tables = page3.getByXPath("//table");
// Get the right table
HtmlTable table = (HtmlTable) tables.get(11);
List<HtmlTableCell> cells = tableRow.getCells();
// Get the right cell and its child element (where the onclick function is).
HtmlElement element = cells.get(10).getFirstElementChild();
String clickAttr = element.getOnClickAttribute();
ScriptResult scriptResult = page.executeJavaScript(clickAttr);
// Get new page
HtmlPage page2 = (HtmlPage) scriptResult.getNewPage();
webClient.waitForBackgroundJavaScript(1000);
System.out.println(page2.asXml());

The new page I get is the same as the original one, meaning that I don't see the extra information I want. I've also tried to execute the click from the HtmlElement like this:

// Same as before, but now get the HtmlAnchor object from the right cell
HtmlAnchor anchor = (HtmlAnchor) cells.get(10).getFirstElementChild();
page2 = (HtmlPage) anchor.click();
webClient.waitForBackgroundJavaScript(1000);
System.out.println(page2.asXml());

As with the previous attempt, the new page I get is the same as the original one, with no extra information.

Have I forgotten something? What else can I try? I'm not sure how and when to use waitForBackgroundJavaScript and setAjaxController. I would really appreciate suggestions on how to solve this.

Best regards
/Tomas |
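Because the anchor has `href="#"`, the click presumably changes the existing page in place instead of loading a new one, so getting "the same page" back is not by itself a sign of failure. What matters is whether the table cell has gained content once the script (and any request it starts) has finished. One way to check that, sketched with the JDK only, is to poll a condition with a timeout. `Poll` and `waitUntil` are hypothetical names, and whether `ShowHideFiles` really works asynchronously is not known from the thread.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper: re-checks a condition until it holds or the timeout expires.
final class Poll {

    private Poll() {
    }

    /**
     * Returns true as soon as the condition holds, or false if it still does not
     * hold after timeoutMs milliseconds (or if the thread is interrupted).
     */
    static boolean waitUntil(BooleanSupplier condition, long timeoutMs, long intervalMs) {
        final long start = System.nanoTime();
        final long timeoutNanos = timeoutMs * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - start >= timeoutNanos) {
                return false;
            }
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    // Assumed usage with HtmlUnit (the expected text is a placeholder):
    //   anchor.click();
    //   boolean shown = Poll.waitUntil(
    //           () -> cells.get(10).asText().contains("expected file name"), 5000, 200);
}
```

If the condition never becomes true, the script probably failed or was never run, and the HtmlUnit log output is the next place to look.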
From: Sebastian C. <seb...@ou...> - 2013-12-19 11:32:44
|
Hello,

Not at this time. I'm working on locating the problem. All I have is a heap dump, which I cannot share. I've tried to replicate it, but so far no luck. I will keep you informed if/when I find something.

In the meantime, would emptying windows_ in closeAllWindows() work, or would that cause other issues? I've built a version locally that does this (without the JavaScriptJobManager part posted earlier), tested it a bit, and it solves the heap problem I'm having. I'm not saying it should be added to the repo, because the problem may be on my side, and it would also be nice to locate the cause of the problem before changing anything, but if it doesn't have any negative consequences it would make my Christmas a bit easier :)

Cheers,
Sebastian

On 2013-12-19 11:40, Ahmed Ashour wrote:
> Hi Sebastian,
>
> The below test case works fine, even if there is a JS exception in frame1 for example.
>
> Can you provide more details? |
From: Ahmed A. <asa...@ya...> - 2013-12-19 10:40:15
|
Hi Sebastian,

The below test case works fine, even if there is a JS exception in frame1 for example.

Can you provide more details?

/**
 * @throws Exception if the test fails
 */
@Test
public void gc() throws Exception {
    // Frameset with two frames; frame 1 redirects frame 2 to 3.html via JavaScript.
    final String html = "<html><head><title>frames</title></head>\n"
        + "<frameset cols='180,*'>\n"
        + "<frame name='f1' src='1.html'/>\n"
        + "<frame name='f2' src='2.html'/>\n"
        + "</frameset>\n"
        + "</html>";
    final String frame1 = "<html><head><title>1</title></head>\n"
        + "<body>1"
        + "<script>\n"
        + " parent.frames['f2'].location.href = '3.html';\n"
        + "</script>\n"
        + "</body></html>";
    final String frame3 = "<html><head><title>page 3</title></head><body></body></html>";

    final WebClient webClient = getWebClient();
    final MockWebConnection conn = new MockWebConnection();
    webClient.setWebConnection(conn);
    conn.setDefaultResponse("<html><head><title>default</title></head><body></body></html>");
    conn.setResponse(URL_FIRST, html);
    conn.setResponse(new URL(URL_FIRST, "1.html"), frame1);
    conn.setResponse(new URL(URL_FIRST, "3.html"), frame3);

    // Top window plus two frame windows before closing; only one window afterwards.
    webClient.getPage(URL_FIRST);
    Assert.assertEquals(3, webClient.getWebWindows().size());
    webClient.closeAllWindows();
    Assert.assertEquals(1, webClient.getWebWindows().size());
}

Yours,
Ahmed

________________________________
From: Sebastian Cato <seb...@ou...>
To: htm...@li...
Sent: Thursday, December 19, 2013 12:14 PM
Subject: [Htmlunit-user] Lingering FrameWindow instances in WebClient

_______________________________________________
Htmlunit-user mailing list
Htm...@li...
https://lists.sourceforge.net/lists/listinfo/htmlunit-user |
From: Sebastian C. <seb...@ou...> - 2013-12-19 09:14:50
|
Greetings,

I have automatically generated test cases which fetch a page, run tests on that page, drop all references to that page, call closeAllWindows on the WebClient, fetch the next page, run tests on it depending on the outcome of the tests for the previous page, etc. The WebClient instance has a fairly long lifetime. JavaScript is enabled and some of the tested sites use a lot of it. I'm using HtmlUnit 2.13.

My problem is that the heap fills up after some time of doing this. The heap is set to a couple of GBs. Looking at the heap dump, I can see that the private instance variable windows_ in the WebClient object is full of FrameWindow objects. These objects in turn hold references to HtmlPage objects, and they will never be collected. There are also a lot of WeakReference objects in jobManagers_. This is less of an issue than the content of windows_, but it seems to be related.

I don't know if this is a bug or if I'm doing something wrong. A thought I had was to create a new WebClient object every N calls to getPage, but I would rather not do this. I tried adding the following code at the end of WebClient#closeAllWindows():

+        /* Clear all but the current WebWindow */
+        for(Iterator<WebWindow> it = windows_.iterator(); it.hasNext();) {
+            WebWindow window = it.next();
+            if (window != currentWindow_) {
+                it.remove();
+                fireWindowClosed(new WebWindowEvent(window,
+                        WebWindowEvent.CLOSE,
+                        window.getEnclosedPage(), null));
+            }
+        }
+
+        /* clear all the job managers */
+        for(Iterator<WeakReference<JavaScriptJobManager>> it =
+                jobManagers_.iterator(); it.hasNext();) {
+            JavaScriptJobManager manager = it.next().get();
+            if (manager != null) {
+                manager.shutdown();
+            }
+
+            it.remove();
+        }

My understanding of HtmlUnit is a bit limited, so I don't know if it's a good idea. It *seems* to work: heap usage is reduced and nothing crashes, except that sometimes JS seems to get stuck in a loop, which may or may not have something to do with shutting down the JavaScriptJobManagers.

I don't know if I'm doing something wrong that causes the windows_ variable to fill up. I'm calling closeAllWindows() between every test, but without the code above, windows_ fills up after a while. I also feel that I have too little to go on to file a bug report, should it be a bug. That's why I'm asking on the users list.

I'm sorry to say that I have not been able to reproduce this problem consistently because of the complexity of the tests I'm running, and I can't share any heap dumps or code. Sorry.

//Sebastian |
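The fallback mentioned above, creating a new WebClient every N calls to getPage, does not need a patched HtmlUnit. A minimal sketch of that idea as a small generic wrapper, JDK only: `Recycler`, `acquire` and the use count are hypothetical, and the commented usage assumes the 2.13 API named in this thread (`new WebClient()` and `closeAllWindows()`). It works around the growth of windows_ without explaining it.

```java
import java.util.function.Consumer;
import java.util.function.Supplier;

// Hypothetical wrapper: hands out one instance and replaces it after maxUses acquisitions.
final class Recycler<T> {

    private final Supplier<T> factory;
    private final Consumer<T> disposer;
    private final int maxUses;
    private T current;
    private int uses;

    Recycler(Supplier<T> factory, Consumer<T> disposer, int maxUses) {
        if (maxUses < 1) {
            throw new IllegalArgumentException("maxUses must be at least 1");
        }
        this.factory = factory;
        this.disposer = disposer;
        this.maxUses = maxUses;
    }

    /** Returns the current instance, first replacing it if it was already handed out maxUses times. */
    T acquire() {
        if (current == null || uses >= maxUses) {
            if (current != null) {
                disposer.accept(current); // let the old instance release its resources
            }
            current = factory.get();
            uses = 0;
        }
        uses++;
        return current;
    }

    // Assumed usage with HtmlUnit 2.13 (50 is an arbitrary number):
    //   Recycler<WebClient> clients =
    //           new Recycler<>(WebClient::new, WebClient::closeAllWindows, 50);
    //   HtmlPage page = clients.acquire().getPage(url);
}
```

Once nothing refers to the old WebClient any more, its windows_ list and the pages it holds become collectable together, at the cost of losing cookies and cache between recycles.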
From: Ahmed A. <asa...@ya...> - 2013-12-18 15:14:19
|
Hi Tobias,

- Please use the latest version if you aren't already.
- You need to provide complete details (hopefully small) so others can reproduce your issue.
- Please read http://htmlunit.sourceforge.net/submittingJSBugs.html

Ahmed

________________________________
From: Tobias Ceglarek <ma...@ce...>
To: htm...@li...
Sent: Wednesday, December 18, 2013 4:23 PM
Subject: [Htmlunit-user] Nested Frames with ASPX |
From: Tobias C. <ma...@ce...> - 2013-12-18 13:36:21
|
Hello,

this is my first contribution to the list. I am a hobbyist trying to scrape a site that gives me a table with some information of interest to me.

As you can see in the code below, I load a page, click some links and submit a form. This works very well. Then things get complicated: in a browser you see a button, which is created by a server-side script (aspx). Clicking the button executes another server-side script. This is how I get the link to the second server-side script:

link = page.getByXPath("//a[@onclick]")[0]

Now I have an HTML page. In this page there is an iframe with a nested server-side script (aspx). With

frame = page.getFrames().get(0)

and

page = frame.getEnclosedPage()

I successfully retrieve the HTML page of this first frame. In this HTML page there are again two nested server-side scripts. But if I try to retrieve that HTML page, I only get a JavaScriptPage with the content: "you cannot open directly."

It seems to me that this ugly nesting of frames and server-side scripts is there to prevent scraping. Can anybody help me?

Regards,
Tobias

Here is my code:

import com.gargoylesoftware.htmlunit.WebClient as WebClient
import com.gargoylesoftware.htmlunit.BrowserVersion as BrowserVersion

def main():
    webclient = WebClient(BrowserVersion.FIREFOX_17)
    url = "<URL>"
    page = webclient.getPage(url)
    print "new page loaded: "+url

    link = page.getByXPath("//a[@href='index.php?id=733']")[1]
    page = link.click()
    print "link clicked and new page loaded: "+page.getUrl().toString()

    # log in through the form
    form = page.getByXPath("//form[@action='index.php?id=intern']")[0]
    user = form.getInputByName("user")
    passw = form.getInputByName("pass")
    button = form.getInputByName("submit")
    user.setValueAttribute("cgl")
    passw.setValueAttribute("rattamahatta")
    page = button.click()
    print "form submitted and new page loaded: "+page.getUrl().toString()

    link = page.getByXPath("//a[@href='index.php?id=837']")[0]
    page = link.click()
    print "link clicked and new page loaded: "+page.getUrl().toString()

    link = page.getByXPath("//a[@onclick]")[0]
    page = link.click()
    print "button clicked and new page loaded: "+page.getUrl().toString()

    # first level: the iframe inside the page
    frame = page.getFrames().get(0)
    page = frame.getEnclosedPage()
    print "new iframe loaded: "+page.getUrl().toString()

    # second level: the frames nested inside that iframe
    frames = page.getFrames()
    page1 = frames.get(0).getEnclosedPage()
    print "new iframe loaded: "+page1.getUrl().toString()
    # break

if __name__ == '__main__':
    main() |
From: David M. G. <mic...@gm...> - 2013-12-17 11:35:27
|
Hi Ahmed,

This worked fine. I understand that I need to get the top page because window.open opens an unconnected page.

Thanks a lot for your prompt response,
David

On Tue, Dec 17, 2013 at 12:27 PM, <htm...@li...> wrote:

> Message: 1
> Date: Tue, 17 Dec 2013 02:27:11 -0800 (PST)
> From: Ahmed Ashour <asa...@ya...>
> Subject: Re: [Htmlunit-user] How to identify if page gets refreshed and how to wait for it
>
> Hi David,
>
> It seems you need to get the top page, something like:
>
>     public static void main(String[] args) throws FailingHttpStatusCodeException, MalformedURLException, IOException {
>         WebClient webClient = new WebClient();
>         webClient.setRefreshHandler(new ImmediateRefreshHandler());
>         webClient.getOptions().setThrowExceptionOnScriptError(false);
>         String url = "http://www.scielo.org.co/scielo.php?script=sci_arttext&pid=S0120-99572012000600002";
>         HtmlPage page = webClient.getPage(url);
>
>         List<HtmlAnchor> anchors = page.getAnchors();
>         for (HtmlAnchor anchor:anchors) {
>             String linkText = anchor.getTextContent();
>             if(linkText.contains("pdf")) {
>                 System.out.println("clicking on anchor:"+anchor);
>
>                 Page pdfPage = anchor.click();
>                 System.out.println("URL 1 " + pdfPage.getUrl());
>                 webClient.waitForBackgroundJavaScriptStartingBefore(10000);
>                 System.out.println("URL 2 " + pdfPage.getUrl());
>                 System.out.println("URL 3 " + webClient.getTopLevelWindows().get(0).getEnclosedPage().getUrl());
>                 if(pdfPage.isHtmlPage()) {
>                     HtmlPage p = (HtmlPage) pdfPage;
>                 }
>                 else {
>                     System.out.println("Page is pdf");
>                     System.out.println(pdfPage);
>                 }
>             }
>         }
>     }
>
> Yours,
> Ahmed
>
> ________________________________
> From: David Michael Gang <mic...@gm...>
> To: htm...@li...
> Sent: Tuesday, December 17, 2013 1:06 PM
> Subject: Re: [Htmlunit-user] How to identify if page gets refreshed and how to wait for it
>
> Hi,
>
> It seems that there is a more basic issue.
> In the page
> http://www.scielo.org.co/scielo.php?script=sci_arttext&pid=S0120-99572012000600002
> I have the pdf article link
>
> When i press on the link, i don't get the new page.
>
> Here is the code:
> package test;
>
> import java.io.IOException;
> import java.net.MalformedURLException;
> import java.util.List;
>
> import com.gargoylesoftware.htmlunit.BrowserVersion;
> import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
> import com.gargoylesoftware.htmlunit.ImmediateRefreshHandler;
> import com.gargoylesoftware.htmlunit.NiceRefreshHandler;
> import com.gargoylesoftware.htmlunit.Page;
> import com.gargoylesoftware.htmlunit.WebClient;
> import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
> import com.gargoylesoftware.htmlunit.html.HtmlPage;
> import com.gargoylesoftware.htmlunit.javascript.configuration.WebBrowser;
> import com.google.common.collect.ImmutableList;
>
> public class Test{
>
>     public static void main(String[] args) throws FailingHttpStatusCodeException, MalformedURLException, IOException {
>         WebClient webClient = new WebClient();
>         webClient.setRefreshHandler(new ImmediateRefreshHandler());
>         webClient.getOptions().setThrowExceptionOnScriptError(false);
>         List<String> urls = ImmutableList.of("http://www.scielo.org.co/scielo.php?script=sci_arttext&pid=S0120-99572012000600002");
>         for(String url:urls) {
>             HtmlPage page = webClient.getPage(url);
>
>             List<HtmlAnchor> anchors = page.getAnchors();
>             for (HtmlAnchor anchor:anchors) {
>                 String linkText = anchor.getTextContent();
>                 if(linkText.contains("pdf")) {
>                     System.out.println("clicking on anchor:"+anchor);
>                     Page pdfPage = anchor.click();
>                     webClient.waitForBackgroundJavaScriptStartingBefore(1000);
>                     if(pdfPage.isHtmlPage()) {
>                         HtmlPage p = (HtmlPage) pdfPage;
>
>                         System.out.println(p.asText());
>                     }
>                     else {
>                         System.out.println("Page is pdf");
>                         System.out.println(pdfPage);
>                     }
>                 }
>             }
>         }
>     }
>
> }
>
> Here is the output:
> 17/12/2013 12:04:40 com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify
> WARNING: Obsolete content type encountered: 'application/x-javascript'.
> 17/12/2013 12:04:40 > com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter runtimeError > SEVERE: runtimeError: message=[Unexpected call to method or property > access] sourceName=[ > http://www.scielo.org.co/applications/scielo-org/js/jquery-1.4.2.min.js] > line=[35] lineSource=[null] lineOffset=[0] > 17/12/2013 12:04:40 > com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter runtimeError > SEVERE: runtimeError: message=[The data necessary to complete this > operation is not yet available.] sourceName=[ > http://www.scielo.org.co/applications/scielo-org/js/jquery-1.4.2.min.js] > line=[16] lineSource=[null] lineOffset=[0] > 17/12/2013 12:04:41 org.apache.http.impl.client.DefaultHttpClient > tryExecute > INFO: I/O exception (org.apache.http.NoHttpResponseException) caught when > processing request: The target server failed to respond > 17/12/2013 12:04:41 org.apache.http.impl.client.DefaultHttpClient > tryExecute > INFO: Retrying request > 17/12/2013 12:04:42 > com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify > WARNING: Obsolete content type encountered: 'application/x-javascript'. > 17/12/2013 12:04:43 > com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify > WARNING: Obsolete content type encountered: 'application/x-javascript'. > 17/12/2013 12:04:43 > com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify > WARNING: Obsolete content type encountered: 'application/x-javascript'. > 17/12/2013 12:04:44 > com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify > WARNING: Obsolete content type encountered: 'application/x-javascript'. > 17/12/2013 12:04:44 > com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter runtimeError > SEVERE: runtimeError: message=[The data necessary to complete this > operation is not yet available.] 
sourceName=[ > http://s7.addthis.com/static/r07/core113.js] line=[2] lineSource=[null] > lineOffset=[0] > 17/12/2013 12:04:46 > com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify > WARNING: Obsolete content type encountered: 'text/javascript'. > 17/12/2013 12:04:47 com.gargoylesoftware.htmlunit.DefaultCssErrorHandler > error > WARNING: CSS error: 'http://s7.addthis.com/static/r07/widget118.css' > [1:5310] Error in style rule. (Invalid token "*". Was expecting one of: > <EOF>, <S>, <IDENT>, "}", ";".) > 17/12/2013 12:04:47 com.gargoylesoftware.htmlunit.DefaultCssErrorHandler > warning > WARNING: CSS warning: 'http://s7.addthis.com/static/r07/widget118.css' > [1:5310] Ignoring the following declarations in this rule. > 17/12/2013 12:04:47 com.gargoylesoftware.htmlunit.DefaultCssErrorHandler > error > WARNING: CSS error: 'http://s7.addthis.com/static/r07/widget118.css' > [1:5383] Error in style rule. (Invalid token "*". Was expecting one of: > <EOF>, <S>, <IDENT>, "}", ";".) > 17/12/2013 12:04:47 com.gargoylesoftware.htmlunit.DefaultCssErrorHandler > warning > WARNING: CSS warning: 'http://s7.addthis.com/static/r07/widget118.css' > [1:5383] Ignoring the following declarations in this rule. > 17/12/2013 12:04:47 com.gargoylesoftware.htmlunit.DefaultCssErrorHandler > error > WARNING: CSS error: 'http://s7.addthis.com/static/r07/widget118.css' > [1:62894] Error in expression. (Invalid token "#0d98fb". Was expecting one > of: <S>, <NUMBER>, <IDENT>, <STRING>, <PLUS>, <COMMA>, <EMS>, <EXS>, > <LENGTH_PX>, <LENGTH_CM>, <LENGTH_MM>, <LENGTH_IN>, <LENGTH_PT>, > <LENGTH_PC>, <ANGLE_DEG>, <ANGLE_RAD>, <ANGLE_GRAD>, <TIME_MS>, <TIME_S>, > <FREQ_HZ>, <FREQ_KHZ>, <PERCENTAGE>, <URI>, "-", "=", ")".) > 17/12/2013 12:04:47 com.gargoylesoftware.htmlunit.DefaultCssErrorHandler > error > WARNING: CSS error: 'http://s7.addthis.com/static/r07/widget118.css' > [1:62911] Error in style rule. (Invalid token "background-image". Was > expecting one of: <EOF>, "}", ";".) 
> 17/12/2013 12:04:47 com.gargoylesoftware.htmlunit.DefaultCssErrorHandler > warning > WARNING: CSS warning: 'http://s7.addthis.com/static/r07/widget118.css' > [1:62911] Ignoring the following declarations in this rule. > 17/12/2013 12:04:47 com.gargoylesoftware.htmlunit.DefaultCssErrorHandler > error > WARNING: CSS error: 'http://s7.addthis.com/static/r07/widget118.css' > [1:63424] Error in expression. (Invalid token "#0a85dd". Was expecting one > of: <S>, <NUMBER>, <IDENT>, <STRING>, <PLUS>, <COMMA>, <EMS>, <EXS>, > <LENGTH_PX>, <LENGTH_CM>, <LENGTH_MM>, <LENGTH_IN>, <LENGTH_PT>, > <LENGTH_PC>, <ANGLE_DEG>, <ANGLE_RAD>, <ANGLE_GRAD>, <TIME_MS>, <TIME_S>, > <FREQ_HZ>, <FREQ_KHZ>, <PERCENTAGE>, <URI>, "-", "=", ")".) > 17/12/2013 12:04:47 com.gargoylesoftware.htmlunit.DefaultCssErrorHandler > error > WARNING: CSS error: 'http://s7.addthis.com/static/r07/widget118.css' > [1:63441] Error in style rule. (Invalid token "background-image". Was > expecting one of: <EOF>, "}", ";".) > 17/12/2013 12:04:47 com.gargoylesoftware.htmlunit.DefaultCssErrorHandler > warning > WARNING: CSS warning: 'http://s7.addthis.com/static/r07/widget118.css' > [1:63441] Ignoring the following declarations in this rule. > 17/12/2013 12:04:47 com.gargoylesoftware.htmlunit.DefaultCssErrorHandler > error > WARNING: CSS error: 'http://s7.addthis.com/static/r07/widget118.css' > [1:83274] Error in @media rule. (Invalid token "and". Was expecting one of: > <S>, <LBRACE>, <COMMA>.) > 17/12/2013 12:04:47 com.gargoylesoftware.htmlunit.DefaultCssErrorHandler > warning > WARNING: CSS warning: 'http://s7.addthis.com/static/r07/widget118.css' > [1:83274] Ignoring the whole rule. > 17/12/2013 12:04:51 > com.gargoylesoftware.htmlunit.javascript.host.ActiveXObject jsConstructor > WARNING: Automation server can't create object for > 'ShockwaveFlash.ShockwaveFlash.7'. 
> 17/12/2013 12:04:51 > com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter runtimeError > SEVERE: runtimeError: message=[Automation server can't create object for > 'ShockwaveFlash.ShockwaveFlash.7'.] sourceName=[ > http://www.google-analytics.com/ga.js] line=[24] lineSource=[null] > lineOffset=[0] > 17/12/2013 12:04:51 > com.gargoylesoftware.htmlunit.javascript.host.ActiveXObject jsConstructor > WARNING: Automation server can't create object for > 'ShockwaveFlash.ShockwaveFlash.6'. > 17/12/2013 12:04:51 > com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter runtimeError > SEVERE: runtimeError: message=[Automation server can't create object for > 'ShockwaveFlash.ShockwaveFlash.6'.] sourceName=[ > http://www.google-analytics.com/ga.js] line=[24] lineSource=[null] > lineOffset=[0] > 17/12/2013 12:04:51 > com.gargoylesoftware.htmlunit.javascript.host.ActiveXObject jsConstructor > WARNING: Automation server can't create object for > 'ShockwaveFlash.ShockwaveFlash'. > 17/12/2013 12:04:51 > com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter runtimeError > SEVERE: runtimeError: message=[Automation server can't create object for > 'ShockwaveFlash.ShockwaveFlash'.] sourceName=[ > http://www.google-analytics.com/ga.js] line=[24] lineSource=[null] > lineOffset=[0] > 17/12/2013 12:04:53 > com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify > WARNING: Obsolete content type encountered: 'application/x-javascript'. > 17/12/2013 12:04:53 > com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter runtimeError > SEVERE: runtimeError: message=[An invalid or illegal selector was > specified (selector: '.box:eq(0)' error: Invalid selector: *.box:eq(0)).] 
> sourceName=[ > http://www.scielo.org.co/applications/scielo-org/js/jquery-1.4.2.min.js] > line=[91] lineSource=[null] lineOffset=[0] > 17/12/2013 12:04:53 > com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter runtimeError > SEVERE: runtimeError: message=[An invalid or illegal selector was > specified (selector: '.box:last' error: Invalid selector: *.box:last).] > sourceName=[ > http://www.scielo.org.co/applications/scielo-org/js/jquery-1.4.2.min.js] > line=[91] lineSource=[null] lineOffset=[0] > clicking on anchor:HtmlAnchor[<a href="javascript:%20void(0);%20" > onclick="setTimeout("window.open(' > http://www.scielo.org.co/scielo.php?script=sci_pdf&pid=S0120-99572012000600002&lng=en&nrm=iso&tlng=es','_self')", 3000);">] > Revista Colombiana de Gastroenterologia - I. Epidemiolog?a > ??? > ? > ???? ? > Services on Demand > Article > Article in pdf format > Article in xml format > Article references > How to cite this article > Automatic translation > Send this article by e-mail > Indicators > Related links > Bookmark > ... > > What do i wrong? > > Here is the log: > > > > ------------------------------------------------------------------------------ > Rapidly troubleshoot problems before they affect your business. Most IT > organizations don't have a clear picture of how application performance > affects their revenue. With AppDynamics, you get 100% visibility into your > Java,.NET, & PHP application. Start your 15-day FREE TRIAL of AppDynamics > Pro! > http://pubads.g.doubleclick.net/gampad/clk?id=84349831&iu=/4140/ostg.clktrk > _______________________________________________ > Htmlunit-user mailing list > Htm...@li... > https://lists.sourceforge.net/lists/listinfo/htmlunit-user > -------------- next part -------------- > An HTML attachment was scrubbed... > > ------------------------------ >
> End of Htmlunit-user Digest, Vol 91, Issue 11 > ********************************************* > |
From: Ahmed A. <asa...@ya...> - 2013-12-17 10:53:59
|
Hi,

>> The "getPage" does not do this job

It should! HtmlUnit automatically and recursively processes all JavaScript. Please read [1] and [2].

If you still want to see the .js responses and save them yourself, you can wrap the web connection:

    WebClient webClient = new WebClient();
    // The wrapper installs itself as the connection of webClient.
    new WebConnectionWrapper(webClient) {
        public WebResponse getResponse(WebRequest request) throws IOException {
            WebResponse response = super.getResponse(request);
            // URL and body of every response the client fetches, scripts included.
            String url = request.getUrl().toExternalForm();
            String content = response.getContentAsString();
            return response;
        }
    };
    webClient.getPage("http://www.google.com");
    // Give background JavaScript time to issue its own requests.
    Thread.sleep(10000);

Yours,
Ahmed

[1] http://htmlunit.sourceforge.net/faq.html#AJAXDoesNotWork
[2] http://htmlunit.sourceforge.net/submittingJSBugs.html

________________________________
From: 赵芮元 <zry...@16...>
To: htmlunit mailing list <htm...@li...>
Sent: Tuesday, December 17, 2013 1:43 PM
Subject: [Htmlunit-user] loading and extracting javascripts(.js files)

Hi,

When we call webClient.getPage("http://www.google.com"), for example, we get Google's HTML page. A lot of JavaScript code is embedded in it. Some of it is embedded directly in the HTML page, while some is referenced like this:

<script src="http://su.bdimg.com/static/superpage/js/sbase_3963a90b.js"></script>

which requires a further request to fetch the source code. The "getPage" does not do this job, but normal browsers fetch all the JavaScript files and execute them.

So my question is: how can I use HtmlUnit to extract all the JavaScript files of a web page such as "www.google.com" and save them to my file system?

Thanks a lot!

Yours,
Zhao Ruiyuan
|
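The wrapper above leaves the actual saving step open. As a minimal sketch of that step, assuming one flat file per script is wanted, the hypothetical helper below (ScriptFileNames is an illustration, not part of HtmlUnit) turns a request URL into a file-system-safe name and uses a simple path check to guess whether the response is a script. Checking the response's content type would probably be more reliable; the path check is only an assumption that keeps the example free of dependencies.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical helper, not an HtmlUnit class.
public class ScriptFileNames {

    /** Guesses from the URL path (query string ignored) whether the resource is a .js file. */
    public static boolean looksLikeScript(String url) {
        String path = URI.create(url).getPath();
        return path != null && path.toLowerCase(Locale.ROOT).endsWith(".js");
    }

    /** Builds a flat, file-system-safe name from the host and path of the URL. */
    public static String localNameFor(String url) {
        URI uri = URI.create(url);
        String host = uri.getHost() == null ? "" : uri.getHost();
        String path = uri.getPath() == null ? "" : uri.getPath();
        // Replace everything except letters, digits, dot, underscore and hyphen.
        return (host + path).replaceAll("[^A-Za-z0-9._-]", "_");
    }

    public static void main(String[] args) {
        String url = "http://su.bdimg.com/static/superpage/js/sbase_3963a90b.js";
        if (looksLikeScript(url)) {
            System.out.println(localNameFor(url));
        }
    }
}
```

Inside getResponse, the url and content variables could then be passed to something like java.nio.file.Files.write with the name returned by localNameFor, written into a directory of your choice.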
From: 赵芮元 <zry...@16...> - 2013-12-17 10:43:12
|
Hi,

When we call webClient.getPage("http://www.google.com"), for example, we get Google's HTML page. A lot of JavaScript code is embedded in it. Some of it is embedded directly in the HTML page, while some is referenced like this:

<script src="http://su.bdimg.com/static/superpage/js/sbase_3963a90b.js"></script>

which requires a further request to fetch the source code. The "getPage" does not do this job, but normal browsers fetch all the JavaScript files and execute them.

So my question is: how can I use HtmlUnit to extract all the JavaScript files of a web page such as "www.google.com" and save them to my file system?

Thanks a lot!

Yours,
Zhao Ruiyuan |
From: Ahmed A. <asa...@ya...> - 2013-12-17 10:27:20
|
Hi David,

It seems you need to get the top page after the click, something like:

    public static void main(String[] args)
            throws FailingHttpStatusCodeException, MalformedURLException, IOException {
        WebClient webClient = new WebClient();
        webClient.setRefreshHandler(new ImmediateRefreshHandler());
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        String url = "http://www.scielo.org.co/scielo.php?script=sci_arttext&pid=S0120-99572012000600002";
        HtmlPage page = webClient.getPage(url);
        List<HtmlAnchor> anchors = page.getAnchors();
        for (HtmlAnchor anchor : anchors) {
            String linkText = anchor.getTextContent();
            if (linkText.contains("pdf")) {
                System.out.println("clicking on anchor:" + anchor);
                Page pdfPage = anchor.click();
                System.out.println("URL 1 " + pdfPage.getUrl());
                // Block until background JavaScript scheduled to start within 10 s has finished.
                webClient.waitForBackgroundJavaScriptStartingBefore(10000);
                System.out.println("URL 2 " + pdfPage.getUrl());
                // Ask the top-level window which page it encloses now.
                System.out.println("URL 3 " + webClient.getTopLevelWindows().get(0).getEnclosedPage().getUrl());
                if (pdfPage.isHtmlPage()) {
                    HtmlPage p = (HtmlPage) pdfPage;
                } else {
                    System.out.println("Page is pdf");
                    System.out.println(pdfPage);
                }
            }
        }
    }

Yours,
Ahmed

________________________________
From: David Michael Gang <mic...@gm...>
To: htm...@li...
Sent: Tuesday, December 17, 2013 1:06 PM
Subject: Re: [Htmlunit-user] How to identify if page gets refreshed and how to wait for it

Hi,

It seems that there is a more basic issue. In the page http://www.scielo.org.co/scielo.php?script=sci_arttext&pid=S0120-99572012000600002 I have the pdf article link. When I press on the link, I don't get the new page.
|
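The anchor printed in this thread's output carries its real target inside the onclick handler, as the first argument of window.open wrapped in a setTimeout. As a rough sketch of how that target could be read without waiting for the timer, the hypothetical helper below pulls the URL out of such a handler string with a regular expression. It assumes the handler has the same shape as the one in the log (a single-quoted first argument); how the attribute string is obtained from the anchor is left out, and OnclickTarget is an illustrative name, not an HtmlUnit class.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not an HtmlUnit class.
public class OnclickTarget {

    // Captures the first single-quoted argument of window.open(...).
    private static final Pattern WINDOW_OPEN =
            Pattern.compile("window\\.open\\(\\s*'([^']*)'");

    /** Returns the URL passed to window.open, if the handler contains one. */
    public static Optional<String> targetUrl(String onclick) {
        Matcher m = WINDOW_OPEN.matcher(onclick);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    public static void main(String[] args) {
        String onclick = "setTimeout(\"window.open('http://www.scielo.org.co/scielo.php"
                + "?script=sci_pdf&pid=S0120-99572012000600002&lng=en&nrm=iso&tlng=es',"
                + "'_self')\", 3000);";
        System.out.println(targetUrl(onclick).orElse("no window.open target found"));
    }
}
```

The extracted URL could then be requested directly with webClient.getPage, which sidesteps the 3000 ms timer. The page behind it may still be the refresh page rather than the PDF itself, so this only removes one of the two waits.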
From: David M. G. <mic...@gm...> - 2013-12-17 10:06:54
|
Hi,

It seems that there is a more basic issue. In the page http://www.scielo.org.co/scielo.php?script=sci_arttext&pid=S0120-99572012000600002 I have the pdf article link. When I press on the link, I don't get the new page.

Here is the code:

    package test;

    import java.io.IOException;
    import java.net.MalformedURLException;
    import java.util.List;

    import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
    import com.gargoylesoftware.htmlunit.ImmediateRefreshHandler;
    import com.gargoylesoftware.htmlunit.Page;
    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;
    import com.google.common.collect.ImmutableList;

    public class Test {
        public static void main(String[] args)
                throws FailingHttpStatusCodeException, MalformedURLException, IOException {
            WebClient webClient = new WebClient();
            webClient.setRefreshHandler(new ImmediateRefreshHandler());
            webClient.getOptions().setThrowExceptionOnScriptError(false);
            List<String> urls = ImmutableList.of("http://www.scielo.org.co/scielo.php?script=sci_arttext&pid=S0120-99572012000600002");
            for (String url : urls) {
                HtmlPage page = webClient.getPage(url);
                List<HtmlAnchor> anchors = page.getAnchors();
                for (HtmlAnchor anchor : anchors) {
                    String linkText = anchor.getTextContent();
                    if (linkText.contains("pdf")) {
                        System.out.println("clicking on anchor:" + anchor);
                        Page pdfPage = anchor.click();
                        // Wait for background JavaScript scheduled to start within 1 second.
                        webClient.waitForBackgroundJavaScriptStartingBefore(1000);
                        if (pdfPage.isHtmlPage()) {
                            HtmlPage p = (HtmlPage) pdfPage;
                            System.out.println(p.asText());
                        } else {
                            System.out.println("Page is pdf");
                            System.out.println(pdfPage);
                        }
                    }
                }
            }
        }
    }

Here is the output:

17/12/2013 12:04:40 com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify WARNING: Obsolete content type
encountered: 'application/x-javascript'. 17/12/2013 12:04:40 com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter runtimeError SEVERE: runtimeError: message=[Unexpected call to method or property access] sourceName=[ http://www.scielo.org.co/applications/scielo-org/js/jquery-1.4.2.min.js] line=[35] lineSource=[null] lineOffset=[0] 17/12/2013 12:04:40 com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter runtimeError SEVERE: runtimeError: message=[The data necessary to complete this operation is not yet available.] sourceName=[ http://www.scielo.org.co/applications/scielo-org/js/jquery-1.4.2.min.js] line=[16] lineSource=[null] lineOffset=[0] 17/12/2013 12:04:41 org.apache.http.impl.client.DefaultHttpClient tryExecute INFO: I/O exception (org.apache.http.NoHttpResponseException) caught when processing request: The target server failed to respond 17/12/2013 12:04:41 org.apache.http.impl.client.DefaultHttpClient tryExecute INFO: Retrying request 17/12/2013 12:04:42 com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify WARNING: Obsolete content type encountered: 'application/x-javascript'. 17/12/2013 12:04:43 com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify WARNING: Obsolete content type encountered: 'application/x-javascript'. 17/12/2013 12:04:43 com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify WARNING: Obsolete content type encountered: 'application/x-javascript'. 17/12/2013 12:04:44 com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify WARNING: Obsolete content type encountered: 'application/x-javascript'. 17/12/2013 12:04:44 com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter runtimeError SEVERE: runtimeError: message=[The data necessary to complete this operation is not yet available.] 
sourceName=[ http://s7.addthis.com/static/r07/core113.js] line=[2] lineSource=[null] lineOffset=[0] 17/12/2013 12:04:46 com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify WARNING: Obsolete content type encountered: 'text/javascript'. 17/12/2013 12:04:47 com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error WARNING: CSS error: 'http://s7.addthis.com/static/r07/widget118.css' [1:5310] Error in style rule. (Invalid token "*". Was expecting one of: <EOF>, <S>, <IDENT>, "}", ";".) 17/12/2013 12:04:47 com.gargoylesoftware.htmlunit.DefaultCssErrorHandler warning WARNING: CSS warning: 'http://s7.addthis.com/static/r07/widget118.css' [1:5310] Ignoring the following declarations in this rule. 17/12/2013 12:04:47 com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error WARNING: CSS error: 'http://s7.addthis.com/static/r07/widget118.css' [1:5383] Error in style rule. (Invalid token "*". Was expecting one of: <EOF>, <S>, <IDENT>, "}", ";".) 17/12/2013 12:04:47 com.gargoylesoftware.htmlunit.DefaultCssErrorHandler warning WARNING: CSS warning: 'http://s7.addthis.com/static/r07/widget118.css' [1:5383] Ignoring the following declarations in this rule. 17/12/2013 12:04:47 com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error WARNING: CSS error: 'http://s7.addthis.com/static/r07/widget118.css' [1:62894] Error in expression. (Invalid token "#0d98fb". Was expecting one of: <S>, <NUMBER>, <IDENT>, <STRING>, <PLUS>, <COMMA>, <EMS>, <EXS>, <LENGTH_PX>, <LENGTH_CM>, <LENGTH_MM>, <LENGTH_IN>, <LENGTH_PT>, <LENGTH_PC>, <ANGLE_DEG>, <ANGLE_RAD>, <ANGLE_GRAD>, <TIME_MS>, <TIME_S>, <FREQ_HZ>, <FREQ_KHZ>, <PERCENTAGE>, <URI>, "-", "=", ")".) 17/12/2013 12:04:47 com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error WARNING: CSS error: 'http://s7.addthis.com/static/r07/widget118.css' [1:62911] Error in style rule. (Invalid token "background-image". Was expecting one of: <EOF>, "}", ";".) 
17/12/2013 12:04:47 com.gargoylesoftware.htmlunit.DefaultCssErrorHandler warning WARNING: CSS warning: 'http://s7.addthis.com/static/r07/widget118.css' [1:62911] Ignoring the following declarations in this rule. 17/12/2013 12:04:47 com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error WARNING: CSS error: 'http://s7.addthis.com/static/r07/widget118.css' [1:63424] Error in expression. (Invalid token "#0a85dd". Was expecting one of: <S>, <NUMBER>, <IDENT>, <STRING>, <PLUS>, <COMMA>, <EMS>, <EXS>, <LENGTH_PX>, <LENGTH_CM>, <LENGTH_MM>, <LENGTH_IN>, <LENGTH_PT>, <LENGTH_PC>, <ANGLE_DEG>, <ANGLE_RAD>, <ANGLE_GRAD>, <TIME_MS>, <TIME_S>, <FREQ_HZ>, <FREQ_KHZ>, <PERCENTAGE>, <URI>, "-", "=", ")".) 17/12/2013 12:04:47 com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error WARNING: CSS error: 'http://s7.addthis.com/static/r07/widget118.css' [1:63441] Error in style rule. (Invalid token "background-image". Was expecting one of: <EOF>, "}", ";".) 17/12/2013 12:04:47 com.gargoylesoftware.htmlunit.DefaultCssErrorHandler warning WARNING: CSS warning: 'http://s7.addthis.com/static/r07/widget118.css' [1:63441] Ignoring the following declarations in this rule. 17/12/2013 12:04:47 com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error WARNING: CSS error: 'http://s7.addthis.com/static/r07/widget118.css' [1:83274] Error in @media rule. (Invalid token "and". Was expecting one of: <S>, <LBRACE>, <COMMA>.) 17/12/2013 12:04:47 com.gargoylesoftware.htmlunit.DefaultCssErrorHandler warning WARNING: CSS warning: 'http://s7.addthis.com/static/r07/widget118.css' [1:83274] Ignoring the whole rule. 17/12/2013 12:04:51 com.gargoylesoftware.htmlunit.javascript.host.ActiveXObject jsConstructor WARNING: Automation server can't create object for 'ShockwaveFlash.ShockwaveFlash.7'. 17/12/2013 12:04:51 com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter runtimeError SEVERE: runtimeError: message=[Automation server can't create object for 'ShockwaveFlash.ShockwaveFlash.7'.] 
sourceName=[ http://www.google-analytics.com/ga.js] line=[24] lineSource=[null] lineOffset=[0] 17/12/2013 12:04:51 com.gargoylesoftware.htmlunit.javascript.host.ActiveXObject jsConstructor WARNING: Automation server can't create object for 'ShockwaveFlash.ShockwaveFlash.6'. 17/12/2013 12:04:51 com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter runtimeError SEVERE: runtimeError: message=[Automation server can't create object for 'ShockwaveFlash.ShockwaveFlash.6'.] sourceName=[ http://www.google-analytics.com/ga.js] line=[24] lineSource=[null] lineOffset=[0] 17/12/2013 12:04:51 com.gargoylesoftware.htmlunit.javascript.host.ActiveXObject jsConstructor WARNING: Automation server can't create object for 'ShockwaveFlash.ShockwaveFlash'. 17/12/2013 12:04:51 com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter runtimeError SEVERE: runtimeError: message=[Automation server can't create object for 'ShockwaveFlash.ShockwaveFlash'.] sourceName=[ http://www.google-analytics.com/ga.js] line=[24] lineSource=[null] lineOffset=[0] 17/12/2013 12:04:53 com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify WARNING: Obsolete content type encountered: 'application/x-javascript'. 17/12/2013 12:04:53 com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter runtimeError SEVERE: runtimeError: message=[An invalid or illegal selector was specified (selector: '.box:eq(0)' error: Invalid selector: *.box:eq(0)).] sourceName=[ http://www.scielo.org.co/applications/scielo-org/js/jquery-1.4.2.min.js] line=[91] lineSource=[null] lineOffset=[0] 17/12/2013 12:04:53 com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter runtimeError SEVERE: runtimeError: message=[An invalid or illegal selector was specified (selector: '.box:last' error: Invalid selector: *.box:last).] 
sourceName=[http://www.scielo.org.co/applications/scielo-org/js/jquery-1.4.2.min.js] line=[91] lineSource=[null] lineOffset=[0]
clicking on anchor:HtmlAnchor[<a href="javascript:%20void(0);%20" onclick="setTimeout("window.open('http://www.scielo.org.co/scielo.php?script=sci_pdf&pid=S0120-99572012000600002&lng=en&nrm=iso&tlng=es','_self')", 3000);">]
Revista Colombiana de Gastroenterologia - I. Epidemiolog?a
Services on Demand
Article
Article in pdf format
Article in xml format
Article references
How to cite this article
Automatic translation
Send this article by e-mail
Indicators
Related links
Bookmark
...

What am I doing wrong?

Here is the log: |
From: Ahmed A. <asa...@ya...> - 2013-12-17 07:38:00
|
Hi David,

You can set a refresh handler with webClient.setRefreshHandler(), for example an ImmediateRefreshHandler:
http://htmlunit.sourceforge.net/apidocs/com/gargoylesoftware/htmlunit/ImmediateRefreshHandler.html

Ahmed

________________________________
From: David Michael Gang <mic...@gm...>
To: htm...@li...
Sent: Tuesday, December 17, 2013 10:28 AM
Subject: [Htmlunit-user] How to identify if page gets refreshed and how to wait for it

Hi all,

For the following link:
http://www.scielo.org.co/scielo.php?script=sci_arttext&pid=S0120-99572012000600002
I want to get the address of the pdf. When I press the pdf link with the text "Article in pdf format", I get to a refresh page. How do I know that I am now on a refresh page, and how many seconds should I wait? Alternatively, how can I get the pdf link without waiting?

I tried to find a solution on my own. I wanted to parse the source of the redirect page and get the following tag (which you get with view source):

<meta name="added" content="7;URL=http://www.scielo.org.co/pdf/rcg/v27s2/v27s2a02.pdf" http-equiv="refresh">

When I try to do the same with HtmlUnit:

    public static void main(String[] args)
            throws FailingHttpStatusCodeException, MalformedURLException, IOException {
        WebClient webClient = new WebClient();
        List<String> urls = ImmutableList.of("http://www.scielo.org.co/scielo.php?script=sci_arttext&pid=S0120-99572012000600002");
        for (String url : urls) {
            HtmlPage page = webClient.getPage(url);
            List<HtmlAnchor> anchors = page.getAnchors();
            for (HtmlAnchor anchor : anchors) {
                String linkText = anchor.getTextContent();
                if (linkText.contains("pdf")) {
                    Page pdfPage = anchor.click();
                    if (pdfPage.isHtmlPage()) {
                        HtmlPage p = (HtmlPage) pdfPage;
                        System.out.println(p.asXml());
                    }
                }
            }
        }
    }

I don't get this tag in the resulting XML.

In summary, I have the following questions:

1. How do I know that I am now on a refresh page, and how many seconds should I wait?
2. After identifying that I am on a refresh page, how do I wait and refresh after this number of seconds?
3. How can I get the pdf link without waiting?
4. Where did the <meta name="added" content="7;URL=http://www.scielo.org.co/pdf/rcg/v27s2/v27s2a02.pdf" http-equiv="refresh"> tag disappear to?

Thanks,
David
|
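The meta tag quoted in the question already answers two of the points in it: the content attribute holds the delay in seconds, then a separator, then URL= and the target. As a small sketch, assuming the content string has been obtained somehow (for example from the raw response text, which is not shown here), the hypothetical MetaRefresh class below splits it into its two parts. The class name and the tolerated variations (comma separator, optional quotes, missing URL) are assumptions for illustration, not HtmlUnit API.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical value class, not an HtmlUnit class.
public class MetaRefresh {

    // <seconds>, then optionally ";" or "," followed by an optional "URL=" and the target.
    private static final Pattern CONTENT = Pattern.compile(
            "^\\s*(\\d{1,9})\\s*(?:[;,]\\s*(?:url\\s*=\\s*)?['\"]?([^'\"]*?)['\"]?\\s*)?$",
            Pattern.CASE_INSENSITIVE);

    public final int delaySeconds;
    public final String url; // null when the content holds only a delay

    private MetaRefresh(int delaySeconds, String url) {
        this.delaySeconds = delaySeconds;
        this.url = url;
    }

    /** Parses the content attribute of a refresh meta tag; empty if it does not fit the pattern. */
    public static Optional<MetaRefresh> parse(String content) {
        if (content == null) {
            return Optional.empty();
        }
        Matcher m = CONTENT.matcher(content);
        if (!m.matches()) {
            return Optional.empty();
        }
        String target = m.group(2);
        if (target != null && target.isEmpty()) {
            target = null;
        }
        return Optional.of(new MetaRefresh(Integer.parseInt(m.group(1)), target));
    }

    public static void main(String[] args) {
        MetaRefresh r = parse("7;URL=http://www.scielo.org.co/pdf/rcg/v27s2/v27s2a02.pdf").get();
        System.out.println("wait " + r.delaySeconds + " s, then load " + r.url);
    }
}
```

With the target in hand there is no need to wait at all: the URL can be requested directly instead of letting the refresh fire.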