From: Xue-Feng Y. <xy...@no...> - 2017-07-13 15:36:46
I made more experiments on the issue. I added the following:

    webClient.getOptions().setUseInsecureSSL(true);
    webClient.getCookieManager().setCookiesEnabled(true);
    webClient.setAjaxController(new NicelyResynchronizingAjaxController());

    JavaScriptJobManager manager = htmlPage.getEnclosingWindow().getJobManager();
    int count = 0;
    while (manager.getJobCount() > 0) {
        System.out.println(count + "@" + manager.getJobCount());
        webClient.waitForBackgroundJavaScript(10000);
        count++;
    }

Then I went to sleep. It has been running for a few hours. The job count went from 20 down to 3 and has stayed at 3.

Any thoughts?

Thanks

On Wed, Jul 12, 2017 at 10:56 PM, Xue-Feng Yang <no...@gm...> wrote:
> Hi, I used htmlunit for getting some other web pages. It works great.
>
> However, when I tried https://weather.com/weather/monthly/l/27560:4:US ,
> I got something not correct.
>
> Here is a summary of my system:
>
> OS: win 10
> Java: jdk1.8.0_131
> htmlunit: htmlunit-2.27-bin
>
> Attached are three pictures.
>
> eclipse-debug gives the result htmlunit got. The main code is as follows:
>
>     webClient = new WebClient(BrowserVersion.FIREFOX_45);
>     webClient.getOptions().setTimeout(600 * 1000);
>     webClient.waitForBackgroundJavaScript(600 * 1000);
>     webClient.getOptions().setRedirectEnabled(true);
>     webClient.getOptions().setJavaScriptEnabled(true);
>     webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
>     webClient.getOptions().setThrowExceptionOnScriptError(false);
>     webClient.getOptions().setCssEnabled(false);
>
>     htmlPage = webClient.getPage(_url);
>     page = htmlPage.asXml();
>
> view-source is the source page from Firefox.
>
> inspector is the debug tree from Firefox's debugger.
>
> It shows that only the Firefox debugger has the right html tree.
>
> My question is how to get the html tree by use of htmlunit?
>
> Thanks,
>
> Xuefeng
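Editorial sketch: the loop above never exits when the job count settles above zero, which is what a repeating timer would cause. One possible way to bound it is to stop once the count has stopped changing or a round budget is used up. The class `BoundedJobWait` and its parameters are hypothetical and not part of HtmlUnit; in real use the supplier would presumably wrap `manager.getJobCount()` and the wait step would call `webClient.waitForBackgroundJavaScript(10000)`. The readings in `main` are simulated.

```java
import java.util.function.IntSupplier;

// Hypothetical helper, not part of HtmlUnit: polls a job counter until it
// reaches zero, stops changing, or the round budget is used up.
final class BoundedJobWait {

    /**
     * @param jobCount      reports the number of pending jobs
     * @param waitStep      blocks for one polling interval
     * @param maxRounds     hard upper bound on polling rounds
     * @param stalledRounds consecutive unchanged readings tolerated before giving up
     * @return the last job count observed
     */
    static int waitForJobs(IntSupplier jobCount, Runnable waitStep,
                           int maxRounds, int stalledRounds) {
        int previous = jobCount.getAsInt();
        int unchanged = 0;
        for (int round = 0; round < maxRounds && previous > 0; round++) {
            waitStep.run();
            int current = jobCount.getAsInt();
            unchanged = (current == previous) ? unchanged + 1 : 0;
            previous = current;
            if (unchanged >= stalledRounds) {
                break; // the count has plateaued, for example on a repeating timer
            }
        }
        return previous;
    }

    public static void main(String[] args) {
        // Simulated readings standing in for manager.getJobCount().
        int[] readings = {20, 12, 5, 3, 3, 3, 3, 3};
        int[] index = {0};
        int last = waitForJobs(
                () -> readings[Math.min(index[0]++, readings.length - 1)],
                () -> { /* real code would wait for background JavaScript here */ },
                100, 3);
        System.out.println("stopped with " + last + " job(s) still pending");
    }
}
```

A plateau only says that nothing more is likely to finish; whether the page content is complete at that point still has to be checked separately.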
From: Albu G. <alb...@gm...> - 2017-07-13 16:48:30
You are really testing my memory, man...

My idea is that there are some timers set in the page (auto refresh, update, and so on), as explained here:
http://www.webdeveloper.com/forum/showthread.php?233448-Is-there-a-way-to-find-if-any-intervals-are-still-open

    "You cannot reliably tell if there are any unnamed intervals running, but you can shut down any that are open."

In my previous answer you can see a call to a method called attendPourJavascriptSaufTimers, for example in:

    // add a fake submit button to be able to submit the form (I translated from French)
    loginForm.appendChild(fauxBouton);
    pageEnCours = fauxBouton.click();
    // Original call, but I got trouble with it:
    // webClient.waitForBackgroundJavaScript(AttentePourJavascript.CINQ_SECONDES.getTempo());
    // Waits for 5 seconds, but can return earlier if nothing is running
    webClient.attendPourJavascriptSaufTimers(pageEnCours, AttentePourJavascript.CINQ_SECONDES.getTempo());
    print.save(NomsFichiersPagesSauvegardees.APRES_LOGGING.getUrl(), pageEnCours.asXml(), original);

What this method does:

    public int attendPourJavascriptSaufTimers(HtmlPage page, long tempo) {
        // The scripts are described in an enumeration
        String texteDuScript = ScriptAExecuter.ANNULE_LES_TIMERS.getScript();
        Object result = page.executeJavaScript(texteDuScript).getJavaScriptResult();
        int retour = this.waitForBackgroundJavaScript(tempo);
        return retour;
    }

The script executed (ANNULE_LES_TIMERS) is the following:

    var limit = 10;
    var np, n = setInterval(function(){}, 100000);
    np = Math.max(0, n - limit);
    while (n > np) {
        clearInterval(n--);
    }

If I wrote all this stuff, it was because I was running into problems like yours, not getting all the page content I should, so my advice is to follow my track a little, even if I don't remember all the details.
I think you can also check whether intervals are set by the website you are scraping, using the DevTools console of your browser.
I remember doing these back-and-forth sessions between DevTools and htmlunit; you really have to understand completely what's running on the site if you want to mimic it.
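Editorial sketch: to make the effect of the timer-clearing script easier to see, here is a toy Java model of a page's interval table. It assumes interval IDs are handed out as consecutive integers starting at 1, which is typical but not guaranteed; `TimerSweepDemo` and its methods are hypothetical names used only for illustration. The `sweep` method mirrors the script step by step: register a sentinel interval to learn the newest ID, then clear downward for `limit` IDs.

```java
import java.util.TreeSet;

// Toy model of a page's interval table, assuming IDs are consecutive
// integers starting at 1 (typical, but not guaranteed).
final class TimerSweepDemo {
    private final TreeSet<Integer> active = new TreeSet<>();
    private int nextId = 1;

    /** Registers a new interval and returns its ID. */
    int setInterval() {
        int id = nextId++;
        active.add(id);
        return id;
    }

    /** Cancels an interval; unknown IDs are silently ignored. */
    void clearInterval(int id) {
        active.remove(id);
    }

    /** Mirrors the script: add a sentinel, then clear the newest `limit` IDs. */
    void sweep(int limit) {
        int n = setInterval();
        int np = Math.max(0, n - limit);
        while (n > np) {
            clearInterval(n--);
        }
    }

    /** Returns a copy of the IDs that are still registered. */
    TreeSet<Integer> activeIds() {
        return new TreeSet<>(active);
    }

    public static void main(String[] args) {
        TimerSweepDemo page = new TimerSweepDemo();
        for (int i = 0; i < 20; i++) {
            page.setInterval(); // the page registers IDs 1..20
        }
        page.sweep(10);
        System.out.println("still active: " + page.activeIds());
    }
}
```

In this model, a sweep with a limit of 10 removes the sentinel plus the nine newest page intervals, so with 20 registered timers the IDs 1 to 11 survive. If the same holds on a real page, a limit of 10 may leave older timers running, and a larger limit would be needed to clear them all.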
From: Xue-Feng Y. <xy...@no...> - 2017-07-13 17:44:18
Thanks. It's a somewhat complicated solution, since I need to load a few hundred remote pages. I'll try it later if my current method doesn't work.
From: Albu G. <alb...@gm...> - 2017-07-13 17:48:10
I don't understand what you mean by "load a few hundred remote pages". htmlunit is used to interact with pages; it's a silent browser. Are you interacting with hundreds of pages?
From: Xue-Feng Y. <xy...@no...> - 2017-07-13 17:58:51
Yes, I use it to download data with different parameters.

Thanks.
From: Xue-Feng Y. <xy...@no...> - 2017-07-14 02:36:46
I have made more progress now. After testing, BrowserVersion Chrome and FireFox52 (or the best supported) can reduce the job count to 2 within 5 minutes, and the data I need is then in webClient. The others (FireFox45, Edge, IE) can't reduce the job count to 2 within 5 minutes, and I can't see the data in webClient while the job count is still 3 or more.

It's worth pointing out that all the real browsers I tested show the data in less than one minute.

Any more suggestions?
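Editorial sketch: since the data is reported to be present while two jobs are still pending, a job count of zero may not be the right exit condition. An alternative is to poll for the content itself with a deadline. The class `WaitForContent` below is hypothetical and uses only the standard library; in real use the readiness check might test whether `htmlPage.asXml()` contains some expected marker (which marker is an assumption left to the reader), and the wait step might call `webClient.waitForBackgroundJavaScript(...)`.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper: waits until a readiness check passes or a deadline expires.
final class WaitForContent {

    /**
     * @param ready         returns true once the wanted content is present
     * @param waitStep      blocks for one polling interval
     * @param timeoutMillis overall time budget in milliseconds
     * @return true if the content appeared in time, false on timeout
     */
    static boolean waitUntil(BooleanSupplier ready, Runnable waitStep, long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (!ready.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // budget used up before the content appeared
            }
            waitStep.run();
        }
        return true;
    }

    /** Sleeps without forcing callers to handle InterruptedException. */
    static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        // Simulated page: the content "appears" on the fourth check.
        int[] checks = {0};
        boolean found = waitUntil(() -> ++checks[0] >= 4, () -> sleepQuietly(10), 5_000);
        System.out.println("content found: " + found + " after " + checks[0] + " checks");
    }
}
```

Polling for the content rather than for an idle job queue keeps the wait short on pages whose timers never finish, at the cost of having to know what to look for on each page.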
From: Albu G. <alb...@gm...> - 2017-07-14 07:10:06
Which data are you extracting, exactly? Can you tell me? It could help me see whether there is a workaround.
http://sdm.link/slashdot > _______________________________________________ Htmlunit-user > mailing list Htm...@li... > <mailto:Htm...@li...> > https://lists.sourceforge.net/lists/listinfo/htmlunit-user > <https://lists.sourceforge.net/lists/listinfo/htmlunit-user> > > ------------------------------------------------------------------------------ > Check out the vibrant tech community on one of the world's most > engaging tech sites, Slashdot.org! http://sdm.link/slashdot > > _______________________________________________ > Htmlunit-user mailing list > Htm...@li... > https://lists.sourceforge.net/lists/listinfo/htmlunit-user |
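The polling loop quoted in this thread waits until the job count reaches 0, which may never happen if the page keeps repeating timers alive. One possible way around that is to give the loop a deadline and a tolerated number of leftover jobs. The sketch below is plain JDK code, not HtmlUnit API: BoundedJobWait, waitForJobs, and the threshold and timeout values are hypothetical. With HtmlUnit one would presumably pass manager::getJobCount as the counter and () -> webClient.waitForBackgroundJavaScript(10000) as the wait step, both taken from the code quoted above.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.IntSupplier;

// Hypothetical helper, not part of HtmlUnit: a polling loop with a deadline.
final class BoundedJobWait {

    /**
     * Runs waitStep repeatedly until jobCount drops to maxRemaining or the
     * timeout elapses. Returns true if the threshold was reached in time.
     */
    static boolean waitForJobs(IntSupplier jobCount, Runnable waitStep,
                               int maxRemaining, long timeoutMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (jobCount.getAsInt() > maxRemaining) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // give up instead of looping for hours
            }
            waitStep.run();
        }
        return true;
    }

    public static void main(String[] args) {
        // Simulated job manager: 20 jobs, 3 of which stand in for repeating
        // timers that never finish (assumed numbers, mirroring the thread).
        int[] jobs = {20};
        IntSupplier count = () -> jobs[0];
        Runnable step = () -> { if (jobs[0] > 3) jobs[0]--; };

        // Tolerating 3 leftover jobs: the loop ends as soon as only they remain.
        System.out.println(waitForJobs(count, step, 3, 1_000));
        // Insisting on 0 jobs: the loop stops at the deadline and reports failure.
        System.out.println(waitForJobs(count, step, 0, 50));
    }
}
```

Whether 2 or 3 leftover jobs is the right threshold depends on the site and the BrowserVersion used, so the page content should still be checked after the wait returns.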