From: Albu G. <alb...@gm...> - 2017-07-13 17:48:10
I don't understand what you mean by "load a few hundred remote pages".
HtmlUnit is used to interact with pages; it's a headless browser.
Do you interact with hundreds of pages?

On 13/07/2017 at 19:44, Xue-Feng Yang wrote:
> Thanks. It's a somewhat complicated solution, since I need to load a
> few hundred remote pages. I'll try this later if my current method
> doesn't work.
>
> On Thu, Jul 13, 2017 at 12:48 PM, Albu Gmail <alb...@gm...> wrote:
>
> You are really testing my memory, man...
>
> My idea is that there are some timers set in the page (auto
> refresh, update, and so on), and as explained here:
> http://www.webdeveloper.com/forum/showthread.php?233448-Is-there-a-way-to-find-if-any-intervals-are-still-open
>
> *You cannot reliably tell if there are any unnamed intervals
> running, but you can shut down any that are open.*
>
> In my previous answer you can see a call to a method named
> attendPourJavascriptSaufTimers ("wait for JavaScript, except
> timers"), for example in:
>
>     // add a fake submit button to be able to submit the form
>     // (comment translated from French)
>     loginForm.appendChild(fauxBouton);
>     pageEnCours = fauxBouton.click();
>     // Original call, but I ran into trouble with it:
>     // webClient.waitForBackgroundJavaScript(AttentePourJavascript.CINQ_SECONDES.getTempo());
>     // so I call this instead. It waits up to 5 seconds, but can
>     // return earlier if nothing is running.
>     webClient.attendPourJavascriptSaufTimers(pageEnCours,
>             AttentePourJavascript.CINQ_SECONDES.getTempo());
>
>     print.save(NomsFichiersPagesSauvegardees.APRES_LOGGING.getUrl(),
>             pageEnCours.asXml(), original);
>
> What this method does:
>
>     public int attendPourJavascriptSaufTimers(HtmlPage page, long tempo) {
>         // The scripts are described in an enumeration
>         String texteDuScript = ScriptAExecuter.ANNULE_LES_TIMERS.getScript();
>         Object result = page.executeJavaScript(texteDuScript).getJavaScriptResult();
>         int retour = this.waitForBackgroundJavaScript(tempo);
>         return retour;
>     }
>
> The script executed (ANNULE_LES_TIMERS, "cancel the timers") is the
> following:
>
>     limit = 10;
>     var np, n = setInterval(function(){}, 100000);
>     np = Math.max(0, n - limit);
>     while (n > np) {
>         clearInterval(n--);
>     }
>
> If I wrote all this stuff, it was because I was running into
> problems like yours, not getting all the page content I should. So
> my advice is to follow my track a little, even if I don't remember
> all the details.
>
> I also think you can check whether any intervals are set on the
> website you are scraping, using the DevTools console of your
> browser.
>
> I remember going back and forth between DevTools and HtmlUnit; you
> really have to understand completely what's running on the site if
> you want to mimic it.
>
> On 13/07/2017 at 17:36, Xue-Feng Yang wrote:
>> I ran more experiments on the issue. I added the following:
>>
>>     webClient.getOptions().setUseInsecureSSL(true);
>>     webClient.getCookieManager().setCookiesEnabled(true);
>>     webClient.setAjaxController(new NicelyResynchronizingAjaxController());
>>
>>     JavaScriptJobManager manager =
>>             htmlPage.getEnclosingWindow().getJobManager();
>>     int count = 0;
>>     while (manager.getJobCount() > 0) {
>>         System.out.println(count + "@" + manager.getJobCount());
>>         webClient.waitForBackgroundJavaScript(10000);
>>         count++;
>>     }
>>
>> Then I went to sleep. It has been running for a few hours. The job
>> count went from 20 to 3 and has stayed at 3.
>>
>> Any thoughts?
>>
>> Thanks
>>
>> On Wed, Jul 12, 2017 at 10:56 PM, Xue-Feng Yang <no...@gm...> wrote:
>>
>> Hi, I used htmlunit for getting some other web pages. It works
>> great.
>>
>> However, when I tried
>> https://weather.com/weather/monthly/l/27560:4:US, I got an
>> incorrect result.
>>
>> Here is a summary of my system:
>>
>> OS: win 10
>> Java: jdk1.8.0_131
>> htmlunit: htmlunit-2.27-bin
>>
>> Attached are three pictures.
>>
>> eclipse-debug shows the result htmlunit got. The main code is as
>> follows:
>>
>>     webClient = new WebClient(BrowserVersion.FIREFOX_45);
>>     webClient.getOptions().setTimeout(600 * 1000);
>>     webClient.waitForBackgroundJavaScript(600 * 1000);
>>     webClient.getOptions().setRedirectEnabled(true);
>>     webClient.getOptions().setJavaScriptEnabled(true);
>>     webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
>>     webClient.getOptions().setThrowExceptionOnScriptError(false);
>>     webClient.getOptions().setCssEnabled(false);
>>
>>     htmlPage = webClient.getPage(_url);
>>     page = htmlPage.asXml();
>>
>> view-source is the page source from Firefox.
>>
>> inspector is the DOM tree from Firefox's debugger.
>>
>> Only the Firefox debugger shows the correct HTML tree.
>>
>> My question is: how do I get that HTML tree using htmlunit?
>>
>> Thanks,
>>
>> Xuefeng
>>
>> ------------------------------------------------------------------------------
>> Check out the vibrant tech community on one of the world's most
>> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
>>
>> _______________________________________________
>> Htmlunit-user mailing list
>> Htm...@li...
>> https://lists.sourceforge.net/lists/listinfo/htmlunit-user
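
A note on the polling loop quoted in this thread (`while (manager.getJobCount() > 0)`): it cannot end if the page keeps rescheduling interval timers, which would be consistent with the job count sticking at 3. One possible alternative, sketched below, is to stop when the count reaches zero, when it stops changing for a few polls, or when a poll budget runs out. The class name `BoundedJobWait`, the method `waitUntilIdleOrStable`, and its parameters are hypothetical names made up for this sketch, not part of HtmlUnit; the job source is abstracted as an `IntSupplier` so the example runs on the JDK alone.

```java
import java.util.function.IntSupplier;

// Hypothetical helper, not an HtmlUnit class.
final class BoundedJobWait {

    /**
     * Calls waitAndCount repeatedly and stops when it reports 0 jobs,
     * when the reported count is unchanged for stableRounds consecutive
     * polls, or after maxRounds polls. Returns the last observed count,
     * or -1 if no poll was made (maxRounds <= 0).
     */
    static int waitUntilIdleOrStable(IntSupplier waitAndCount,
                                     int stableRounds, int maxRounds) {
        int previous = -1;   // no count observed yet
        int unchanged = 0;   // consecutive polls with the same count
        int current = -1;
        for (int round = 0; round < maxRounds; round++) {
            current = waitAndCount.getAsInt();
            if (current == 0) {
                return 0;    // nothing left to run
            }
            unchanged = (current == previous) ? unchanged + 1 : 0;
            if (unchanged >= stableRounds) {
                return current;  // only recurring jobs seem to remain
            }
            previous = current;
        }
        return current;
    }

    public static void main(String[] args) {
        // Simulated job counts: they drop, then stay at 3.
        int[] simulated = {20, 7, 3, 3, 3, 3, 3};
        int[] index = {0};
        IntSupplier fake =
            () -> simulated[Math.min(index[0]++, simulated.length - 1)];
        int remaining = waitUntilIdleOrStable(fake, 2, 50);
        System.out.println("remaining jobs: " + remaining);
    }
}
```

With HtmlUnit, the supplier could be something like `() -> webClient.waitForBackgroundJavaScript(10000)`, since that method returns the number of background jobs still pending when it returns. Treating a stable non-zero count as "done" is an assumption that the remaining jobs are recurring timers; clearing the intervals first, as suggested earlier in the thread, is the other way to get the count to zero.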