From: Markus J. <mar...@op...> - 2019-04-02 11:07:25
Hello,

We extract text from various websites, including a Danish site for recipes etc. Using HtmlUnit 2.34.1 we were stopped by an NPE, but this is now fixed [2]. However, we still cannot get the text from the HTML. With debug logging on, we do see the recipe text of [1] being downloaded from an API [3]. It is then up to JavaScript to inject the text into the DOM, which doesn't appear to happen.

I tried many variations of the code, different waits, and very long waits, but nothing seems to work. There are exceptions, but they seem unrelated.

    // Client configuration
    client = new WebClient(BrowserVersion.CHROME);
    client.getOptions().setThrowExceptionOnScriptError(false);
    client.getOptions().setCssEnabled(false);
    client.getOptions().setJavaScriptEnabled(true);
    client.getOptions().setDownloadImages(false);
    client.getOptions().setThrowExceptionOnFailingStatusCode(false);
    client.getOptions().setPrintContentOnFailingStatusCode(false);
    client.getOptions().setUseInsecureSSL(true);
    client.getOptions().setRedirectEnabled(false);
    client.setJavaScriptTimeout(15000);
    client.waitForBackgroundJavaScript(10000L);
    client.waitForBackgroundJavaScriptStartingBefore(10000L);

    // Load the page, then wait on it for up to the configured timeout
    page = client.getPage(url);
    synchronized (page) {
      try {
        page.wait(conf.getInt("htmlunit.javascript.timeout", 15000));
      } catch (Exception e) {}
    }

    // Wait again for background JavaScript to finish
    client.waitForBackgroundJavaScript(10000L);
    client.waitForBackgroundJavaScriptStartingBefore(10000L);
    webResponse = page.getWebResponse();

The text I am interested in is in the element <section id="sectionDetailsMain">, but it is never created/added to the DOM.

Can anyone help me get the HTML properly filled by JavaScript?

Many thanks,
Markus

[1] https://www.aarstiderne.com/find-din-maaltidskasse/kvikkassen
[2] https://sourceforge.net/p/htmlunit/bugs/2008/
[3] https://www.aarstiderne.com/umbraco/api/productapi/Products?url=maaltidskasser
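
For illustration only, the "wait until the element shows up" idea described in this message can be sketched as a small poll-until-condition loop. This is a hedged sketch, not part of the original code and not a confirmed fix for this site: the class name PollUntil and the method waitUntil are hypothetical, and the sketch uses only the Java standard library. With HtmlUnit, the condition might check whether page.getElementById("sectionDetailsMain") returns non-null, assuming page is an HtmlPage.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper (not from the original message): polls a condition
// until it becomes true or a timeout expires.
public class PollUntil {

    // Returns true as soon as the condition holds, false if the timeout
    // elapses first. The condition is always checked at least once.
    public static boolean waitUntil(BooleanSupplier condition,
                                    long timeoutMillis,
                                    long intervalMillis) throws InterruptedException {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            Thread.sleep(intervalMillis);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Stand-in condition that becomes true after about 200 ms.
        // With HtmlUnit (assumption), the condition could instead be:
        //   () -> page.getElementById("sectionDetailsMain") != null
        long start = System.currentTimeMillis();
        boolean appeared = waitUntil(
                () -> System.currentTimeMillis() - start >= 200, 2000, 50);
        System.out.println("condition met: " + appeared);
    }
}
```

Polling for the specific element makes a timeout distinguishable from success, which a fixed wait does not. If the page script never injects the element at all, the loop simply returns false after the timeout.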