From: Markus J. <mar...@op...> - 2019-04-02 11:03:36
Hello,
We extract text from various websites, including a Danish site for recipes etc. With HtmlUnit 2.34.1 we were stopped by an NPE, but that is now fixed [2].
However, we still cannot get the text from the HTML. With debug logging on, we do see the recipe text of [1] being downloaded from an API [3]. It is then up to JavaScript to inject that text into the DOM, which does not appear to happen.
I tried many variations of the code, including different waits and very long waits, but nothing seems to work. There are exceptions, but they seem unrelated.
// Configure the client: JavaScript on, CSS and images off,
// no exceptions on script errors or failing status codes.
client = new WebClient(BrowserVersion.CHROME);
client.getOptions().setThrowExceptionOnScriptError(false);
client.getOptions().setCssEnabled(false);
client.getOptions().setJavaScriptEnabled(true);
client.getOptions().setDownloadImages(false);
client.getOptions().setThrowExceptionOnFailingStatusCode(false);
client.getOptions().setPrintContentOnFailingStatusCode(false);
client.getOptions().setUseInsecureSSL(true);
client.getOptions().setRedirectEnabled(false);
client.setJavaScriptTimeout(15000);
client.waitForBackgroundJavaScript(10000L);
client.waitForBackgroundJavaScriptStartingBefore(10000L);
// Load the page, then block this thread for up to the configured timeout.
page = client.getPage(url);
synchronized (page) {
    try {
        page.wait(conf.getInt("htmlunit.javascript.timeout", 15000));
    } catch (Exception e) {
        // ignored
    }
}
// Wait again for background JavaScript before reading the response.
client.waitForBackgroundJavaScript(10000L);
client.waitForBackgroundJavaScriptStartingBefore(10000L);
webResponse = page.getWebResponse();
The text I am interested in is in the element <section id="sectionDetailsMain">, but that element is never created or added to the DOM. Can anyone help me get the HTML properly filled in by JavaScript?
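To make "never created" concrete, here is a minimal sketch of how a check for the element could be polled rather than covered by one long fixed wait. This is an illustration only: `DomPoller` and `pollUntil` are hypothetical names and not part of HtmlUnit, and the sketch uses only the JDK so it runs on its own.

```java
import java.util.Optional;
import java.util.function.Supplier;

/** Hypothetical helper (not part of HtmlUnit): retries a lookup until it succeeds or times out. */
final class DomPoller {
    private DomPoller() {}

    /**
     * Calls lookup repeatedly until it returns a non-null value or timeoutMillis
     * have elapsed, sleeping intervalMillis between attempts.
     * Returns Optional.empty() on timeout or if the thread is interrupted.
     */
    static <T> Optional<T> pollUntil(Supplier<T> lookup, long timeoutMillis, long intervalMillis) {
        final long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            T found = lookup.get();
            if (found != null) {
                return Optional.of(found);
            }
            // Overflow-safe comparison against the deadline.
            if (System.nanoTime() - deadline >= 0) {
                return Optional.empty();
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and give up.
                Thread.currentThread().interrupt();
                return Optional.empty();
            }
        }
    }
}
```

With an HtmlUnit page, the assumed usage would be something like `DomPoller.pollUntil(() -> page.getElementById("sectionDetailsMain"), 15000, 500)`. An empty result after the timeout corresponds to the behaviour described above.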
Many thanks,
Markus
[1] https://www.aarstiderne.com/find-din-maaltidskasse/kvikkassen
[2] https://sourceforge.net/p/htmlunit/bugs/2008/
[3] https://www.aarstiderne.com/umbraco/api/productapi/Products?url=maaltidskasser