From: navneet s. <nav...@gm...> - 2014-06-23 08:04:41
Hi,

I have a requirement of crawling a few sites and extracting product information such as name, price, and image. Using Java and the Apache HTTP library, I can access static websites where the content is delivered in one call from the server. But I am not getting any useful response when using the same approach for sites that rely on JavaScript/AJAX and framesets, for example www.kmart.com.

Can you please tell me how I can use your API to get the final content of a page, i.e. the content that is actually shown by the browser? From the documentation you have published I believe it is possible, but I am not getting the correct results. Here is the code I am using:

    import java.io.IOException;
    import java.util.List;
    import java.util.logging.Level;
    import java.util.logging.Logger;

    import com.gargoylesoftware.htmlunit.BrowserVersion;
    import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
    import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlDivision;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;

    final WebClient webClient = new WebClient(BrowserVersion.CHROME);
    try {
        webClient.getOptions().setJavaScriptEnabled(true);
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
        webClient.getOptions().setCssEnabled(false);
        webClient.getOptions().setRedirectEnabled(true);
        // re-synchronize AJAX calls so their responses end up in the page
        webClient.setAjaxController(new NicelyResynchronizingAjaxController());

        final HtmlPage page = webClient.getPage("http://www.kmart.com/");
        // wait (up to 100 s) for background JavaScript started by the page
        webClient.waitForBackgroundJavaScript(100000);

        System.out.println("is page ready::" + page.getReadyState());
        System.out.println(page.asText());

        // get the list of all divs
        final List<?> divs = page.getByXPath("//div");
        // get the div whose id is 'gnf_header'
        final HtmlDivision div =
            (HtmlDivision) page.getByXPath("//div[@id='gnf_header']").get(0);

        webClient.closeAllWindows();
    } catch (IOException | FailingHttpStatusCodeException ex) {
        Logger.getLogger(HtmlUnitForKmart.class.getName()).log(Level.SEVERE, null, ex);
    }

Thanks for your response.
Navneet Sharma