From: RBRi <rb...@us...> - 2018-06-30 12:33:48
Hi Trevor,

I think the user list is a better place for questions.

HtmlUnit has no special code for .aspx pages, and none is needed (your browser does not know what kind of server technology is used to generate the HTML either).

And finally, your problem is really common: you can't interact with a control that is not visible (this is the same in real browsers).

I have tuned your code a bit, and at least with the latest HtmlUnit code this works fine. Please try it and report back if you still have problems.

~~~
String url = "https://www.bon.texas.gov/forms/apninq.asp";

try (final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_60)) {
    final HtmlPage page = webClient.getPage(url);
    webClient.waitForBackgroundJavaScript(1000);

    // HtmlUnit does not interact with invisible elements
    // we have to switch the tab first
    HtmlAnchor nameTabAnchor = page.getAnchorByText("First & Last Name");
    nameTabAnchor.click();

    HtmlTextInput firstNameField = page.getForms().get(1).getInputByName("firstname");
    HtmlTextInput lastNameField = page.getForms().get(1).getInputByName("lastname");

    // System.out.println(firstNameField.isDisplayed()); now this is visible
    firstNameField.type("Joseph");
    lastNameField.type("Smith");

    HtmlSubmitInput searchButton = page.getForms().get(1).getInputByValue("Submit");
    HtmlPage resultsPage = searchButton.click();

    // this should be the inner text of a heading tag which includes the name of someone
    // from the Texas database, but instead no h2 elements are found at all
    String str = resultsPage.getElementsByTagName("h2").get(0).getTextContent();
    System.out.println(str);
}
~~~

Thanks for your report and thanks for using HtmlUnit. I hope this is helpful for you.

---

**[bugs:#1971] Scraping ASPX pages with HtmlUnit**

**Status:** open
**Group:** 2.31
**Labels:** web scraping micrsoft asp.net aspx
**Created:** Fri Jun 29, 2018 08:27 PM UTC by Trevor Maliborski
**Last Updated:** Sat Jun 30, 2018 12:18 PM UTC
**Owner:** nobody

I'm currently trying to set up a web scraping tool for sites that present information from medical license databases. Many of the sites I've found and wish to use are .aspx sites. I haven't found much online that discusses scraping .aspx sites with HtmlUnit, and the information I have found has not been helpful.

The issue I'm having is similar to that which occurs on sites with AJAX: dynamic results are attached to the DOM when they arrive, but the DOM that HtmlUnit pulls from the headless browser only contains the static elements. I've tried pausing the scraper's main thread, tried using `waitForBackgroundJavaScript()`, and tried more explicit waiting techniques, e.g. something like this, which is copied from the HtmlUnit site:

~~~
for (int i = 0; i < 20; i++) {
    if (condition_to_happen_after_js_execution) {
        break;
    }
    synchronized (page) {
        page.wait(500);
    }
}
~~~

Here's the code I have right now for scraping one of the sites:

~~~
private static String scrapeTexasDatabase(String firstName, String lastName) {
    try {
        WebClient webClient = new WebClient(BrowserVersion.CHROME);
        HtmlPage homePage = webClient.getPage("https://www.bon.texas.gov/forms/apninq.asp");

        HtmlTextInput firstNameField = homePage.getForms().get(1).getInputByName("firstname");
        HtmlTextInput lastNameField = homePage.getForms().get(1).getInputByName("lastname");
        firstNameField.setValueAttribute(firstName);
        lastNameField.setValueAttribute(lastName);

        HtmlSubmitInput searchButton = homePage.getForms().get(1).getInputByValue("Submit");
        HtmlPage resultsPage = searchButton.click();

        // this should be the inner text of a heading tag which includes the name of someone
        // from the Texas database, but instead no h2 elements are found at all
        String str = resultsPage.getElementsByTagName("h2").get(0).getTextContent();
        return str.trim();
    } catch (Exception e) {
        System.out.println("Caught exception: " + e);
    }
    return null;
}
~~~

Searching for "Joseph Smith" on the Texas Nursing License site being used here yields a single result. After looking through the page source, I've found that an `<h2>` element is placed at the top of each result, a heading which holds a given license holder's name. This information is added dynamically, but I still need to be able to scrape the information for each search result.

Any help would be appreciated!
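
The explicit-wait loop quoted in the ticket can be wrapped in a small reusable helper. The following is only a sketch under stated assumptions: the class name `PollingWait` and the method `waitUntil` are hypothetical and not part of HtmlUnit, and it uses `Thread.sleep` in place of the `synchronized (page) { page.wait(500); }` idiom from the quoted snippet. The condition in `main` is a stand-in so the example runs without HtmlUnit on the classpath.

~~~java
import java.util.function.BooleanSupplier;

// Hypothetical helper, not part of HtmlUnit.
public final class PollingWait {

    private PollingWait() {
    }

    /**
     * Checks the condition up to maxAttempts times, sleeping sleepMillis
     * after every failed check. Returns true as soon as the condition
     * holds, or false if it never held within the attempts.
     */
    public static boolean waitUntil(BooleanSupplier condition, int maxAttempts, long sleepMillis)
            throws InterruptedException {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (condition.getAsBoolean()) {
                return true;
            }
            Thread.sleep(sleepMillis);
        }
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.currentTimeMillis();
        // Stand-in condition: becomes true roughly 50 ms after start.
        boolean ready = waitUntil(() -> System.currentTimeMillis() - start >= 50, 20, 10);
        System.out.println("condition met: " + ready);
    }
}
~~~

In a scraper, the condition would presumably be something like "the results page contains at least one `h2` element", checked against the page returned by the click. A helper like this only helps when the elements eventually appear; it does not address controls that are not visible.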