From: RBRi <rb...@us...> - 2018-07-03 18:29:17

When running my code I got 'JOSEPHINE H SMITH '. What do you think you should get?

---

** [bugs:#1971] Scraping ASPX pages with HtmlUnit**

**Status:** pending
**Group:** 2.31
**Labels:** web scraping micrsoft asp.net aspx
**Created:** Fri Jun 29, 2018 08:27 PM UTC by Trevor Maliborski
**Last Updated:** Tue Jul 03, 2018 02:26 PM UTC
**Owner:** RBRi

I'm currently trying to set up a web scraping tool for sites that present information from medical license databases. Many of the sites I've found and wish to use are .aspx sites. I haven't found much online that discusses scraping .aspx sites with HtmlUnit, and the information I have found has not been helpful.

The issue I'm having is similar to the one that occurs on sites with AJAX: dynamic results are attached to the DOM when they arrive, but the DOM that HtmlUnit pulls from the headless browser contains only the static elements. I've tried pausing the scraper's main thread, using `waitForBackgroundJavascript()`, and using more explicit waiting techniques, e.g. something like this, which is copied from the HtmlUnit site:

~~~
for (int i = 0; i < 20; i++) {
    if (condition_to_happen_after_js_execution) {
        break;
    }
    synchronized (page) {
        page.wait(500);
    }
}
~~~

Here's the code I have right now for scraping one of the sites:

~~~
private static String scrapeTexasDatabase(String firstName, String lastName) {
    try {
        WebClient webClient = new WebClient(BrowserVersion.CHROME);
        HtmlPage homePage = webClient.getPage("https://www.bon.texas.gov/forms/apninq.asp");

        HtmlTextInput firstNameField = homePage.getForms().get(1).getInputByName("firstname");
        HtmlTextInput lastNameField = homePage.getForms().get(1).getInputByName("lastname");
        firstNameField.setValueAttribute(firstName);
        lastNameField.setValueAttribute(lastName);

        HtmlSubmitInput searchButton = homePage.getForms().get(1).getInputByValue("Submit");
        HtmlPage resultsPage = searchButton.click();

        // this should be the inner text of a heading tag which includes the name of someone
        // from the Texas database, but instead no h2 elements are found at all
        String str = resultsPage.getElementsByTagName("h2").get(0).getTextContent();
        return str.trim();
    } catch (Exception e) {
        System.out.println("Caught exception: " + e);
    }
    return null;
}
~~~

Searching for "Joesph Smith" on the Texas Nursing License site being used here yields a single result. After looking through the page source, I've found that an `<h2>` element is placed at the top of each result, a heading which holds the license holder's name. This information is added dynamically, but I still need to be able to scrape it for each search result.

Any help would be appreciated!
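
For reference, the bounded polling loop quoted in the ticket can be wrapped in a small reusable helper. The following is only a sketch under stated assumptions: `PollingWait` and `waitUntil` are hypothetical names, not part of HtmlUnit, and the sketch substitutes `Thread.sleep` for the `page.wait(500)` call in the quoted loop so that it depends on nothing but the JDK.

~~~java
import java.util.function.BooleanSupplier;

// Hypothetical helper (not an HtmlUnit class): polls a condition a bounded number of times.
final class PollingWait {

    private PollingWait() {
    }

    // Checks the condition up to maxAttempts times, sleeping delayMillis after each failed check.
    // Returns true as soon as the condition holds; returns false if it never held
    // or if the waiting thread was interrupted.
    static boolean waitUntil(BooleanSupplier condition, int maxAttempts, long delayMillis) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (condition.getAsBoolean()) {
                return true;
            }
            try {
                // Assumption: a plain sleep stands in for the page.wait(500) of the quoted loop.
                Thread.sleep(delayMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag so callers can still observe the interruption.
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false;
    }
}
~~~

With the `resultsPage` variable from the ticket, a call might look like `PollingWait.waitUntil(() -> !resultsPage.getElementsByTagName("h2").isEmpty(), 20, 500)`. Whether the presence of an `h2` element is the right condition for this particular site is an assumption, and the boolean result only says whether the condition was seen within the attempt limit.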