From: Ahmed A. <asa...@ya...> - 2016-03-04 21:25:11
Hi Stephen,

'brand' is a descendant of 'itemHeading', not of 'itemDetail'.

The code below works with the latest version (with a workaround for the failing JavaScript; a bug report should be created for this).

    public static void main(String[] args) throws Exception {
        try (final WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
            String url = "http://www.bhphotovideo.com/c/search?Ntt=kas-"; // Yes, this is a live site. Be nice.
            HtmlPage page = webClient.getPage(url);
            List<DomNode> nodeProduct = (List<DomNode>) page.getByXPath("//*[@data-selenium='itemHeading']");

            if (nodeProduct.size() > 0) {
                for (DomNode e : nodeProduct) {
                    System.out.println(e.asXml());
                    List<DomNode> b = (List<DomNode>) page.getByXPath("//span[@itemprop='brand']");
                    System.out.println(b);
                }
            }
        }
    }

From: Stephen Paulsen <st...@lo...>
To: Ahmed Ashour <asa...@ya...>
Cc: "htm...@li..." <htm...@li...>
Sent: Friday, March 4, 2016 7:18 PM
Subject: Re: [Htmlunit-user] Nested getByXPath Has Me All Confused

Hi, Ahmed.

That's all well and good, but when you run against the full HTML of the whole page, or against the live site, this is what I get as output:

    /*
     *
     *
     *
     */
    package hutesting;

    import com.gargoylesoftware.htmlunit.BrowserVersion;
    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.DomNode;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;
    import java.util.List;

    /**
     *
     * @author spaulsen
     */
    public class HUTesting {

        /**
         * @param args the command line arguments
         * @throws java.lang.Exception
         */
        public static void main(String[] args) throws Exception {
            try (final WebClient webClient = new WebClient(BrowserVersion.CHROME)) {

                // String url = "http://localhost:8888/vendor.html";
                String url = "http://www.bhphotovideo.com/c/search?Ntt=kas-"; // Yes, this is a live site. Be nice.
                HtmlPage page = webClient.getPage(url);
                List<DomNode> nodeProduct = (List<DomNode>) page.getByXPath("//*[@data-selenium='itemDetail']");

                if (nodeProduct.size() > 0) {
                    for (DomNode e : nodeProduct) {
                        List<DomNode> b = (List<DomNode>) e.getByXPath("//span[@itemprop='brand']");
                        System.out.println(b);
                    }
                }
            }
        }
    }

Output:

    run:
    Mar 04, 2016 1:06:55 PM com.gargoylesoftware.htmlunit.html.HtmlPage loadExternalJavaScriptFile
    ( Irrelevant and can be ignored )
    Mar 04, 2016 1:06:55 PM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify
    Mar 04, 2016 1:06:56 PM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error
    Mar 04, 2016 1:06:56 PM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler warning
    Mar 04, 2016 1:06:56 PM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error
    Mar 04, 2016 1:06:56 PM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler warning
    Mar 04, 2016 1:06:56 PM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error
    Mar 04, 2016 1:06:56 PM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error
    Mar 04, 2016 1:06:56 PM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error
    Mar 04, 2016 1:06:56 PM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error
    Mar 04, 2016 1:06:56 PM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error
    Mar 04, 2016 1:06:56 PM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error
    Mar 04, 2016 1:06:56 PM com.gargoylesoftware.htmlunit.javascript.host.css.CSSStyleSheet pixelValue
    Mar 04, 2016 1:06:57 PM com.gargoylesoftware.htmlunit.javascript.host.css.CSSStyleSheet pixelValue
    Mar 04, 2016 1:06:57 PM com.gargoylesoftware.htmlunit.javascript.host.css.CSSStyleSheet pixelValue
    []
    []
    []
    []
    []
    []
    []
    []
    []
    []
    []
    []
    []
    []
    []
    []
    []
    []
    []
    []
    []
    []
    []
    []
    BUILD SUCCESSFUL (total time: 6 seconds)

-----
Stephen M. Paulsen
Lowing Light & Grip

> On Mar 4, 2016, at 4:42 AM, Ahmed Ashour <asa...@ya...> wrote:
>
> Hi,
>
> As hinted earlier, you need to add "//" before span
>
> The below code prints something:
>
>     public static void main(String[] args) throws Exception {
>         try (final WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
>
>             String url = "http://localhost:8080/snippet.html";
>             HtmlPage page = webClient.getPage(url);
>             List<DomNode> nodeProduct = (List<DomNode>) page.getByXPath("//*[@data-selenium='itemDetail']");
>
>             if (nodeProduct.size() > 0) {
>                 for (DomNode e : nodeProduct) {
>                     List<DomNode> b = (List<DomNode>) e.getByXPath("//span[@itemprop='brand']");
>                     System.out.println(b);
>                 }
>             }
>         }
>     }
>
>
>
> From: Stephen Paulsen <st...@lo...>
> To: Ahmed Ashour <asa...@ya...>; htm...@li...
> Sent: Thursday, March 3, 2016 7:45 PM
> Subject: Re: [Htmlunit-user] Nested getByXPath Has Me All Confused
>
> Hi, Ahmed.
>
> Attached is a ZIP file which includes 3 text files:
>
> vendor.html
> snippet.html
> analyzeResults.txt
>
> I've obscured the obvious information about the vendor.
>
> You can see in the full vendor.html that there is a lot going on. I have been able to separate out the 24 snippets that I need with the data-selenium='itemDetail'; however, even though the documentation and your note indicate the //div... should work, it does not. I've not yet tried the "contains" construction of the parameter, but I do not think that would explain why the search path doesn't work as is.
>
> When I apply the itemprop='brand' to the snippet, I get zero results. When I apply the //span to the snippet alone, I get *all* 24 brand listings from the complete page, even though I am asking only about the specific element in e.
>
> The point is to scrape the brand, name, and price from all 24 results returned by the search.
>
> The analyzeResults.txt is the Java I have been using. You can see some of the variations I have used in constructing the search for the brand. Until that works, I have given up on the searches for the related product name and price.
>
> Your thoughts?
>
> Thanks!
>
> ~ Steve
>
>
>
> -----
> Stephen M. Paulsen
> Lowing Light & Grip
>
>
> ------------------------------------------------------------------------------
_______________________________________________
Htmlunit-user mailing list
Htm...@li...
https://lists.sourceforge.net/lists/listinfo/htmlunit-user
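A note on the behaviour reported in this thread, where "//span" evaluated on a single element returned all 24 brands: in XPath 1.0 a path that starts with "//" is absolute and is resolved from the root of the document that contains the context node, whereas ".//" restricts the search to descendants of the context node. The sketch below illustrates the difference using the JDK's built-in javax.xml.xpath API rather than HtmlUnit, on hypothetical, simplified markup (the class name, the XML string and the brand values are made up for illustration). It assumes that HtmlUnit's getByXPath applies the same XPath rules when called on a node, which is consistent with what Stephen observed but is not verified here.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class RelativeXPathDemo {

    // Hypothetical markup: two result blocks, each containing one brand span.
    static final String XML =
        "<results>"
        + "<div data-selenium='itemHeading'><span itemprop='brand'>BrandA</span></div>"
        + "<div data-selenium='itemHeading'><span itemprop='brand'>BrandB</span></div>"
        + "</results>";

    /** Evaluates expr with the first itemHeading element as the context node. */
    public static int countFromFirstHeading(String expr) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
        XPath xpath = XPathFactory.newInstance().newXPath();

        // Pick the first result block, playing the role of 'e' in the loop above.
        Node first = (Node) xpath.evaluate(
                "//*[@data-selenium='itemHeading']", doc, XPathConstants.NODE);

        // Evaluate the given expression relative to that node.
        NodeList hits = (NodeList) xpath.evaluate(expr, first, XPathConstants.NODESET);
        return hits.getLength();
    }

    public static void main(String[] args) throws Exception {
        // Absolute path: ignores the context node and searches the whole document.
        System.out.println(countFromFirstHeading("//span[@itemprop='brand']"));  // prints 2

        // Relative path: only descendants of the context node are searched.
        System.out.println(countFromFirstHeading(".//span[@itemprop='brand']")); // prints 1
    }
}
```

Applied to the loop in the thread, the relative form would presumably be e.getByXPath(".//span[@itemprop='brand']"); this has not been tested against the live site. Per Ahmed's reply, the brand span sits under 'itemHeading' rather than 'itemDetail', so a relative query on the 'itemDetail' nodes would still come back empty.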