From: Stephen P. <st...@lo...> - 2016-03-04 18:18:16
Hi, Ahmed.

That's all well and good, but when you run against the full HTML of the whole page, or against the live site, this is what I get as output:

    /*
     *
     *
     *
     */
    package hutesting;

    import com.gargoylesoftware.htmlunit.BrowserVersion;
    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.DomNode;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;
    import java.util.List;

    /**
     *
     * @author spaulsen
     */
    public class HUTesting {

        /**
         * @param args the command line arguments
         * @throws java.lang.Exception
         */
        public static void main(String[] args) throws Exception {
            try (final WebClient webClient = new WebClient(BrowserVersion.CHROME)) {

                // String url = "http://localhost:8888/vendor.html";
                String url = "http://www.bhphotovideo.com/c/search?Ntt=kas-"; // Yes, this is a live site. Be nice.
                HtmlPage page = webClient.getPage(url);
                List<DomNode> nodeProduct = (List<DomNode>) page.getByXPath("//*[@data-selenium='itemDetail']");

                if (nodeProduct.size() > 0) {
                    for (DomNode e : nodeProduct) {
                        List<DomNode> b = (List<DomNode>) e.getByXPath("//span[@itemprop='brand']");
                        System.out.println(b);
                    }
                }
            }
        }
    }

Output:

    run:
    Mar 04, 2016 1:06:55 PM com.gargoylesoftware.htmlunit.html.HtmlPage loadExternalJavaScriptFile
    ( Irrelevant and can be ignored )
    Mar 04, 2016 1:06:55 PM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify
    Mar 04, 2016 1:06:56 PM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error
    Mar 04, 2016 1:06:56 PM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler warning
    Mar 04, 2016 1:06:56 PM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error
    Mar 04, 2016 1:06:56 PM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler warning
    Mar 04, 2016 1:06:56 PM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error
    Mar 04, 2016 1:06:56 PM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error
    Mar 04, 2016 1:06:56 PM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error
    Mar 04, 2016 1:06:56 PM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error
    Mar 04, 2016 1:06:56 PM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error
    Mar 04, 2016 1:06:56 PM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error
    Mar 04, 2016 1:06:56 PM com.gargoylesoftware.htmlunit.javascript.host.css.CSSStyleSheet pixelValue
    Mar 04, 2016 1:06:57 PM com.gargoylesoftware.htmlunit.javascript.host.css.CSSStyleSheet pixelValue
    Mar 04, 2016 1:06:57 PM com.gargoylesoftware.htmlunit.javascript.host.css.CSSStyleSheet pixelValue
    []
    []
    []
    []
    []
    []
    []
    []
    []
    []
    []
    []
    []
    []
    []
    []
    []
    []
    []
    []
    []
    []
    []
    []
    BUILD SUCCESSFUL (total time: 6 seconds)

-----
Stephen M. Paulsen
Lowing Light & Grip

> On Mar 4, 2016, at 4:42 AM, Ahmed Ashour <asa...@ya...> wrote:
>
> Hi,
>
> As hinted earlier, you need to add "//" before span
>
> The below code prints something:
>
>     public static void main(String[] args) throws Exception {
>         try (final WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
>
>             String url = "http://localhost:8080/snippet.html";
>             HtmlPage page = webClient.getPage(url);
>             List<DomNode> nodeProduct = (List<DomNode>) page.getByXPath("//*[@data-selenium='itemDetail']");
>
>             if (nodeProduct.size() > 0) {
>                 for (DomNode e : nodeProduct) {
>                     List<DomNode> b = (List<DomNode>) e.getByXPath("//span[@itemprop='brand']");
>                     System.out.println(b);
>                 }
>             }
>         }
>     }
>
>
> From: Stephen Paulsen <st...@lo...>
> To: Ahmed Ashour <asa...@ya...>; htm...@li...
> Sent: Thursday, March 3, 2016 7:45 PM
> Subject: Re: [Htmlunit-user] Nested getByXPath Has Me All Confused
>
> Hi, Ahmed.
>
> Attached is a ZIP file which includes 3 text files:
>
> vendor.html
> snippet.html
> analyzeResults.txt
>
> I've obscured the obvious information about the vendor.
>
> You can see in the full vendor.html that there is a lot going on. I have been able to separate out the 24 snippets that I need with the data-selenium='itemDetail'; however, even though the documentation and your note indicate the //div... should work, it does not. I've not yet tried the "contains" construction of the parameter, but I do not think that would explain why the search path doesn't work as is.
>
> When I apply the itemprop='brand' to the snippet, I get zero results. When I apply the //span to the snippet alone, I get *all* 24 brand listings from the complete page, even though I am asking only about the specific element in e.
>
> The point is to scrape the brand, name, and price from all 24 results returned by the search.
>
> The analyzeResults.txt is the Java I have been using. You can see some of the variations I have used in constructing the search for the brand. Until that works, I have given up on the searches for the related product name and price.
>
> Your thoughts?
>
> Thanks!
>
> ~ Steve
>
> -----
> Stephen M. Paulsen
> Lowing Light & Grip
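
A note on the "all 24 brand listings" symptom described in the quoted message: in XPath 1.0, a path that begins with "//" is absolute and is evaluated from the document root, whatever node it is called on, while ".//" restricts the search to descendants of the context node. The sketch below illustrates that difference using only the JDK's javax.xml.xpath rather than HtmlUnit, on the assumption that DomNode.getByXPath follows the same XPath 1.0 rules. The class name, the helper method, and the three-item markup are hypothetical stand-ins, not the vendor page. It does not attempt to explain the empty lists returned for the live site.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class RelativeXPathDemo {

    // Hypothetical stand-in for the search results page: three product blocks,
    // each containing exactly one brand span.
    static final String XML =
          "<html><body>"
        + "<div data-selenium='itemDetail'><span itemprop='brand'>A</span></div>"
        + "<div data-selenium='itemDetail'><span itemprop='brand'>B</span></div>"
        + "<div data-selenium='itemDetail'><span itemprop='brand'>C</span></div>"
        + "</body></html>";

    /**
     * Evaluates innerXPath once per itemDetail node, using that node as the
     * context, and returns how many nodes each evaluation matched.
     */
    static List<Integer> matchesPerItem(String innerXPath) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(XML)));
        XPath xpath = XPathFactory.newInstance().newXPath();

        NodeList items = (NodeList) xpath.evaluate(
            "//*[@data-selenium='itemDetail']", doc, XPathConstants.NODESET);

        List<Integer> counts = new ArrayList<>();
        for (int i = 0; i < items.getLength(); i++) {
            NodeList brands = (NodeList) xpath.evaluate(
                innerXPath, items.item(i), XPathConstants.NODESET);
            counts.add(brands.getLength());
        }
        return counts;
    }

    public static void main(String[] args) throws Exception {
        // Absolute path: every item sees all brand spans in the document.
        System.out.println(matchesPerItem("//span[@itemprop='brand']"));  // [3, 3, 3]
        // Relative path: every item sees only its own brand span.
        System.out.println(matchesPerItem(".//span[@itemprop='brand']")); // [1, 1, 1]
    }
}
```

If the same rules hold in HtmlUnit, the equivalent change in the loop above would be e.getByXPath(".//span[@itemprop='brand']"), with a leading dot.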