From: Teryl T. <ter...@gm...> - 2015-11-12 23:41:18
|
Hi guys,

There is a website that runs a Flash video:

http://fast.wistia.net/embed/playlists/ba40yik7fl?loop=true&autoPlay=true&controlsVisibleOnLoad=false&version=v1&videoFoam=true

I'm trying to find the embed or object tags that launch the video, so I did the following:

    // webClient, url and clientIP are declared elsewhere (not shown in this post).
    BrowserVersion browser = BrowserVersion.FIREFOX_38;

    // Register the plugins a desktop Firefox would report.
    PluginConfiguration acrobat = new PluginConfiguration("Adobe Acrobat", "Adobe PDF Plug-In For Firefox and Netscape 11.0.12", "11", "nppdf32.dll");
    browser.getPlugins().add(acrobat);
    PluginConfiguration flash = new PluginConfiguration("Shockwave Flash", "Shockwave Flash 19.0 r0", "19", "NPSWF32_19_0_0_226.dll"); //17,0,0,188
    flash.getMimeTypes().add(new PluginConfiguration.MimeType("application/x-shockwave-flash", "Shockwave Flash", "swf"));
    browser.getPlugins().add(flash);

    webClient.addRequestHeader("ClientIP", clientIP);
    webClient.getOptions().setThrowExceptionOnScriptError(false);
    webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
    webClient.getOptions().setAppletEnabled(true);
    webClient.getOptions().setCssEnabled(true);
    webClient.getOptions().setTimeout(20000);
    webClient.setJavaScriptTimeout(5000);
    webClient.getOptions().setJavaScriptEnabled(true);
    webClient.getCookieManager().setCookiesEnabled(true);
    webClient.setAjaxController(new NicelyResynchronizingAjaxController());

    Page p = webClient.getPage(url);
    if (p != null && p.isHtmlPage())
    {
        // Give background scripts up to 30 seconds to finish.
        int scriptsWaiting = webClient.waitForBackgroundJavaScript(30000);
        System.out.println("Scripts waiting: " + Integer.toString(scriptsWaiting));
        HtmlPage page = (HtmlPage) p;

        java.util.List<HtmlEmbed> embeds = (List<HtmlEmbed>) page.getByXPath("//embed");
        for (HtmlEmbed embed : embeds)
        {
            System.out.println("Embed tag found..\n");
        }

        java.util.List<HtmlObject> objects = (List<HtmlObject>) page.getByXPath("//object");
        for (HtmlObject object : objects)
        {
            System.out.println("Object tag found..\n");
        }
    }

The website just sits there for the 30 seconds, and no tags are found. Everything is generated by JavaScript. Am I searching for the tags in the proper way? Or is there some DOM object I should be checking? There are no exceptions in HtmlUnit, and the site loads up great in regular Firefox.

Any advice you could give would be great.

Best,
Teryl |
From: Ahmed A. <asa...@ya...> - 2015-11-13 10:03:04
Attachments:
blob.jpg
|
Hi Teryl,

In real Chrome, no elements are generated either. I am not sure how the website works. I guess you need to dig into the JavaScript; please read http://htmlunit.sourceforge.net/submittingJSBugs.html

Ahmed |
From: tptaylor <ter...@gm...> - 2015-11-13 23:07:59
|
Hi Ahmed,

Thanks for the reply. I've been doing some digging. The script actually runs as an event loop thread; it builds an object tag with embed tags inside and adds it to the HTMLDocument of the owning page. I verified in HtmlUnit that it gets added by printing out the owner document of the object tag. Its URL corresponds to the page I give it.

Basically, the JavaScript creates the object tag as a string, creates an HTMLDivElement, and then adds the object to the div by setting div.innerHTML. The div's owning document is the HTMLDocument.

I have a couple of questions around this. First, are HtmlPage and HTMLDocument one and the same? If so, should calling getByXPath on the HtmlPage find the elements that were added, even when they were added through innerHTML? And if so, is there a page listener that listens for additions and removals of tags from the page?

I'll keep digging to see if I can characterize it more.

Thanks again,
Teryl

-- View this message in context: http://htmlunit.10904.n7.nabble.com/Question-about-scraping-object-embed-tags-from-website-tp37745p37749.html Sent from the HtmlUnit - General mailing list archive at Nabble.com. |
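To make the question above concrete (does an XPath query see elements that were attached after the page was parsed?), here is a minimal sketch. It is only a stand-in: it uses the JDK's own DOM and XPath classes, not HtmlUnit, and the class and method names (LiveDomXPathSketch, parsePage, countPluginTags, injectPlayer) are made up for illustration. It does not show how HtmlUnit's getByXPath behaves on the Wistia page; it only shows the general idea that an XPath query runs against the DOM tree as it exists at the moment of the call.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Stand-in built on the JDK DOM/XPath APIs (assumption: NOT HtmlUnit).
class LiveDomXPathSketch {

    // Parses a tiny well-formed page that has no <object> or <embed> yet.
    static Document parsePage() {
        String markup = "<html><body><div id=\"player\"/></body></html>";
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample markup", e);
        }
    }

    // Counts <object> and <embed> elements anywhere in the current tree.
    static int countPluginTags(Document doc) {
        try {
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//object | //embed", doc, XPathConstants.NODESET);
            return hits.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("invalid XPath expression", e);
        }
    }

    // Mimics what the page script is described as doing: an <object> with a
    // nested <embed> is attached under an existing <div> after parsing.
    static void injectPlayer(Document doc) {
        Element object = doc.createElement("object");
        Element embed = doc.createElement("embed");
        embed.setAttribute("type", "application/x-shockwave-flash");
        object.appendChild(embed);
        Element div = (Element) doc.getElementsByTagName("div").item(0);
        div.appendChild(object);
    }

    public static void main(String[] args) {
        Document doc = parsePage();
        System.out.println("before: " + countPluginTags(doc)); // before: 0
        injectPlayer(doc);
        System.out.println("after: " + countPluginTags(doc));  // after: 2
    }
}
```

In this stand-in, the second query finds both tags because they are attached to the same tree that is being queried. If a comparable check against the HtmlPage finds nothing, one thing worth verifying is whether the nodes the script created were ever attached to the page's tree, or only created with the page as their owner document.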
From: tptaylor <ter...@gm...> - 2015-11-14 23:44:08
|
Hi Ahmed,

How are HTMLObjectElement and HtmlObject linked? I have verified that an HTMLObjectElement object is created by the JavaScript and that it is added to the document object: when I check its ownerDocument, it is the root URL given above.

I also added a WebWindowListener to the WebClient. The WebWindowListener is fired multiple times, but not when the HTMLObjectElement is added, and the HtmlObject doesn't appear in the HtmlPage object.

I was thinking of adding a DOM listener to see whether it is triggered when the DOM is changed, but I'm not sure where I would do that. Is that something you add to the HtmlPage directly after getPage is called?

Cheers,
Teryl |
From: Ahmed A. <asa...@ya...> - 2015-11-16 09:38:09
|
Hi Teryl,

HTMLElement and its descendants are the JavaScript-side objects, and each one has its DomNode assigned. The DomNode classes form a parallel hierarchy that handles the relations between elements.

You can use htmlPage.addDomChangeListener.

Also, you could post your code so far, so others can continue from where you have reached.

Ahmed |