From: Tobias C. <ma...@ce...> - 2013-12-18 13:36:21
Hello,

this is my first post to the list. I am a hobbyist trying to scrape a site that shows a table with some information I am interested in. As you can see in the code below, I load a page, click some links, and submit a form. This works very well.

Then things get complicated. In a browser you see a button that is generated by a server-side script (aspx). Clicking the button executes another server-side script. This is how I get the link to the second server-side script:

link = page.getByXPath("//a[@onclick]")[0]

After clicking it I have an HTML page. This page contains an iframe with a nested server-side script (aspx). With frame = page.getFrames().get(0) and page = frame.getEnclosedPage() I successfully retrieve the HTML page of this first frame. That page in turn contains two more nested server-side scripts. But when I try to retrieve their HTML page, I only get a JavaScriptPage with the content: "you cannot open directly."

To me, this ugly nesting of frames and server-side scripts looks like it is meant to prevent scraping. Can anybody help me?

Regards,
Tobias

Here is my code:

# Jython script (Python 2 syntax) driving HtmlUnit
import com.gargoylesoftware.htmlunit.WebClient as WebClient
import com.gargoylesoftware.htmlunit.BrowserVersion as BrowserVersion

def main():
    webclient = WebClient(BrowserVersion.FIREFOX_17)
    url = "<URL>"
    page = webclient.getPage(url)
    print "new page loaded: "+url

    # Follow the link to the page with the login form
    link = page.getByXPath("//a[@href='index.php?id=733']")[1]
    page = link.click()
    print "link clicked and new page loaded: "+page.getUrl().toString()

    # Fill in and submit the login form
    form = page.getByXPath("//form[@action='index.php?id=intern']")[0]
    user = form.getInputByName("user")
    passw = form.getInputByName("pass")
    button = form.getInputByName("submit")
    user.setValueAttribute("cgl")
    passw.setValueAttribute("rattamahatta")
    page = button.click()
    print "form submitted and new page loaded: "+page.getUrl().toString()

    link = page.getByXPath("//a[@href='index.php?id=837']")[0]
    page = link.click()
    print "link clicked and new page loaded: "+page.getUrl().toString()

    # Click the onclick link (the button rendered by the aspx script)
    link = page.getByXPath("//a[@onclick]")[0]
    page = link.click()
    print "button clicked and new page loaded: "+page.getUrl().toString()

    # Descend into the first iframe
    frame = page.getFrames().get(0)
    page = frame.getEnclosedPage()
    print "new iframe loaded: "+page.getUrl().toString()

    # Descend into the first nested frame; this is where the JavaScriptPage comes back
    frames = page.getFrames()
    page1 = frames.get(0).getEnclosedPage()
    print "new iframe loaded: "+page1.getUrl().toString()
    # break

if __name__ == '__main__':
    main()
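
To see at which nesting level the JavaScriptPage appears, it may help to list the whole frame tree instead of descending one frame at a time by hand. The following is only a sketch under some assumptions. `dump_frames` is a hypothetical helper, not part of HtmlUnit. It relies only on `getUrl()`, `getFrames()` and `getEnclosedPage()`, the same calls used in the script above, and treats any page without `getFrames()` as a leaf. `FakePage`, `FakeFrame` and `FakeScriptPage` are stand-ins invented here so the sketch can be run without HtmlUnit or the real site.

```python
def dump_frames(page, depth=0, lines=None):
    """Recursively collect one line per page: indentation, class name, URL.

    Assumes each page has getUrl(); pages without getFrames()
    (for example a JavaScriptPage) end the recursion.
    """
    if lines is None:
        lines = []
    lines.append("%s%s %s" % ("  " * depth, type(page).__name__, str(page.getUrl())))
    # Only HTML pages offer getFrames(); other page types are leaves.
    get_frames = getattr(page, "getFrames", None)
    if get_frames is not None:
        for frame in get_frames():
            dump_frames(frame.getEnclosedPage(), depth + 1, lines)
    return lines


# Hypothetical stand-ins so the sketch runs without HtmlUnit.
class FakePage(object):
    """Stand-in for an HTML page that may contain frames."""
    def __init__(self, url, frames=None):
        self._url = url
        self._frames = frames or []

    def getUrl(self):
        return self._url

    def getFrames(self):
        return self._frames


class FakeFrame(object):
    """Stand-in for a frame window wrapping one enclosed page."""
    def __init__(self, page):
        self._page = page

    def getEnclosedPage(self):
        return self._page


class FakeScriptPage(object):
    """Stand-in for a non-HTML page: it has no getFrames()."""
    def __init__(self, url):
        self._url = url

    def getUrl(self):
        return self._url


if __name__ == "__main__":
    # Made-up URLs, purely for illustration.
    inner = FakeScriptPage("http://example.invalid/c.aspx")
    middle = FakePage("http://example.invalid/b.aspx", [FakeFrame(inner)])
    top = FakePage("http://example.invalid/a.html", [FakeFrame(middle)])
    for line in dump_frames(top):
        print(line)
```

In the script above, `dump_frames(page)` could be called right after the click on the onclick link, printing each returned line. This only shows where in the frame tree the non-HTML page turns up; it does not explain why the server answers with "you cannot open directly."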