From: David M. G. <mic...@gm...> - 2013-12-17 07:29:01
Hi all,

For the following link:

    http://www.scielo.org.co/scielo.php?script=sci_arttext&pid=S0120-99572012000600002

I want to get the address of the PDF. When I click the link with the text "Article in pdf format", I land on a refresh page. How do I know that I am on a refresh page, and how many seconds do I have to wait? Alternatively, how can I get the PDF link without waiting?

I tried to find a solution on my own. I wanted to parse the source of the redirect page and extract the following tag (which is visible with "view source"):

    <meta name="added" content="7;URL= http://www.scielo.org.co/pdf/rcg/v27s2/v27s2a02.pdf" http-equiv="refresh">

When I try to do the same with HtmlUnit:

    import java.io.IOException;
    import java.net.MalformedURLException;
    import java.util.List;

    import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
    import com.gargoylesoftware.htmlunit.Page;
    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;
    import com.google.common.collect.ImmutableList;

    public class Main {
        public static void main(String[] args)
                throws FailingHttpStatusCodeException, MalformedURLException, IOException {
            WebClient webClient = new WebClient();
            List<String> urls = ImmutableList.of(
                "http://www.scielo.org.co/scielo.php?script=sci_arttext&pid=S0120-99572012000600002");
            for (String url : urls) {
                HtmlPage page = webClient.getPage(url);
                List<HtmlAnchor> anchors = page.getAnchors();
                for (HtmlAnchor anchor : anchors) {
                    String linkText = anchor.getTextContent();
                    // Follow every link whose text mentions "pdf"
                    if (linkText.contains("pdf")) {
                        Page pdfPage = anchor.click();
                        // Dump the page that the click returns, if it is HTML
                        if (pdfPage.isHtmlPage()) {
                            HtmlPage p = (HtmlPage) pdfPage;
                            System.out.println(p.asXml());
                        }
                    }
                }
            }
        }
    }

I don't get this tag in the resulting XML.

In summary, I have the following questions:

1. How do I know that I am on a refresh page, and how many seconds will I have to wait?
2. After identifying that I am on a refresh page, how do I wait and then refresh after that number of seconds?
3. How can I get the PDF link without waiting?
4. Where did the <meta name="added" content="7;URL= http://www.scielo.org.co/pdf/rcg/v27s2/v27s2a02.pdf" http-equiv="refresh"> tag disappear to?

Thanks,
David
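
For reference, the content value of the tag quoted above has the form "<seconds>;URL=<target>". Below is a minimal sketch of how that string could be split into the delay and the PDF address once the tag is available. It uses only the standard Java regex classes and does not touch HtmlUnit; the class name MetaRefresh and its parse method are hypothetical names made up for this illustration, and the pattern assumes a whole number of seconds followed by a URL without embedded whitespace.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of HtmlUnit): holds the two parts of a
// meta refresh "content" value such as "7;URL= http://example.org/x.pdf".
public class MetaRefresh {
    // "<seconds>;URL=<target>", case-insensitive, whitespace tolerated
    // around the separator and the equals sign.
    private static final Pattern REFRESH = Pattern.compile(
            "^\\s*(\\d+)\\s*;\\s*url\\s*=\\s*(\\S+)\\s*$",
            Pattern.CASE_INSENSITIVE);

    public final int delaySeconds;
    public final String url;

    private MetaRefresh(int delaySeconds, String url) {
        this.delaySeconds = delaySeconds;
        this.url = url;
    }

    // Returns the parsed value, or null if the string is not of the
    // assumed "<seconds>;URL=<target>" form.
    public static MetaRefresh parse(String content) {
        if (content == null) {
            return null;
        }
        Matcher m = REFRESH.matcher(content);
        if (!m.matches()) {
            return null;
        }
        return new MetaRefresh(Integer.parseInt(m.group(1)), m.group(2));
    }

    public static void main(String[] args) {
        MetaRefresh r = parse(
                "7;URL= http://www.scielo.org.co/pdf/rcg/v27s2/v27s2a02.pdf");
        System.out.println(r.delaySeconds + " -> " + r.url);
    }
}
```

This only covers the string handling for the first and third questions (how long the wait is, and what the target address is). It does not explain why the tag is missing from the XML that HtmlUnit returns.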