From: Damon G. <dam...@ho...> - 2020-06-04 11:09:42
|
Hi,

Thanks again. Switching to PhantomJS had occurred to me and I had started to look at it as an option. It seems clear that the core content of the pages I am trying to access is generated by a script (or scripts) that crashes HTMLUnit. I have a feeling this might be intentional.

It's a shame Rhino is so tightly bound into HTMLUnit. I wonder if the "couplings" could be loosened sufficiently to make the script engine a parameter, in much the same way that the browser is a parameter through BrowserVersion.

Regards,
Damon Goodyear

CONFIDENTIALITY and NON-DISCLOSURE OF EMAIL ADDRESS: This email, including its content and the address of the sender, are provided for the use of the recipient only and for the purposes of the subject matter under discussion. Notwithstanding any other consent that may have been given neither the content of this email nor the address of the sender may be disclosed to third parties, including within the same undertaking, without the prior written permission of the sender. Where this email has been sent to a general or non-personal email address permission is granted for a reply to be made, on the subject matter only, from the undertaking to whom it is addressed.

________________________________
From: Vasudevan Comandur <vco...@gm...>
Sent: 04 June 2020 10:40
To: htm...@li... <htm...@li...>
Subject: Re: [Htmlunit-user] Exception thrown by webClient.getPage

Hi Damon,

HTMLUnit is tightly coupled with the Rhino JS engine. Changing to a different engine is a tedious task. Selenium uses an HTMLUnit driver; see if you can use PhantomJS there instead. As an alternative, you can switch to PhantomJS in place of HTMLUnit.

Regards
Vasu

On Thu, 4 Jun 2020 at 13:38, Damon Goodyear <dam...@ho...> wrote:

Hi,

Thanks Vasudevan for your reply. In fact I did try that too without success - I just get the top and bottom of the page and not the numbers (the important bit) in between.
While getting ready this morning I did wonder if changing the JavaScript engine might work. If I understand correctly, HTMLUnit uses Rhino. Presumably there are other engines available - is it possible to change to a different engine, or would that involve a code change?

Thanks,
Damon Goodyear

________________________________
From: Vasudevan Comandur <vco...@gm...>
Sent: 04 June 2020 08:33
To: htm...@li... <htm...@li...>
Subject: Re: [Htmlunit-user] Exception thrown by webClient.getPage

Hi,

See if you can extract the necessary data from the HTML response by disabling JavaScript in WebClient.

Regards
Vasu

On Thu, 4 Jun 2020 at 02:35, Damon Goodyear <dam...@ho...> wrote:

Hi,

I am new to HTMLUnit and (re)new to seeking help from mailing lists like this - the last time I tried was about 15 years ago and things seem to have moved on. I hope my question is useful to you and I hope, even more, that your answer is useful to me. I hope I am not committing the newbie sin of asking about a well-known issue/non-issue.

I have encountered a problem from the beginning with HTMLUnit.
I have been trying to use HTMLUnit to download information from the following URLs:

https://www.londonstockexchange.com/stock/OPM/1pm-plc/fundamentals
https://www.londonstockexchange.com/live-markets/market-data-dashboard/price-explorer?page=1
https://www.londonstockexchange.com/stock/OPM/1pm-plc/company-page

These are listed in order of importance to me. The reason for this is that all this information has changed format in the last week or so and has become much harder to unpick.
If I try the following code:

    WebClient wc = new WebClient(BrowserVersion.CHROME);
    // LogFactory.getFactory().setAttribute("org.apache.commons.logging.Log", "org.apache.commons.logging.impl.NoOpLog");
    // java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(Level.OFF);
    // java.util.logging.Logger.getLogger("org.apache.commons.httpclient").setLevel(Level.OFF);
    try {
        HtmlPage page = wc.getPage("https://www.londonstockexchange.com/stock/OPM/1pm-plc/fundamentals");
        String s = page.asText();
        System.out.print(s);
        wc.close();
    } catch etc...

... then I get a bucket load of JavaScript warnings that I can suppress by uncommenting the commented lines above, followed by an exception I do not understand and cannot find any help on, here or on the wider internet.
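For reference, the java.util.logging part of the suppression described above can be tried on its own. The following is a minimal sketch, not the poster's actual code: the class name QuietHtmlUnitLogs and the method silence() are made up for illustration, and the two logger names are simply copied from the commented-out lines. It only has an effect if the warnings are actually routed through java.util.logging, and the LogFactory line is left out because it needs commons-logging on the classpath.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper class; the name is an assumption made for this sketch.
public class QuietHtmlUnitLogs {
    // Keep strong references: java.util.logging holds loggers weakly, so an
    // unreferenced logger (and the level set on it) may be garbage collected.
    private static final Logger HTMLUNIT =
            Logger.getLogger("com.gargoylesoftware");
    private static final Logger HTTPCLIENT =
            Logger.getLogger("org.apache.commons.httpclient");

    /** Turns off the two logger namespaces named in the commented-out lines. */
    public static void silence() {
        HTMLUNIT.setLevel(Level.OFF);
        HTTPCLIENT.setLevel(Level.OFF);
    }

    public static void main(String[] args) {
        silence();
        // Print the resulting level so the effect can be checked by eye.
        System.out.println(HTMLUNIT.getName() + " -> " + HTMLUNIT.getLevel());
    }
}
```

Calling QuietHtmlUnitLogs.silence() before the WebClient is created should be enough for this part. Whether the second logger name matches the HTTP client used by a given HTMLUnit version is not verified here.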
The exception starts with the lines:

    EcmaError: lineNumber=[1] column=[0] lineSource=[<no source>] name=[ReferenceError] sourceName=[https://www.londonstockexchange.com:443/polyfills-es5.463681aba2540d60831f.js#1(Function)] message=[ReferenceError: Assignment to undefined "regeneratorRuntime" in strict mode (https://www.londonstockexchange.com:443/polyfills-es5.463681aba2540d60831f.js#1(Function)#1)]
    com.gargoylesoftware.htmlunit.ScriptException: ReferenceError: Assignment to undefined "regeneratorRuntime" in strict mode (https://www.londonstockexchange.com:443/polyfills-es5.463681aba2540d60831f.js#1(Function)#1)
        at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:891)

I can send the full, very long stack trace if that would help. Are you able to assist, either by fixing the bug (if it is one) or by advising me how to get around this?
Thanks,
Damon Goodyear

_______________________________________________
Htmlunit-user mailing list
Htm...@li...
https://lists.sourceforge.net/lists/listinfo/htmlunit-user
|