|
From: Oscar B. <oba...@um...> - 2020-03-21 06:29:28
|
Hello,

I have been trying to build a web scraping tool to obtain a string from a dynamically-loaded webpage.

My objective was to obtain the string COC1=CC2=C(C=C1)NC3=C2CCNC3 from the section titled "Canonical SMILES" from the following website:
https://pubchem.ncbi.nlm.nih.gov/compound/1868

Previously, I had working HTMLUnit code to access the above (thanks Ronald), but now the code is not working! Whereas before I would get printouts to the screen of my target information (COC1=CC2=C(C=C1)NC3=C2CCNC3 from the "Canonical Smiles" section of the above website - this information is displayed dynamically on the website), now instead, HTMLUnit returns the following error:

*SEVERE: ReferenceError: "fetch" is not defined*

Looking this error up, it seems this "fetch" has something to do with requests and responses across the network when accessing the webpage. In short, it seems like it's something implemented on the end of the owner of the web page, not something I can inadvertently modify, and viewing the webpage on a normal browser, the website looks fine, just as it always has in the past.

Would someone please tell me what is going on here? The code was working perfectly one minute, but then yielded the above fetch error the next.

Here is the main method of my previously functional code:

public static void main(String[] args) throws Exception {
    String uri = "https://pubchem.ncbi.nlm.nih.gov/compound/1868";

    try (final WebClient webClient = new WebClient(BrowserVersion.FIREFOX)) {
        // do not stop on js errors
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        // do not log js errors (usually using this is a bad idea, at least
        // if you are hunting for problems).
        webClient.setJavaScriptErrorListener(new SilentJavaScriptErrorListener());

        HtmlPage page = webClient.getPage(uri);
        webClient.waitForBackgroundJavaScriptStartingBefore(10_000);

        final DomNodeList<DomNode> divs =
                page.querySelectorAll("#Canonical-SMILES .section-content .section-content-item p");
        for (DomNode div : divs) {
            System.out.println("----------------");
            System.out.println(div.asXml());
            System.out.println("----------------");
            System.out.println(div.asText());
            System.out.println("----------------");
        }
    }
}

Oscar B.
|
|
From: Ronald B. <rb...@rb...> - 2020-03-22 12:55:09
|
Hi Oscar,
that is not so uncommon. From time to time providers change their web pages by switching to another JS library or to a newer version of one.
In your case it looks like the page's JS code has changed and now uses the Fetch API. This API is not supported by HtmlUnit at the moment.
With a bit of luck you can ignore this.
Add
webClient.getOptions().setThrowExceptionOnScriptError(false);
directly after the webClient creation.
Hope that helps.
If not, please open an issue on GitHub and I will have a deeper look.
RBRi
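
A small aside on the error itself: when script errors are not thrown and not logged, it can be hard to tell which browser API a page is missing. The sketch below is only an illustration and uses plain Java, not HtmlUnit. The class name `MissingApiDetector` and its method are hypothetical, and it assumes log messages shaped like the one quoted in this thread (`ReferenceError: "fetch" is not defined`); other messages are simply ignored.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HtmlUnit: pulls the undefined JavaScript
// name out of a log message shaped like the one quoted in this thread.
final class MissingApiDetector {

    // Assumed message shape: ReferenceError: "fetch" is not defined
    private static final Pattern NOT_DEFINED =
            Pattern.compile("ReferenceError: \"([^\"]+)\" is not defined");

    private MissingApiDetector() { }

    /** Returns the undefined name if the message reports one, otherwise empty. */
    static Optional<String> missingName(String logMessage) {
        if (logMessage == null) {
            return Optional.empty();
        }
        Matcher m = NOT_DEFINED.matcher(logMessage);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String line = "SEVERE: ReferenceError: \"fetch\" is not defined";
        System.out.println(missingName(line).orElse("(none)"));
    }
}
```

Feeding it the messages collected while hunting for problems gives a quick list of the names a page expects to exist.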
On Sat, 21 Mar 2020 01:27:07 -0500 Oscar Bastidas wrote:
>Hello,
>
>I have been trying to build a web scraping tool to obtain a string from a
>dynamically-loaded webpage.
>
>My objective was to obtain the string COC1=CC2=C(C=C1)NC3=C2CCNC3 from the
>section titled "Canonical SMILES" from the following website:
>https://pubchem.ncbi.nlm.nih.gov/compound/1868
>
>Previously, I had working HTMLUnit code to access the above (thanks
>Ronald), but now the code is not working! Whereas before I would get
>printouts to the screen of my target information (COC1=CC2=C(C=C1)NC3=C2CCNC3
>from the "Canonical Smiles" section of the above website - this information
>is displayed dynamically on the website), now instead, HTMLUnit returns the
>following error:
>
>*SEVERE: ReferenceError: "fetch" is not defined*
>
>Looking this error up, it seems this "fetch" has something to do with
>requests and responses across the network when accessing the webpage. In
>short, it seems like it's something implemented on the end of the owner of
>the web page, not something I can inadvertently modify, and viewing the
>webpage on a normal browser, the website looks fine, just as it always has
>in the past.
>
>Would someone please tell me what is going on here? The code was working
>perfectly one minute, but then yielded the above fetch error the next.
>
>Here is the main method of my previously functional code:
>
>public static void main(String[] args) throws Exception {
>    String uri = "https://pubchem.ncbi.nlm.nih.gov/compound/1868";
>
>    try (final WebClient webClient = new WebClient(BrowserVersion.FIREFOX)) {
>        // do not stop on js errors
>        webClient.getOptions().setThrowExceptionOnScriptError(false);
>        // do not log js errors (usually using this is a bad idea, at least
>        // if you are hunting for problems).
>        webClient.setJavaScriptErrorListener(new SilentJavaScriptErrorListener());
>
>        HtmlPage page = webClient.getPage(uri);
>        webClient.waitForBackgroundJavaScriptStartingBefore(10_000);
>
>        final DomNodeList<DomNode> divs =
>                page.querySelectorAll("#Canonical-SMILES .section-content .section-content-item p");
>        for (DomNode div : divs) {
>            System.out.println("----------------");
>            System.out.println(div.asXml());
>            System.out.println("----------------");
>            System.out.println(div.asText());
>            System.out.println("----------------");
>        }
>    }
>}
>
>Oscar B.
>_______________________________________________
>Htmlunit-user mailing list
>Htm...@li...
>https://lists.sourceforge.net/lists/listinfo/htmlunit-user
>
>
|
|
From: Ronald B. <rb...@rb...> - 2020-03-22 13:08:21
|
Hi Oscar,

I will try a simple answer: if you want to automate a web page, it helps to know as much as possible about all the strange technologies that work together to bring the content to your screen, at least

* html
* css
* javascript
* xml

Usually I use the developer tools from Firefox (or Chrome if you like to sponsor this company ;-) to analyze the page a bit. You can start with a right click on the content you are interested in and select Inspect Element.

Then you have to decide on a way to locate the element. The different options are introduced at http://htmlunit.sourceforge.net/gettingStarted.html under the "Finding a specific element" topic.

Hope that helps
RBRi

On Sat, 21 Mar 2020 01:27:07 -0500 Oscar Bastidas wrote:
>
>Hi Ronald,
>
>I had one other quick question if it's not too much trouble: would you
>please tell me how you knew to search for "#Canonical-SMILES" in the code
>you sent me? Since the word "SMILES" is nowhere to be found in the source
>HTML, I was curious as to how you knew to search specifically for
>"#Canonical-SMILES" in the actual Java code (knowing this would help me
>scrape for other strings on the dynamic webpage).
>
>Lastly, are there any resources you specifically recommend or liked for
>learning how to do the kind of HTMLUnit webscraping you've helped me with
>here? If it's just reading general tutorials online, that's ok, it's where
>I'm starting now. Thanks again.
>
>Oscar B.
>
>On Sun, Mar 15, 2020, 9:22 AM Ronald Brill <rb...@rb...> wrote:
>
>> Hi Oscar,
>>
>> this code works for me
>>
>> public static void main(String[] args) throws Exception {
>>     String uri = "https://pubchem.ncbi.nlm.nih.gov/compound/1868";
>>
>>     try (final WebClient webClient = new WebClient(BrowserVersion.FIREFOX)) {
>>         // do not stop on js errors
>>         webClient.getOptions().setThrowExceptionOnScriptError(false);
>>         // do not log js errors (usually using this is a bad idea, at least
>>         // if you are hunting for problems).
>>         webClient.setJavaScriptErrorListener(new SilentJavaScriptErrorListener());
>>
>>         HtmlPage page = webClient.getPage(uri);
>>         webClient.waitForBackgroundJavaScriptStartingBefore(10_000);
>>
>>         final DomNodeList<DomNode> divs =
>>                 page.querySelectorAll("#Canonical-SMILES .section-content .section-content-item p");
>>         for (DomNode div : divs) {
>>             System.out.println("----------------");
>>             System.out.println(div.asXml());
>>             System.out.println("----------------");
>>             System.out.println(div.asText());
>>             System.out.println("----------------");
>>         }
>>     }
>> }
>>
>> RBRi
>>
>>
>> On Tue, 10 Mar 2020 12:54:31 -0500 Oscar Bastidas wrote:
>> >
>> >Hello,
>> >
>> >I am trying to make a copy of/obtain a string that appears on a webpage
>> >when the webpage loads on my browser but when I look at the HTML code of
>> >the webpage in question, I do not see the string at all (it is no where to
>> >be found in the HTML code).
>> >
>> >Here is the URL:
>> >https://pubchem.ncbi.nlm.nih.gov/compound/1868
>> >
>> >and here is my target string:
>> >COC1=CC2=C(C=C1)NC3=C2CCNC3
>> >
>> >The above target string is found under the heading of "2.1.4 Canonical
>> >SMILES" (this heading doesn't appear either in the HTML code).
>> >
>> >Could someone please tell me if this is a special case that cannot be
>> >scraped? Thanks.
>> >
>> >Oscar B.
|
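
To make the "decide on a way to locate the element" step more concrete, here is a minimal sketch of how the selector used in this thread, `#Canonical-SMILES .section-content .section-content-item p`, reads: an element with the id `Canonical-SMILES`, then nested elements carrying the two class names, then a `p`. The sketch uses only the JDK's XML and XPath classes rather than HtmlUnit, and the markup in `SAMPLE` is made up to mimic that nesting; it is not the real PubChem page. The class name `SelectorSketch` and its members are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration: locate an element by id, class and tag name.
final class SelectorSketch {

    // Made-up, well-formed markup that mimics the nesting the selector implies.
    static final String SAMPLE =
            "<section id=\"Canonical-SMILES\">"
          + "<div class=\"section-content\">"
          + "<div class=\"section-content-item\">"
          + "<p>COC1=CC2=C(C=C1)NC3=C2CCNC3</p>"
          + "</div></div></section>";

    // XPath counterpart of the CSS selector
    // "#Canonical-SMILES .section-content .section-content-item p".
    static final String XPATH = "//*[@id='Canonical-SMILES']"
          + "//*" + hasClass("section-content")
          + "//*" + hasClass("section-content-item")
          + "//p";

    private SelectorSketch() { }

    // XPath predicate that matches a whole class token, not a substring.
    private static String hasClass(String name) {
        return "[contains(concat(' ', normalize-space(@class), ' '), ' " + name + " ')]";
    }

    /** Returns the text of every node in the given markup that XPATH matches. */
    static List<String> select(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(XPATH, doc, XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent());
            }
            return texts;
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate selector", e);
        }
    }

    public static void main(String[] args) {
        for (String text : select(SAMPLE)) {
            System.out.println(text);
        }
    }
}
```

On a live page the id and class names come from what Inspect Element shows in the browser's developer tools, which is why they may not appear in the raw HTML source of a dynamically built page.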