From: Ronald B. <rb...@rb...> - 2020-03-22 13:08:21
Hi Oscar,

I will try a simple answer: if you want to automate a web page, it helps to
know as much as possible about all the strange technologies that work
together to bring the content to your screen, at least

 * html
 * css
 * javascript
 * xml

I usually use the developer tools from Firefox (or Chrome, if you like to
sponsor this company ;-) to analyze the page a bit. You can start with a
right click on the content you are interested in and select "Inspect Element".
The inspector shows the page as it is after the JavaScript has run, which is
why you can find ids there that are not in the raw HTML source.

Then you have to decide on a way to locate the element. The different options
are introduced at
http://htmlunit.sourceforge.net/gettingStarted.html
under the "Finding a specific element" topic.

Hope that helps
RBRi


On Sat, 21 Mar 2020 01:27:07 -0500 Oscar Bastidas wrote:
>
>Hi Ronald,
>
>I had one other quick question if it's not too much trouble: would you
>please tell me how you knew to search for "#Canonical-SMILES" in the code
>you sent me? Since the word "SMILES" is nowhere to be found in the source
>HTML, I was curious as to how you knew to search specifically for
>"#Canonical-SMILES"
>in the actual Java code (knowing this would help me scrape for other
>strings on the dynamic webpage).
>
>Lastly, are there any resources you specifically recommend or liked for
>learning how to do the kind of HTMLUnit webscraping you've helped me with
>here? If it's just reading general tutorials online, that's ok, it's where
>I'm starting now. Thanks again.
>
>Oscar B.
>
>On Sun, Mar 15, 2020, 9:22 AM Ronald Brill <rb...@rb...> wrote:
>
>> Hi Oscar,
>>
>> this code works for me
>>
>>     public static void main(String[] args) throws Exception {
>>         String uri = "https://pubchem.ncbi.nlm.nih.gov/compound/1868";
>>
>>         try (final WebClient webClient = new WebClient(BrowserVersion.FIREFOX)) {
>>             // do not stop on js errors
>>             webClient.getOptions().setThrowExceptionOnScriptError(false);
>>             // do not log js errors (usually using this is a bad idea, at least
>>             // if you are hunting for problems).
>>             webClient.setJavaScriptErrorListener(new SilentJavaScriptErrorListener());
>>
>>             HtmlPage page = webClient.getPage(uri);
>>             webClient.waitForBackgroundJavaScriptStartingBefore(10_000);
>>
>>             final DomNodeList<DomNode> divs =
>>                 page.querySelectorAll("#Canonical-SMILES .section-content .section-content-item p");
>>             for (DomNode div : divs) {
>>                 System.out.println("----------------");
>>                 System.out.println(div.asXml());
>>                 System.out.println("----------------");
>>                 System.out.println(div.asText());
>>                 System.out.println("----------------");
>>             }
>>         }
>>     }
>>
>> RBRi
>>
>>
>> On Tue, 10 Mar 2020 12:54:31 -0500 Oscar Bastidas wrote:
>> >
>> >Hello,
>> >
>> >I am trying to make a copy of/obtain a string that appears on a webpage
>> >when the webpage loads on my browser but when I look at the HTML code of
>> >the webpage in question, I do not see the string at all (it is nowhere to
>> >be found in the HTML code).
>> >
>> >Here is the URL:
>> >https://pubchem.ncbi.nlm.nih.gov/compound/1868
>> >
>> >and here is my target string:
>> >COC1=CC2=C(C=C1)NC3=C2CCNC3
>> >
>> >The above target string is found under the heading of "2.1.4 Canonical
>> >SMILES" (this heading doesn't appear either in the HTML code).
>> >
>> >Could someone please tell me if this is a special case that cannot be
>> >scraped? Thanks.
>> >
>> >Oscar B.
>> >
>> >_______________________________________________
>> >Htmlunit-user mailing list
>> >Htm...@li...
>> >https://lists.sourceforge.net/lists/listinfo/htmlunit-user
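
To illustrate the "decide on a way to locate the element" step from the reply at the top of this thread, here is a minimal sketch that uses only the JDK's built-in XPath engine, not HtmlUnit itself, so it runs without any extra library. The markup in `SNIPPET` is a hypothetical, simplified stand-in for what "Inspect Element" might show once the page's JavaScript has run; the real PubChem DOM is more complex and may differ. The class name `LocateElementSketch` and the helper `findText` are made up for this example.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class LocateElementSketch {

    // Hypothetical, well-formed snippet imitating the rendered DOM seen in the
    // browser inspector. The id and class names mirror the selector used in
    // the quoted code; the rest of the structure is assumed.
    static final String SNIPPET =
          "<section id='Canonical-SMILES'>"
        + "<h4>Canonical SMILES</h4>"
        + "<div class='section-content'>"
        + "<div class='section-content-item'>"
        + "<p>COC1=CC2=C(C=C1)NC3=C2CCNC3</p>"
        + "</div>"
        + "</div>"
        + "</section>";

    // Returns the text of the first node matching the XPath expression,
    // or null if nothing matches.
    static String findText(String xml, String xpathExpression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            Node node = (Node) xpath.evaluate(
                    xpathExpression,
                    new InputSource(new StringReader(xml)),
                    XPathConstants.NODE);
            return node == null ? null : node.getTextContent();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath or input: " + xpathExpression, e);
        }
    }

    public static void main(String[] args) {
        // Start from the element's id (found via "Inspect Element"), then
        // walk down to the paragraph that holds the value. Note that
        // @class='...' compares the whole attribute string, which is stricter
        // than a CSS ".class" selector.
        String byIdAndClass =
            "//*[@id='Canonical-SMILES']//div[@class='section-content-item']/p";
        System.out.println(findText(SNIPPET, byIdAndClass));

        // A looser variant: any <p> anywhere below the element with that id.
        System.out.println(findText(SNIPPET, "//*[@id='Canonical-SMILES']//p"));
    }
}
```

The same idea carries over to HtmlUnit: the id read from the inspector becomes the anchor (`#Canonical-SMILES` in the CSS selector of the quoted code), and the rest of the expression narrows the search to the element that holds the text.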