From: Kevin B. <kev...@xp...> - 2022-03-16 18:11:45
The problem with that is that the CSS is not resolved. Hence I pointed you to the @cloudformatter. You can examine the JS in those pages that actually resolves the CSS and yields full HTML.

Kevin

-------- Original message --------
From: Roy Walter <gar...@ya...>
Date: 3/16/22 4:38 AM (GMT-08:00)
To: kev...@xp..., Chris Wallace <kit...@gm...>
Cc: exist-open <exi...@li...>
Subject: Re: [Exist-open] Page rendering

Hi Chris,

You can do that. (We do.) Configure a web server in nodejs and send a GET request from eXist with the URL as a parameter. Then do the scraping in puppeteer/Chromium. You can either process the returned html/xml payload directly or PUT the file to eXist from nodejs.

R.

On Tuesday, 15 March 2022, 23:12:38 GMT, Chris Wallace <kit...@gm...> wrote:

Thanks Roy - yes, that's just the ticket. Actually I was wondering if one could set up an HTTP service based on puppeteer which took a URL, rendered the page and returned the full page HTML, so that the current scraper would be unchanged. Surprised there isn't a service already to do just that - perhaps there is.

Kevin, it's just the now-common problem of scraping JavaScript-generated pages. XQuery is great for scraping HTML pages, but JavaScript-generated pages have to be rendered by a browser engine before we get the displayed HTML. I wondered if anyone had wrapped a browser engine for use in eXist-db.

Perhaps mentioning PDFs was confusing - I parse PDF documents using the Content Extraction Module, which wraps Apache Tika (it does a good job, although it does miss spaces sometimes, which is a problem for text analysis), and this is unchanged.

Chris

On Tue, Mar 15, 2022 at 8:23 PM Roy Walter <gar...@ya...> wrote:

Can't be done. nodejs/puppeteer/Chromium is the way.

-- Roy

On Tuesday, 15 March 2022, 19:49:28 GMT, Chris Wallace <kit...@gm...> wrote:

I developed an interface to the Bristol City Planning portal which relies on page-scraping and PDF parsing:

https://bristoltrees.space/Planning/map

The latest version of the software (by iDox) has made the change that so many sites have made, generating pages in JavaScript. This would require first rendering each page and scraping the result. Any thoughts on doing this in eXist-db?

And no, there is no public API that I can find.

Chris
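
For readers following the thread, the service Chris and Roy describe (an HTTP endpoint that takes a URL, renders the page in Chromium and returns the resulting HTML) might look roughly like the TypeScript sketch below. It is not code from anyone in the thread. The `url` query parameter name, the function names and the `networkidle0` wait condition are all assumptions made for illustration, and puppeteer has to be installed separately.

```typescript
import * as http from "node:http";

// A renderer takes a page URL and resolves to the HTML after scripts have run.
type Renderer = (pageUrl: string) => Promise<string>;

// Pulls the target out of a hypothetical "?url=" query parameter and accepts
// only absolute http(s) URLs; anything else yields null.
function targetFromRequest(requestUrl: string): string | null {
  const raw = new URL(requestUrl, "http://localhost").searchParams.get("url");
  if (raw === null) return null;
  try {
    const parsed = new URL(raw);
    const allowed = parsed.protocol === "http:" || parsed.protocol === "https:";
    return allowed ? parsed.href : null;
  } catch {
    return null; // not a parseable absolute URL
  }
}

// Renders a page with puppeteer. The module is loaded lazily through a
// variable so this file still loads on machines where puppeteer is absent.
async function renderWithPuppeteer(pageUrl: string): Promise<string> {
  const moduleName = "puppeteer";
  const mod = await import(moduleName);
  const puppeteer = mod.default ?? mod;
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // Assumption: waiting for the network to go idle is "rendered enough".
    await page.goto(pageUrl, { waitUntil: "networkidle0" });
    return await page.content(); // serialized DOM after JavaScript has run
  } finally {
    await browser.close();
  }
}

// Builds the HTTP server; the renderer is injected so it can be swapped out.
function createRenderServer(render: Renderer): http.Server {
  return http.createServer(async (req, res) => {
    const target = targetFromRequest(req.url ?? "/");
    if (target === null) {
      res.writeHead(400, { "Content-Type": "text/plain; charset=utf-8" });
      res.end("Expected ?url=<absolute http(s) URL>");
      return;
    }
    try {
      const html = await render(target);
      res.writeHead(200, { "Content-Type": "text/html; charset=utf-8" });
      res.end(html);
    } catch (err) {
      res.writeHead(502, { "Content-Type": "text/plain; charset=utf-8" });
      res.end(`Rendering failed: ${String(err)}`);
    }
  });
}
```

Starting it would be a matter of `createRenderServer(renderWithPuppeteer).listen(3000)` (the port is arbitrary). An existing scraper could then request `http://localhost:3000/?url=` followed by the percent-encoded page address instead of fetching the page directly, and parse the returned HTML as before. Launching a browser per request keeps the sketch simple; a long-running service would more likely reuse a single browser instance.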