From: Kevin B. <kev...@xp...> - 2022-03-15 23:25:25
See my system, @cloudformatter. It runs "in browser", but it does what you describe: it scrapes the HTML after the JavaScript has run, converts that HTML to XSL-FO and uses RenderX to format the page.

http://www.cloudformatter.com/css2pdf

We have many installations of this sitting on top of eXist.

Kevin

From: Chris Wallace <kit...@gm...>
Sent: Tuesday, March 15, 2022 4:14 PM
To: Roy Walter <gar...@ya...>; kev...@xp...
Cc: exist-open <exi...@li...>
Subject: Re: [Exist-open] Page rendering

Thanks Roy, yes, that's just the ticket. Actually I was wondering if one could set up an HTTP service based on puppeteer which took a URL, rendered the page and returned the full page HTML, so that the current scraper would be unchanged. Surprised there isn't a service already to do just that; perhaps there is.

Kevin, it's just a case of the common problem of scraping the JavaScript-generated web we see these days. XQuery is great for scraping HTML pages, but JavaScript-generated pages have to be rendered by a browser engine before we get the displayed HTML. I wondered if anyone had wrapped a browser engine for use in eXist-db.

Perhaps mentioning PDFs was confusing. I parse PDF documents using the Content Extraction Module, which wraps Apache Tika (it does a good job, although it does miss spaces sometimes, which is a problem for text analysis), and this is unchanged.

Chris

On Tue, Mar 15, 2022 at 8:23 PM Roy Walter <gar...@ya...> wrote:

Can't be done. nodejs/puppeteer/Chromium is the way.

-- Roy

On Tuesday, 15 March 2022, 19:49:28 GMT, Chris Wallace <kit...@gm...> wrote:

I developed an interface to the Bristol City Planning portal which relies on page-scraping and PDF parsing: https://bristoltrees.space/Planning/map

The latest version of the software (by iDox) has made the change that so many sites have made, generating pages in JavaScript. This would require first rendering each page and scraping the result.

Any thoughts on doing this in eXist-db? And no, there is no public API that I can find.

Chris

_______________________________________________
Exist-open mailing list
Exi...@li...
https://lists.sourceforge.net/lists/listinfo/exist-open
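
Editorial note: the service Chris describes (take a URL, render the page in a browser engine, return the resulting HTML so the existing scraper stays unchanged) could look roughly like the sketch below. Nobody in the thread posted this. Instead of puppeteer, which is a Node.js library, the sketch shells out to headless Chromium using its --dump-dom flag. The binary name `chromium`, the port 8765, the `url` query parameter and all function names are assumptions made for illustration.

```python
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import List, Optional
from urllib.parse import parse_qs, urlparse

# Assumption: a Chromium/Chrome binary is on PATH under this name.
CHROME_BINARY = "chromium"


def target_url(request_path: str) -> Optional[str]:
    """Return the single http(s) URL given as ?url=..., or None if invalid."""
    candidates = parse_qs(urlparse(request_path).query).get("url", [])
    if len(candidates) != 1:
        return None
    parsed = urlparse(candidates[0])
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return None  # refuse file://, javascript:, relative paths, etc.
    return candidates[0]


def chrome_command(url: str) -> List[str]:
    """Build the headless command that prints the rendered DOM to stdout."""
    return [CHROME_BINARY, "--headless", "--disable-gpu", "--dump-dom", url]


def render(url: str, timeout: float = 60.0) -> str:
    """Run the browser and return the serialized DOM as a string."""
    result = subprocess.run(
        chrome_command(url),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    return result.stdout


class RenderHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        url = target_url(self.path)
        if url is None:
            self.send_error(400, "expected ?url=<http(s) URL>")
            return
        try:
            html = render(url)
        except (OSError, subprocess.SubprocessError):
            # Browser missing, timed out, or exited with an error.
            self.send_error(502, "render failed")
            return
        body = html.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    # Listen on localhost only; the scraper would call this instead of the site.
    HTTPServer(("127.0.0.1", 8765), RenderHandler).serve_forever()
```

With something like this running, the existing scraper would request http://127.0.0.1:8765/?url=<encoded page URL> instead of fetching the page directly, and parse the response as before. One limitation to keep in mind: --dump-dom serializes the DOM once the page has loaded, so pages that fetch their data later may come back incomplete. That is a case where puppeteer's explicit waits, as Roy suggests, would likely be the better fit.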