From: Chris W. <kit...@gm...> - 2022-03-15 23:14:21
Thanks Roy - yes, that's just the ticket. Actually I was wondering whether one could set up an HTTP service based on Puppeteer which took a URL, rendered the page, and returned the full-page HTML, so that the current scraper would be unchanged. I'm surprised there isn't already a service that does just that - perhaps there is.

Kevin, it's just a case of the now-common problem of scraping the JavaScript-generated web. XQuery is great for scraping HTML pages, but JavaScript-generated pages have to be rendered by a browser engine before we get the displayed HTML. I wondered if anyone had wrapped a browser engine for use in eXist-db.

Perhaps mentioning PDFs was confusing - I parse PDF documents using the Content Extraction Module, which wraps Apache Tika (it does a good job, although it sometimes misses spaces, which is a problem for text analysis), and that is unchanged.

Chris

On Tue, Mar 15, 2022 at 8:23 PM Roy Walter <gar...@ya...> wrote:

> Can't be done.
>
> nodejs/puppeteer/Chromium is the way.
>
> -- Roy
>
> On Tuesday, 15 March 2022, 19:49:28 GMT, Chris Wallace <kit...@gm...> wrote:
>
> I developed an interface to the Bristol City Planning portal which relies
> on page-scraping and PDF parsing:
>
> https://bristoltrees.space/Planning/map
>
> The latest version of the software (by iDox) has made the change that so
> many sites have done, generating pages in JavaScript. This would require
> first rendering each page and scraping the result. Any thoughts on doing
> this in eXist-db?
>
> And no, there is no public API that I can find.
>
> Chris
> _______________________________________________
> Exist-open mailing list
> Exi...@li...
> https://lists.sourceforge.net/lists/listinfo/exist-open