From: Chris W. <kit...@gm...> - 2022-03-15 23:14:21
Thanks Roy - yes, that's just the ticket. Actually I was wondering whether one could set up an HTTP service based on Puppeteer which took a URL, rendered the page, and returned the full-page HTML, so that the current scraper would be unchanged. I'm surprised there isn't already a service that does just that - perhaps there is.

Kevin, it's just a case of the now-common problem of scraping the JavaScript-generated web. XQuery is great for scraping HTML pages, but JavaScript-generated pages have to be rendered by a browser engine before we get the displayed HTML. I wondered if anyone had wrapped a browser engine for use in eXist-db.

Perhaps mentioning PDFs was confusing - I parse PDF documents using the Content Extraction Module, which wraps Apache Tika (it does a good job, although it sometimes misses spaces, which is a problem for text analysis), and that is unchanged.

Chris

On Tue, Mar 15, 2022 at 8:23 PM Roy Walter <gar...@ya...> wrote:

> Can't be done.
>
> nodejs/puppeteer/Chromium is the way.
>
> -- Roy
>
> On Tuesday, 15 March 2022, 19:49:28 GMT, Chris Wallace <kit...@gm...> wrote:
>
> I developed an interface to the Bristol City Planning portal which relies
> on page-scraping and PDF parsing:
>
> https://bristoltrees.space/Planning/map
>
> The latest version of the software (by iDox) has made the change that so
> many sites have done, generating pages in JavaScript. This would require
> first rendering each page and scraping the result. Any thoughts on doing
> this in eXist-db?
>
> And no, there is no public API that I can find.
>
> Chris
> _______________________________________________
> Exist-open mailing list
> Exi...@li...
> https://lists.sourceforge.net/lists/listinfo/exist-open