From: Chris W. <kit...@gm...> - 2022-03-16 18:40:27
Thanks guys for the help. I fancy leaving the page scraping in eXist and simply returning the full HTML as a response from Puppeteer - there may be issues with cookies, I expect. Kevin, a page scraper doesn't need CSS to be rendered; it's only looking at the HTML, not the screen image. Puppeteer does render CSS to form an image anyway, doesn't it? I thought its main purpose was to automate unit testing by generating screenshots, possibly with different screen/browser configurations?

Chris

On Wed, Mar 16, 2022 at 6:11 PM Kevin Brown <kev...@xp...> wrote:
> The problem with that is that the CSS is not resolved. Hence I pointed you
> to the @cloudformatter. You can examine the JS in those pages that
> actually resolves the CSS and yields full HTML.
>
> Kevin
>
> Sent from my Verizon, Samsung Galaxy smartphone
>
> -------- Original message --------
> From: Roy Walter <gar...@ya...>
> Date: 3/16/22 4:38 AM (GMT-08:00)
> To: kev...@xp..., Chris Wallace <kit...@gm...>
> Cc: exist-open <exi...@li...>
> Subject: Re: [Exist-open] Page rendering
>
> Hi Chris,
>
> You can do that. (We do.)
>
> Configure a web server in nodejs and send a GET request from eXist with
> the URL as a parameter. Then do the scraping in puppeteer/Chromium. You can
> either process the returned html/xml payload directly or PUT the file to
> eXist from nodejs.
>
> R.
>
> On Tuesday, 15 March 2022, 23:12:38 GMT, Chris Wallace <kit...@gm...> wrote:
>
> Thanks Roy - yes, that's just the ticket. Actually I was wondering if one
> could set up an HTTP service based on Puppeteer which took a URL, rendered
> the page and returned the full page HTML, so that the current scraper would
> be unchanged. Surprised there isn't a service already to do just that -
> perhaps there is.
>
> Kevin, it's just a case of the common problem of scraping the
> JavaScript-generated web we see these days.
> XQuery is great for scraping
> HTML pages, but JavaScript-generated pages have to be rendered by a browser
> engine before we get the displayed HTML. I wondered if anyone had wrapped
> a browser engine for use in eXist-db. Perhaps mentioning PDFs was
> confusing - I parse PDF documents using the Content Extraction Module, which
> wraps Apache Tika (it does a good job, although it does miss spaces sometimes,
> which is a problem for text analysis), and this is unchanged.
>
> Chris
>
> On Tue, Mar 15, 2022 at 8:23 PM Roy Walter <gar...@ya...> wrote:
>
> Can't be done.
>
> nodejs/puppeteer/Chromium is the way.
>
> -- Roy
>
> On Tuesday, 15 March 2022, 19:49:28 GMT, Chris Wallace <kit...@gm...> wrote:
>
> I developed an interface to the Bristol City Planning portal which relies
> on page-scraping and PDF parsing:
>
> https://bristoltrees.space/Planning/map
>
> The latest version of the software (by iDox) has made the change that so
> many sites have made: generating pages in JavaScript. This would require
> first rendering each page and scraping the result. Any thoughts on doing
> this in eXist-db?
>
> And no, there is no public API that I can find.
>
> Chris
> _______________________________________________
> Exist-open mailing list
> Exi...@li...
> https://lists.sourceforge.net/lists/listinfo/exist-open