From: Chris W. <kit...@gm...> - 2022-03-15 19:49:13
|
I developed an interface to the Bristol City Planning portal which relies on page-scraping and PDF parsing: https://bristoltrees.space/Planning/map

The latest version of the software (by iDox) has made the change that so many sites have made: generating pages in JavaScript. This would require first rendering each page and then scraping the result. Any thoughts on doing this in eXist-db?

And no, there is no public API that I can find.

Chris |
From: Chris W. <kit...@gm...> - 2022-03-15 23:14:21
|
Thanks Roy - yes, that's just the ticket. Actually I was wondering if one could set up an HTTP service based on puppeteer which took a URL, rendered the page, and returned the full page HTML, so that the current scraper would be unchanged. Surprised there isn't a service already to do just that - perhaps there is.

Kevin, it's just the common problem nowadays of scraping the JavaScript-generated web. XQuery is great for scraping HTML pages, but JavaScript-generated pages have to be rendered by a browser engine before we get the displayed HTML. I wondered if anyone had wrapped a browser engine for use in eXist-db.

Perhaps mentioning PDFs was confusing - I parse PDF documents using the Content Extraction Module, which wraps Apache Tika (it does a good job, although it sometimes misses spaces, which is a problem for text analysis), and this is unchanged.

Chris |
From: Kevin B. <kev...@xp...> - 2022-03-15 23:25:25
|
See my system @cloudformatter. But it is "in browser" … it scrapes the HTML, converts that HTML to XSL-FO, and uses RenderX to format the page, post-JavaScript: www.cloudformatter.com/css2pdf

We have many installations of this sitting on top of eXist.

Kevin |
From: Roy W. <gar...@ya...> - 2022-03-16 11:38:44
|
Hi Chris,

You can do that. (We do.) Configure a web server in nodejs and send a GET request from eXist with the URL as a parameter. Then do the scraping in puppeteer/Chromium. You can either process the returned HTML/XML payload directly or PUT the file to eXist from nodejs.

R. |
From: Kevin B. <kev...@xp...> - 2022-03-16 18:11:45
|
The problem with that is that the CSS is not resolved. Hence I pointed you to @cloudformatter. You can examine the JS in those pages that actually resolves the CSS and yields the full HTML.

Kevin

Sent from my Verizon, Samsung Galaxy smartphone |
From: Chris W. <kit...@gm...> - 2022-03-16 18:40:27
|
Thanks guys for the help. I fancy leaving the page scraping in eXist and simply returning the full HTML as a response from puppeteer - though I expect there may be issues with cookies.

Kevin, a page scraper doesn't need CSS to be rendered; it's only looking at the HTML, not the screen image. Puppeteer does render CSS to form an image anyway, doesn't it? I thought its main purpose was to automate unit testing by generating screenshots, possibly with different screen/browser configurations.

Chris |
From: Roy W. <gar...@ya...> - 2022-03-16 19:46:28
|
Hi Chris,

That's what I meant. Just grab (scrape!) the page HTML in nodejs and pass it back to eXist for processing. You shouldn't have any cookie issues if you configure puppeteer the right way - see puppeteer-extra. Just let your existing scraper in eXist do its thing with what you get back from nodejs.

R. |
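[Editor's note] puppeteer-extra, which Roy mentions, is a plugin wrapper around puppeteer. A rough sketch of the kind of configuration he means might look like the following; the stealth plugin and the cookie-header helper are illustrative choices on my part, not something prescribed in the thread:

```javascript
// Launch a browser through puppeteer-extra with the stealth plugin enabled,
// so sites that fingerprint headless Chromium are less likely to object.
// The packages are required lazily so this file can be read and tested
// without puppeteer-extra installed.
function launchStealthBrowser() {
  const puppeteer = require('puppeteer-extra');
  const StealthPlugin = require('puppeteer-extra-plugin-stealth');
  puppeteer.use(StealthPlugin());
  return puppeteer.launch({ headless: true });
}

// Pure helper: flatten the array returned by puppeteer's page.cookies()
// into a single Cookie header value, e.g. for replaying the same session
// in a later plain HTTP request.
function cookieHeader(cookies) {
  return cookies.map((c) => `${c.name}=${c.value}`).join('; ');
}
```

Reusing one launched browser across requests (rather than launching per page) is also worth considering for a service like this, since Chromium start-up dominates the cost of a single render.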
From: Kevin B. <kev...@xp...> - 2022-03-17 00:59:25
|
Yes, my bad. I thought (well, because I am part owner of RenderX) that you were looking at rendering the page, not just scraping it. … When your favorite tool is a hammer, everything looks like a nail …

Kevin |
From: Roy W. <gar...@ya...> - 2022-03-15 23:15:48
|
Can't be done.

nodejs/puppeteer/Chromium is the way.

-- Roy |
From: Michael W. <wes...@ja...> - 2022-03-16 00:35:57
|
I second this approach. Here is the relevant code needed to do it from the eXist side:

    ...
    declare variable $pages:GET-PUPPETEER-PAGE := 'get-puppet-page.js';
    declare variable $pages:NODE-BIN := '/usr/local/bin/node';
    declare variable $pages:NODE-OPTIONS :=
        <options>
            <workingDir>/home/my-user/bin</workingDir>
            <environment>
                <env name="CHROME_DEVEL_SANDBOX" value="/usr/local/sbin/chrome-devel-sandbox"/>
            </environment>
        </options>;

    declare variable $pages:STRIP-LINES := util:function(xs:QName('local:strip-lines'), 1);

    declare function local:strip-lines($text as xs:string) as xs:string {
        replace(replace(replace($text, '<line>', ''), '</line>', ' '), '<line/>', ' ')
    };
    ...

    declare function pages:puppet-page($url) as xs:string? {
        let $cmd := ($pages:NODE-BIN, $pages:GET-PUPPETEER-PAGE, $url)
        let $raw-page := process:execute($cmd, $pages:NODE-OPTIONS)
        return
            if (exists($raw-page))
            then util:call($pages:STRIP-LINES, $raw-page)
            else ()
    };

The STRIP-LINES stuff is there because process:execute returns each line of the result wrapped as <LINE>contents</LINE>, which needs to be stripped out before XHTML-izing the rest. This was originally written for eXist 2.x, so I didn't use pipes in the filter. If I were to rewrite it now, it would probably be like:

    declare function local:strip-lines($text as xs:string) as xs:string {
        $text
        => replace('<line>', '')
        => replace('</line>', '
')
    };

Notice that the CR is within the replacement quotes in the second replace. It helps a lot for readability of the results.

I also have a whole slew of other checks in my version, along with caching of the results, that are not included here. Hope this helps get you started.

Take care.

-- Michael Westbay
Writer/System Administrator
http://www.japanesebaseball.com/ |
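[Editor's note] The node script that Michael's XQuery invokes (get-puppet-page.js) isn't shown in the thread. A minimal reconstruction, under the assumptions that it takes the URL as its first command-line argument and prints the rendered HTML to stdout for process:execute to capture, might look like this:

```javascript
// get-puppet-page.js (reconstructed sketch, not Michael's actual script):
// render one URL in headless Chromium and print the resulting HTML.
// Invoked from XQuery as: node get-puppet-page.js <url>

// Small illustrative guard so a bare host name still parses as a URL.
function ensureScheme(arg) {
  return /^https?:\/\//.test(arg) ? arg : `https://${arg}`;
}

async function main(target) {
  const puppeteer = require('puppeteer'); // assumes puppeteer is installed
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(ensureScheme(target), { waitUntil: 'networkidle0' });
    // process:execute on the eXist side captures this stdout line by line
    process.stdout.write(await page.content());
  } finally {
    await browser.close();
  }
}

if (process.argv[2]) {
  main(process.argv[2]).catch((err) => {
    console.error(err);
    process.exitCode = 1;
  });
} else {
  console.error('usage: node get-puppet-page.js <url>');
}
```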
From: Kevin B. <kev...@xp...> - 2022-03-16 00:31:48
|
What is it that you are looking for? Rendering content to PDF? Scraping content from PDF? Both? I think you need to explain a little more to help us understand what you are looking for exactly … I for one am confused.

Kevin Brown
RenderX |