From: Michael W. <wes...@ja...> - 2022-03-16 00:35:57
I second this approach. Here is the relevant code needed to do it from the
eXist side:
...
declare variable $pages:GET-PUPPETEER-PAGE := 'get-puppet-page.js';
declare variable $pages:NODE-BIN := '/usr/local/bin/node';
declare variable $pages:NODE-OPTIONS :=
    <options>
        <workingDir>/home/my-user/bin</workingDir>
        <environment>
            <env name="CHROME_DEVEL_SANDBOX"
                 value="/usr/local/sbin/chrome-devel-sandbox"/>
        </environment>
    </options>;
declare variable $pages:STRIP-LINES :=
util:function(xs:QName('local:strip-lines'), 1);
declare function local:strip-lines($text as xs:string) as xs:string {
replace(replace(replace($text, '<line>', ''), '</line>', '
'), '<line/>', '
')
};
...
declare function pages:puppet-page($url) as xs:string? {
    let $cmd := ($pages:NODE-BIN, $pages:GET-PUPPETEER-PAGE, $url)
    let $raw-page := process:execute($cmd, $pages:NODE-OPTIONS)
    return
        if (exists($raw-page))
        then util:call($pages:STRIP-LINES, $raw-page)
        else ()
};
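The get-puppet-page.js script itself isn't shown above; here is a minimal sketch of what it could look like, assuming Puppeteer is installed via npm (the real script likely has more error handling and options):

```javascript
// get-puppet-page.js -- hypothetical sketch, not the original script.
// Renders the given URL in headless Chromium and prints the final HTML
// to stdout, so JavaScript-generated pages can be scraped from eXist.
const puppeteer = require('puppeteer'); // assumes: npm install puppeteer

(async () => {
  const url = process.argv[2];
  if (!url) {
    console.error('usage: node get-puppet-page.js <url>');
    process.exit(1);
  }
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // Wait for the network to go idle so client-side rendering has finished.
    await page.goto(url, { waitUntil: 'networkidle0' });
    process.stdout.write(await page.content());
  } finally {
    await browser.close();
  }
})();
```

eXist captures whatever the script writes to stdout, line by line, which is what the STRIP-LINES step below deals with.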
The STRIP-LINES stuff is there because process:execute wraps each line of
the command's output in a <line>contents</line> element, which needs to be
stripped out before XHTML-izing the rest. This was originally written for
eXist 2.x, so I didn't use the arrow operator in the filter. If I were to
rewrite it now, it would probably look like:
declare function local:strip-lines($text as xs:string) as xs:string {
    $text
    => replace('<line>', '')
    => replace('</line>', '
')
    => replace('<line/>', '
')
};
Notice that the newline is inside the replacement-string quotes in the
second replace. It helps a lot with the readability of the results.
I also have a whole slew of other checks in my version, along with caching
of the results; none of that is included here.
Hope this helps get you started.
Take care.
On Wed, 16 Mar 2022 at 8:17, Roy Walter via Exist-open <
exi...@li...>:
> Can't be done.
>
> nodejs/puppeteer/Chromium is the way.
>
> -- Roy
>
>
>
> On Tuesday, 15 March 2022, 19:49:28 GMT, Chris Wallace <
> kit...@gm...> wrote:
>
>
> I developed an interface to the Bristol City Planning portal which relies
> on page-scraping and PDF parsing:
>
> https://bristoltrees.space/Planning/map
>
> The latest version of the software (by iDox) has made the change that so
> many sites have done, generating pages in JavaScript. This would require
> first rendering each page and scraping the result. Any thoughts on doing
> this in eXist-db?
>
> And no, there is no public API that I can find
>
> Chris
> _______________________________________________
> Exist-open mailing list
> Exi...@li...
> https://lists.sourceforge.net/lists/listinfo/exist-open
>
--
Michael Westbay
Writer/System Administrator
http://www.japanesebaseball.com/