From: Michael W. <wes...@ja...> - 2022-03-16 00:35:57
I second this approach. Here is the relevant code needed to do it from the
eXist side:
...
declare variable $pages:GET-PUPPETEER-PAGE := 'get-puppet-page.js';
declare variable $pages:NODE-BIN := '/usr/local/bin/node';
declare variable $pages:NODE-OPTIONS :=
    <options>
        <workingDir>/home/my-user/bin</workingDir>
        <environment>
            <env name="CHROME_DEVEL_SANDBOX"
                 value="/usr/local/sbin/chrome-devel-sandbox"/>
        </environment>
    </options>;
declare variable $pages:STRIP-LINES :=
util:function(xs:QName('local:strip-lines'), 1);
declare function local:strip-lines($text as xs:string) as xs:string {
replace(replace(replace($text, '<line>', ''), '</line>', '
'), '<line/>', '
')
};
...
declare function pages:puppet-page($url) as xs:string? {
    let $cmd := ($pages:NODE-BIN, $pages:GET-PUPPETEER-PAGE, $url)
    let $raw-page := process:execute($cmd, $pages:NODE-OPTIONS)
    return
        if (exists($raw-page))
        then util:call($pages:STRIP-LINES, $raw-page)
        else ()
};
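The get-puppet-page.js script itself isn't shown above; here is a minimal sketch of what it could look like, assuming Puppeteer is installed via npm (the real script likely has more error handling and options):

```javascript
// get-puppet-page.js -- hypothetical sketch, not the original script.
// Renders the given URL in headless Chromium and prints the final HTML
// to stdout, so JavaScript-generated pages can be scraped from eXist.
const puppeteer = require('puppeteer'); // assumes: npm install puppeteer

(async () => {
  const url = process.argv[2];
  if (!url) {
    console.error('usage: node get-puppet-page.js <url>');
    process.exit(1);
  }
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // Wait for the network to go idle so client-side rendering has finished.
    await page.goto(url, { waitUntil: 'networkidle0' });
    process.stdout.write(await page.content());
  } finally {
    await browser.close();
  }
})();
```

eXist captures whatever the script writes to stdout, line by line, which is what the STRIP-LINES step below deals with.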
The STRIP-LINES stuff is there because process:execute wraps each line of
the command's output in a <line>contents</line> element, which needs to be
stripped out before XHTML-izing the rest. This was originally written for
eXist 2.x, so I didn't use the arrow operator in the filter. If I were to
rewrite it now, it would probably look like:
declare function local:strip-lines($text as xs:string) as xs:string {
    $text
    => replace('<line>', '')
    => replace('</line>', '
')
    => replace('<line/>', '
')
};
Notice that the newline is inside the replacement-string quotes in the
second replace. It helps a lot with the readability of the results.
I also have a whole slew of other checks in my version, along with caching
of the results; none of that is included here.
Hope this helps get you started.
Take care.
On Wed, 16 Mar 2022 at 8:17, Roy Walter via Exist-open <
exi...@li...>:
> Can't be done.
>
> nodejs/puppeteer/Chromium is the way.
>
> -- Roy
>
>
>
> On Tuesday, 15 March 2022, 19:49:28 GMT, Chris Wallace <
> kit...@gm...> wrote:
>
>
> I developed an interface to the Bristol City Planning portal which relies
> on page-scraping and PDF parsing:
>
> https://bristoltrees.space/Planning/map
>
> The latest version of the software (by iDox) has made the change that so
> many sites have done, generating pages in JavaScript. This would require
> first rendering each page and scraping the result. Any thoughts on doing
> this in eXist-db?
>
> And no, there is no public API that I can find
>
> Chris
> _______________________________________________
> Exist-open mailing list
> Exi...@li...
> https://lists.sourceforge.net/lists/listinfo/exist-open
>
--
Michael Westbay
Writer/System Administrator
http://www.japanesebaseball.com/