From: Michael W. <wes...@ja...> - 2022-03-16 00:35:57
I second this approach. Here is the relevant code needed to do it from the eXist side:

    ...
    declare variable $pages:GET-PUPPETEER-PAGE := 'get-puppet-page.js';
    declare variable $pages:NODE-BIN := '/usr/local/bin/node';
    declare variable $pages:NODE-OPTIONS :=
        <options>
            <workingDir>/home/my-user/bin</workingDir>
            <environment>
                <env name="CHROME_DEVEL_SANDBOX" value="/usr/local/sbin/chrome-devel-sandbox"/>
            </environment>
        </options>;
    declare variable $pages:STRIP-LINES := util:function(xs:QName('local:strip-lines'), 1);

    declare function local:strip-lines($text as xs:string) as xs:string {
        replace(replace(replace($text, '<line>', ''), '</line>', ' '), '<line/>', ' ')
    };
    ...
    declare function pages:puppet-page($url) as xs:string? {
        let $cmd := ($pages:NODE-BIN, $pages:GET-PUPPETEER-PAGE, $url)
        let $raw-page := process:execute($cmd, $pages:NODE-OPTIONS)
        return
            if (exists($raw-page)) then
                util:call($pages:STRIP-LINES, $raw-page)
            else
                ()
    };

The STRIP-LINES stuff is there because process:execute returns each line of the result wrapped as <line>contents</line>, which needs to be stripped out before XHTML-izing the rest.

This was originally written for eXist 2.x, so I didn't use pipes (the => arrow operator) in the filter. If I were to rewrite it now, it would probably look like this:

    declare function local:strip-lines($text as xs:string) as xs:string {
        $text
        => replace('<line>', '')
        => replace('</line>', '
')
    };

Notice that the CR (line break) is within the replacement quotes in the second replace. It helps a lot for readability of the results.

My version also has a whole slew of other checks, along with caching of the results, none of which are included here.

Hope this helps get you started.

Take care.

On Wed, 16 Mar 2022 at 8:17, Roy Walter via Exist-open <exi...@li...> wrote:

> Can't be done.
>
> nodejs/puppeteer/Chromium is the way.
>
> -- Roy
>
> On Tuesday, 15 March 2022, 19:49:28 GMT, Chris Wallace <kit...@gm...> wrote:
>
> I developed an interface to the Bristol City Planning portal which relies
> on page-scraping and PDF parsing:
>
> https://bristoltrees.space/Planning/map
>
> The latest version of the software (by iDox) has made the change that so
> many sites have done, generating pages in JavaScript. This would require
> first rendering each page and scraping the result. Any thoughts on doing
> this in exist_db?
>
> And no, there is no public API that I can find
>
> Chris
> _______________________________________________
> Exist-open mailing list
> Exi...@li...
> https://lists.sourceforge.net/lists/listinfo/exist-open

--
Michael Westbay
Writer/System Administrator
http://www.japanesebaseball.com/
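
A possible variation on the strip-lines filter, offered only as a minimal sketch: if process:execute hands back an execution element whose stdout child holds one line element per line of output (an assumption about the process module's result shape, so check it against your eXist version), the text can be rebuilt with string-join instead of string replacement. The helper name local:stdout-text, the exit-code check, and the example.org URL are hypothetical and not part of the code in this thread.

```xquery
xquery version "3.1";

(: Assumes the process module is enabled in conf.xml. :)
import module namespace process = "http://exist-db.org/xquery/process";

(: Hypothetical helper: joins the <line> children of <stdout> with
   line breaks. Assumes a result shaped like
   <execution exitCode="0"><stdout><line>...</line></stdout></execution>
   in no namespace. Returns the empty sequence when there is no result
   or the exit code is not 0. :)
declare function local:stdout-text($result as element()?) as xs:string? {
    if (exists($result) and $result/@exitCode = '0') then
        string-join($result/stdout/line, '&#10;')
    else
        ()
};

(: Paths are the placeholders used earlier in this thread;
   the URL is only an example. :)
let $cmd := ('/usr/local/bin/node', 'get-puppet-page.js', 'https://example.org/')
let $options :=
    <options>
        <workingDir>/home/my-user/bin</workingDir>
    </options>
return
    local:stdout-text(process:execute($cmd, $options))
```

The joined string can then go through the same XHTML-izing step as before. Checking the exit code first keeps a failed node run from being parsed as if it were a page.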