From: Chris W. <kit...@gm...> - 2022-03-15 19:49:13
|
I developed an interface to the Bristol City Planning portal which relies on page-scraping and PDF parsing: https://bristoltrees.space/Planning/map

The latest version of the software (by iDox) has made the change that so many sites have made: generating pages in JavaScript. This would require first rendering each page and then scraping the result. Any thoughts on doing this in eXist-db?

And no, there is no public API that I can find.

Chris |
From: Chris W. <kit...@gm...> - 2022-03-15 23:14:21
|
Thanks Roy - yes, that's just the ticket. Actually I was wondering if one could set up an HTTP service based on puppeteer which took a URL, rendered the page, and returned the full page HTML, so that the current scraper would be unchanged. Surprised there isn't a service already to do just that - perhaps there is.

Kevin, it's just the common problem nowadays of scraping the JavaScript-generated web. XQuery is great for scraping HTML pages, but JavaScript-generated pages have to be rendered by a browser engine before we get the displayed HTML. I wondered if anyone had wrapped a browser engine for use in eXist-db.

Perhaps mentioning PDFs was confusing - I parse PDF documents using the Content Extraction Module, which wraps Apache Tika (it does a good job, although it sometimes misses spaces, which is a problem for text analysis), and this is unchanged.

Chris |
From: Kevin B. <kev...@xp...> - 2022-03-15 23:25:25
|
See my system @cloudformatter. But it is "in browser" … it scrapes the HTML, converts that HTML to XSL-FO, and uses RenderX to format the page, post-JavaScript: www.cloudformatter.com/css2pdf

We have many installations of this sitting on top of eXist.

Kevin |
From: Roy W. <gar...@ya...> - 2022-03-16 11:38:44
|
Hi Chris,

You can do that. (We do.) Configure a web server in nodejs and send a GET request from eXist with the URL as a parameter. Then do the scraping in puppeteer/Chromium. You can either process the returned HTML/XML payload directly or PUT the file to eXist from nodejs.

R. |
From: Kevin B. <kev...@xp...> - 2022-03-16 18:11:45
|
The problem with that is that the CSS is not resolved. Hence I pointed you to @cloudformatter. You can examine the JS in those pages that actually resolves the CSS and yields the full HTML.

Kevin

Sent from my Verizon, Samsung Galaxy smartphone |
From: Chris W. <kit...@gm...> - 2022-03-16 18:40:27
|
Thanks guys for the help. I fancy leaving the page scraping in eXist and simply returning the full HTML as a response from puppeteer - though I expect there may be issues with cookies.

Kevin, a page scraper doesn't need CSS to be rendered; it's only looking at the HTML, not the screen image. Puppeteer does render CSS to form an image anyway, doesn't it? I thought its main purpose was to automate unit testing by generating screenshots, possibly with different screen/browser configurations.

Chris |
From: Roy W. <gar...@ya...> - 2022-03-16 19:46:28
|
Hi Chris,

That's what I meant. Just grab (scrape!) the page HTML in nodejs and pass it back to eXist for processing. You shouldn't have any cookie issues if you configure puppeteer the right way - see puppeteer-extra. Just let your existing scraper in eXist do its thing with what you get back from nodejs.

R. |
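[Editor's note] puppeteer-extra, which Roy mentions, is a plugin wrapper around puppeteer. A rough sketch of the kind of configuration he means might look like the following; the stealth plugin and the cookie-header helper are illustrative choices on my part, not something prescribed in the thread:

```javascript
// Launch a browser through puppeteer-extra with the stealth plugin enabled,
// so sites that fingerprint headless Chromium are less likely to object.
// The packages are required lazily so this file can be read and tested
// without puppeteer-extra installed.
function launchStealthBrowser() {
  const puppeteer = require('puppeteer-extra');
  const StealthPlugin = require('puppeteer-extra-plugin-stealth');
  puppeteer.use(StealthPlugin());
  return puppeteer.launch({ headless: true });
}

// Pure helper: flatten the array returned by puppeteer's page.cookies()
// into a single Cookie header value, e.g. for replaying the same session
// in a later plain HTTP request.
function cookieHeader(cookies) {
  return cookies.map((c) => `${c.name}=${c.value}`).join('; ');
}
```

Reusing one launched browser across requests (rather than launching per page) is also worth considering for a service like this, since Chromium start-up dominates the cost of a single render.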
From: Kevin B. <kev...@xp...> - 2022-03-17 00:59:25
|
Yes, my bad. I thought (well, because I am part owner of RenderX) that you were looking at rendering the page, not just scraping it. … When your favorite tool is a hammer, everything looks like a nail …

Kevin |
From: Roy W. <gar...@ya...> - 2022-03-15 23:15:48
|
Can't be done.

nodejs/puppeteer/Chromium is the way.

-- Roy |
From: Michael W. <wes...@ja...> - 2022-03-16 00:35:57
|
I second this approach. Here is the relevant code needed to do it from the eXist side:

    ...
    declare variable $pages:GET-PUPPETEER-PAGE := 'get-puppet-page.js';
    declare variable $pages:NODE-BIN := '/usr/local/bin/node';
    declare variable $pages:NODE-OPTIONS :=
        <options>
            <workingDir>/home/my-user/bin</workingDir>
            <environment>
                <env name="CHROME_DEVEL_SANDBOX" value="/usr/local/sbin/chrome-devel-sandbox"/>
            </environment>
        </options>;

    declare variable $pages:STRIP-LINES := util:function(xs:QName('local:strip-lines'), 1);

    declare function local:strip-lines($text as xs:string) as xs:string {
        replace(replace(replace($text, '<line>', ''), '</line>', ' '), '<line/>', ' ')
    };
    ...

    declare function pages:puppet-page($url) as xs:string? {
        let $cmd := ($pages:NODE-BIN, $pages:GET-PUPPETEER-PAGE, $url)
        let $raw-page := process:execute($cmd, $pages:NODE-OPTIONS)
        return
            if (exists($raw-page))
            then util:call($pages:STRIP-LINES, $raw-page)
            else ()
    };

The STRIP-LINES stuff is there because process:execute returns each line of the result wrapped as <LINE>contents</LINE>, which needs to be stripped out before XHTML-izing the rest. This was originally written for eXist 2.x, so I didn't use pipes in the filter. If I were to rewrite it now, it would probably be like:

    declare function local:strip-lines($text as xs:string) as xs:string {
        $text
        => replace('<line>', '')
        => replace('</line>', '
')
    };

Notice that the CR is within the replacement quotes in the second replace. It helps a lot for readability of the results.

I also have a whole slew of other checks in my version, along with caching of the results, that are not included here. Hope this helps get you started.

Take care.

-- Michael Westbay
Writer/System Administrator
http://www.japanesebaseball.com/ |
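[Editor's note] The node script that Michael's XQuery invokes (get-puppet-page.js) isn't shown in the thread. A minimal reconstruction, under the assumptions that it takes the URL as its first command-line argument and prints the rendered HTML to stdout for process:execute to capture, might look like this:

```javascript
// get-puppet-page.js (reconstructed sketch, not Michael's actual script):
// render one URL in headless Chromium and print the resulting HTML.
// Invoked from XQuery as: node get-puppet-page.js <url>

// Small illustrative guard so a bare host name still parses as a URL.
function ensureScheme(arg) {
  return /^https?:\/\//.test(arg) ? arg : `https://${arg}`;
}

async function main(target) {
  const puppeteer = require('puppeteer'); // assumes puppeteer is installed
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(ensureScheme(target), { waitUntil: 'networkidle0' });
    // process:execute on the eXist side captures this stdout line by line
    process.stdout.write(await page.content());
  } finally {
    await browser.close();
  }
}

if (process.argv[2]) {
  main(process.argv[2]).catch((err) => {
    console.error(err);
    process.exitCode = 1;
  });
} else {
  console.error('usage: node get-puppet-page.js <url>');
}
```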
From: Kevin B. <kev...@xp...> - 2022-03-16 00:31:48
|
What is it that you are looking for? Rendering content to PDF? Scraping content from PDF? Both? I think you need to explain a little more to help us understand what you are looking for exactly … I for one am confused.

Kevin Brown
RenderX |