From: Roy W. <gar...@ya...> - 2022-03-16 19:46:28
Hi Chris,

That's what I meant. Just grab (scrape!) the page HTML in Node.js and pass it back to eXist for processing. You shouldn't have any cookie issues if you configure Puppeteer the right way; see puppeteer-extra. Just let your existing scraper in eXist do its thing with what you get back from Node.js.

R.

On Wednesday, 16 March 2022, 18:39:45 GMT, Chris Wallace <kit...@gm...> wrote:

Thanks, guys, for the help. I fancy leaving the page scraping in eXist and simply returning the full HTML as a response from Puppeteer; I expect there may be issues with cookies. Kevin, a page scraper doesn't need CSS to be rendered; it's only looking at the HTML, not the screen image. Puppeteer does render CSS to form an image anyway, doesn't it? I thought its main purpose was to automate unit testing by generating screenshots, possibly with different screen/browser configurations.

Chris

On Wed, Mar 16, 2022 at 6:11 PM Kevin Brown <kev...@xp...> wrote:

The problem with that is that the CSS is not resolved. Hence I pointed you to @cloudformatter. You can examine the JS in those pages that actually resolves the CSS and yields full HTML.

Kevin

-------- Original message --------
From: Roy Walter <gar...@ya...>
Date: 3/16/22 4:38 AM (GMT-08:00)
To: kev...@xp..., Chris Wallace <kit...@gm...>
Cc: exist-open <exi...@li...>
Subject: Re: [Exist-open] Page rendering

Hi Chris,

You can do that. (We do.) Configure a web server in Node.js and send a GET request from eXist with the URL as a parameter. Then do the scraping in Puppeteer/Chromium. You can either process the returned HTML/XML payload directly or PUT the file to eXist from Node.js.

R.

On Tuesday, 15 March 2022, 23:12:38 GMT, Chris Wallace <kit...@gm...> wrote:

Thanks Roy, yes, that's just the ticket.
Actually, I was wondering if one could set up an HTTP service based on Puppeteer which took a URL, rendered the page, and returned the full page HTML, so that the current scraper would be unchanged. I'm surprised there isn't a service already that does just that; perhaps there is.

Kevin, it's just the common problem these days of scraping the JavaScript-generated web. XQuery is great for scraping HTML pages, but JavaScript-generated pages have to be rendered by a browser engine before we get the displayed HTML. I wondered if anyone had wrapped a browser engine for use in eXist-db. Perhaps mentioning PDFs was confusing: I parse PDF documents using the Content Extraction Module, which wraps Apache Tika (it does a good job, although it sometimes misses spaces, which is a problem for text analysis), and this is unchanged.

Chris

On Tue, Mar 15, 2022 at 8:23 PM Roy Walter <gar...@ya...> wrote:

Can't be done. nodejs/puppeteer/Chromium is the way.

-- Roy

On Tuesday, 15 March 2022, 19:49:28 GMT, Chris Wallace <kit...@gm...> wrote:

I developed an interface to the Bristol City Planning portal which relies on page scraping and PDF parsing: https://bristoltrees.space/Planning/map

The latest version of the software (by iDox) has made the change that so many sites have made: generating pages in JavaScript. This would require first rendering each page and then scraping the result. Any thoughts on doing this in eXist-db? And no, there is no public API that I can find.

Chris
_______________________________________________
Exist-open mailing list
Exi...@li...
https://lists.sourceforge.net/lists/listinfo/exist-open