From: Chris W. <kit...@gm...> - 2022-03-16 18:40:27
Thanks guys for the help. I fancy leaving the page scraping in eXist and simply returning the full HTML as a response from Puppeteer - there may be issues with cookies, I expect. Kevin, a page scraper doesn't need CSS to be rendered; it's only looking at the HTML, not the screen image. Puppeteer does render CSS to form an image anyway, doesn't it? I thought its main purpose was to automate unit testing by generating screenshots, possibly with different screen/browser configurations?

Chris

On Wed, Mar 16, 2022 at 6:11 PM Kevin Brown <kev...@xp...> wrote:
> The problem with that is that the CSS is not resolved. Hence I pointed you
> to the @cloudformatter. You can examine the JS in those pages that
> actually resolves the CSS and yields full HTML.
>
> Kevin
>
> Sent from my Verizon, Samsung Galaxy smartphone
>
> -------- Original message --------
> From: Roy Walter <gar...@ya...>
> Date: 3/16/22 4:38 AM (GMT-08:00)
> To: kev...@xp..., Chris Wallace <kit...@gm...>
> Cc: exist-open <exi...@li...>
> Subject: Re: [Exist-open] Page rendering
>
> Hi Chris,
>
> You can do that. (We do.)
>
> Configure a web server in nodejs and send a GET request from eXist with
> the URL as a parameter. Then do the scraping in puppeteer/Chromium. You can
> either process the returned html/xml payload directly or PUT the file to
> eXist from nodejs.
>
> R.
>
> On Tuesday, 15 March 2022, 23:12:38 GMT, Chris Wallace <kit...@gm...> wrote:
>
> Thanks Roy - yes, that's just the ticket. Actually I was wondering if one
> could set up an HTTP service based on Puppeteer which took a URL, rendered
> the page and returned the full page HTML, so that the current scraper would
> be unchanged. Surprised there isn't a service already to do just that -
> perhaps there is.
>
> Kevin, it's just a case of the common problem of scraping the
> JavaScript-generated web we see these days.
> XQuery is great for scraping
> HTML pages, but JavaScript-generated pages have to be rendered by a browser
> engine before we get the displayed HTML. I wondered if anyone had wrapped
> a browser engine for use in eXist-db. Perhaps mentioning PDFs was
> confusing - I parse PDF documents using the Content Extraction Module, which
> wraps Apache Tika (it does a good job, although it does miss spaces sometimes,
> which is a problem for text analysis), and this is unchanged.
>
> Chris
>
> On Tue, Mar 15, 2022 at 8:23 PM Roy Walter <gar...@ya...> wrote:
>
> Can't be done.
>
> nodejs/puppeteer/Chromium is the way.
>
> -- Roy
>
> On Tuesday, 15 March 2022, 19:49:28 GMT, Chris Wallace <kit...@gm...> wrote:
>
> I developed an interface to the Bristol City Planning portal which relies
> on page-scraping and PDF parsing:
>
> https://bristoltrees.space/Planning/map
>
> The latest version of the software (by iDox) has made the change that so
> many sites have made: generating pages in JavaScript. This would require
> first rendering each page and scraping the result. Any thoughts on doing
> this in eXist-db?
>
> And no, there is no public API that I can find.
>
> Chris
> _______________________________________________
> Exist-open mailing list
> Exi...@li...
> https://lists.sourceforge.net/lists/listinfo/exist-open