From: Nathan V. G. <na...@va...> - 2015-05-14 14:50:14
Thanks Mike!

On Thu, May 14, 2015 at 7:20 AM, Mike Metcalfe <mi...@we...> wrote:

> Hi Mikel,
>
> I re-read your suggestion and realised that I don't need OCR because
> extracting the text from the PDF is sufficient. This has obviously improved
> performance all round.
>
> But in my attempts to improve performance I did get hold of Ted Han from
> docsplit, who says the next version will be faster by using Google's pdfium
> code inside the pdfshaver gem
> <snip>
> you should be able to install the alpha branch of docsplit w/ the
> pdfshaver changes just by saying `gem install docsplit --pre` and that
> should download the 0.8.0.alpha1
> </snip>
> Note that you need ruby2 or higher and you need pdfium installed - I used
> this package
> http://s3.documentcloud.org.s3.amazonaws.com/pdfium/libpdfium-dev_0.1%2Bgit20150311-1_amd64.deb
>
> Cheers
> Mike
>
>
> On 28 April 2015 at 09:05, Mike Metcalfe <mi...@we...> wrote:
>
>> Hi Mikel,
>>
>> That's a good idea, I do have some sites that don't need the text.
>> Unfortunately the client with the really big PDFs does need the text for
>> searching :-(
>>
>> On 28 April 2015 at 07:57, Mikel Larreategi <mla...@co...>
>> wrote:
>>
>>> On Mon, Apr 27, 2015 at 1:22 PM, Mike Metcalfe <mi...@we...>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm using c.documentviewer on a number of sites (thanks to wildcard for a
>>>> great add-on) but have encountered some issues with large PDF files.
>>>> docsplit is still processing a 40MB file that I started 13 hours ago. From
>>>> my readings on docsplit, this could be a size issue or it may be that the
>>>> contents are "non standard" PDF.
>>>>
>>>
>>> I had a similar issue in a site of ours, but the point was that I had
>>> text-extracting configured and the background process was wasting a lot of
>>> time trying to extract text from big PDF files made of JPGs. I disabled the
>>> text-extracting options on that site and the converting process was as
>>> fast as before.
>>>
>>> http://egoibarra.eus/es/publicaciones
>>>
>>> Mikel
>>>
>>> --
>>> Mikel Larreategi
>>> mla...@co...
>>>
>>> CodeSyntax
>>> Azitaingo Industrialdea 3 K
>>> E-20600 Eibar
>>> Tel: (+34) 943 82 17 80
>>>
>>
>>
>>
>> --
>> Mike Metcalfe
>>
>> 082 903 8268
>> mi...@we...
>> www.webtide.co.za
>>
>
>
>
> --
> Mike Metcalfe
>
> 082 903 8268
> mi...@we...
> www.webtide.co.za
>

--
Nathan Van Gheem
Solutions Architect
Wildcard Corp
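
For anyone finding this thread later: the fix Mike describes (taking the text already embedded in the PDF instead of running OCR) can be tried outside of c.documentviewer with a small script. This is only a rough sketch, not how the add-on itself calls docsplit. It assumes the `docsplit` command is on the PATH and relies on its `text` command with the `--no-ocr` and `--output` options; the function names and the file and directory names are made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def build_docsplit_text_cmd(pdf_path, output_dir, ocr=False):
    """Build the argument list for a docsplit text-extraction run.

    Hypothetical helper, not part of docsplit or c.documentviewer.
    """
    cmd = ["docsplit", "text"]
    # --no-ocr makes docsplit use only the text embedded in the PDF,
    # --ocr forces OCR on every page (slow for large scanned files).
    cmd.append("--ocr" if ocr else "--no-ocr")
    cmd += ["--output", str(output_dir), str(pdf_path)]
    return cmd


def extract_text(pdf_path, output_dir, ocr=False):
    """Run docsplit and raise if it is missing or exits with an error."""
    if shutil.which("docsplit") is None:
        raise RuntimeError("docsplit was not found on PATH")
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_docsplit_text_cmd(pdf_path, output_dir, ocr), check=True)
```

Calling `extract_text("big.pdf", "text_out")` would then write the extracted text into `text_out` without OCR. Building the command separately from running it makes it easy to print and check the exact command line before starting a long conversion.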