Best tool to extract image data from PDF for later reassembly by NAPS2

Tony Jones
2016-09-28
2021-02-16
  • Tony Jones

    Tony Jones - 2016-09-28

    I'm trying to deskew an existing PDF file (basically remove the ADF skew) and add OCR.

    NAPS2 will only export JPEG images from a PDF it has itself created (I'm not sure what the reason is for this restriction).

    I've tried various tools to extract the JPG data from the PDF, but I've not found one that will extract at a decent resolution (or ideally at whatever resolution is native inside the PDF). TTR PDF To JPG (SourceForge) was the worst: each page was saved as a 385x600 (0.23MP) 300dpi JPEG with a claimed print size of 1.23x2.00 inches.

    I tried using the latest ImageMagick for Windows to operate directly on the PDF, but it failed with a PostScript stack error.

    So I ended up using pdfimages (from xpdf), which extracts each page as a 1-bit PBM (rotated and inverted, but that's not a big deal). The -j option doesn't extract in JPEG format (I'm guessing because the data in the PDF is not DCT-encoded?). Each PBM is 2504x1608 (4.03MP), 1-bit, with a print size of 8.35"x5.36" at 300dpi.
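
    The figures quoted above can be checked with a little arithmetic: megapixels are width times height divided by one million, and print size is pixels divided by dpi. The helper name `image_stats` below is made up for illustration.

```python
def image_stats(width_px, height_px, dpi):
    """Return (megapixels, print width in inches, print height in inches)."""
    megapixels = width_px * height_px / 1_000_000
    return (
        round(megapixels, 2),
        round(width_px / dpi, 2),
        round(height_px / dpi, 2),
    )


# The PBM pages extracted by pdfimages.
print(image_stats(2504, 1608, 300))  # (4.03, 8.35, 5.36)

# The pages written by TTR PDF To JPG.
print(image_stats(385, 600, 300))  # (0.23, 1.28, 2.0)
```

    The PBM numbers match the post exactly. For the 385x600 output the width works out to about 1.28 inches rather than the 1.23 the tool claimed, which may just be how that tool rounds or reports it.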

    At this point I use ImageMagick again: 'magick convert file.pbm -deskew 40 -invert -rotate 90 file.jpg'. I can then import the images into NAPS2, but the resulting PDF is huge: 87MB compared to 8.5MB for the original.
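
    For a whole folder of extracted pages, the per-file step can be scripted. This is only a sketch: the helper names `deskew_command` and `run_all` are hypothetical, and it assumes `magick` is on the PATH. It spells the inversion as `-negate` and the threshold as `40%`, since those are the forms ImageMagick's option reference documents; adjust if your build behaves differently.

```python
import subprocess
from pathlib import Path


def deskew_command(src, dst_suffix=".jpg", threshold="40%"):
    """Build the ImageMagick argv that deskews, inverts and rotates one page."""
    src = Path(src)
    dst = src.with_suffix(dst_suffix)
    return [
        "magick", "convert", str(src),
        "-deskew", threshold,  # straighten the scanned page
        "-negate",             # undo the inverted extraction
        "-rotate", "90",       # undo the rotated extraction
        str(dst),
    ]


def run_all(folder):
    """Run the command for every .pbm in folder, in file-name order."""
    for pbm in sorted(Path(folder).glob("*.pbm")):
        subprocess.run(deskew_command(pbm), check=True)
```

    Calling `run_all(".")` in the directory that holds the PBM files would write one JPEG next to each of them.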

    Is there a better way to go about this using open-source tools? I'm using NAPS2 to reconstruct the final PDF because I want it OCR'd. Obviously there is no native deskew, so I'm going through the above steps.

    Thanks for any suggestions

     
  • Ben Olden-Cooligan

    I don't know if this would be any better, but you could try using "Print to PDF" (with CutePDF or any other PDF printer) on the original PDF file, and then use those tools on the second PDF file.

    You could also use ImageMagick like you did, but reduce the resolution or quality of the JPEG files before importing them into NAPS2 to reduce the file size.
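
    A rough sketch of that second suggestion, using ImageMagick's `-resize` and `-quality` options. The helper name `shrink_command` and the default values (50% scale, quality 60) are arbitrary placeholders to tune by eye, not recommendations.

```python
def shrink_command(src, dst, scale_percent=50, quality=60):
    """Build an ImageMagick argv that downsizes a page and lowers JPEG quality."""
    return [
        "magick", "convert", src,
        "-resize", f"{scale_percent}%",  # fewer pixels per page
        "-quality", str(quality),        # stronger JPEG compression
        dst,
    ]


print(" ".join(shrink_command("file.jpg", "file-small.jpg")))
```

    Halving the scale cuts the pixel count to a quarter, so it usually affects file size more than a modest quality change does, at the cost of sharpness for OCR.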

     
  • Tony Jones

    Tony Jones - 2016-09-30

    On Linux, using img2pdf (https://github.com/josch/img2pdf), I get far better results going pdf->pbm->jp2 (JPEG 2000) than going pdf->pbm->jpg.

    The original PDF isn't DCT-encoded, so pdfimages can only extract as PBM or PNG. Using ImageMagick to convert to JPG just results in either a) a huge file or b) poor quality.

    Using ImageMagick to convert from pbm to jp2 and then img2pdf to convert to PDF results in a file size of 20MB, the same as the original, and the quality is fine.
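
    The pbm->jp2->pdf route could be scripted roughly like this. It is a sketch under assumptions: `jp2_pipeline` is a made-up helper name, the ImageMagick build needs JPEG 2000 support, and the deskew options would be added to the convert step as needed. It only builds the command lines, so they can be inspected before being run.

```python
from pathlib import Path


def jp2_pipeline(pbm_files, pdf_name="out.pdf"):
    """Return (convert commands, img2pdf command) for a set of PBM pages."""
    # pdfimages zero-pads its page numbers, so sorting by name keeps page order.
    pages = sorted(Path(p) for p in pbm_files)
    jp2_names = [str(p.with_suffix(".jp2")) for p in pages]
    convert_cmds = [
        ["magick", "convert", str(p), jp2]
        for p, jp2 in zip(pages, jp2_names)
    ]
    assemble_cmd = ["img2pdf", *jp2_names, "-o", pdf_name]
    return convert_cmds, assemble_cmd


converts, assemble = jp2_pipeline(["page-001.pbm", "page-000.pbm"])
for cmd in converts + [assemble]:
    print(" ".join(cmd))
```

    Each list can be passed to `subprocess.run(cmd, check=True)` once it looks right.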

    Unfortunately, this achieves the deskew I wanted but no OCR, and NAPS2 doesn't seem to import JPEG 2000.

    Also, if I do try to import all the pages as JP2, I get a modal 'the file xxx.jp2 could not be imported' dialog for each one. Clicking OK doesn't give me time to click Cancel on the parent dialog, so I have to 'OK' through 472 dialogs.

    Tony

     
  • Kpym

    Kpym - 2021-02-16

    This is an old question, but on Windows, Mac and Linux you can extract images from a PDF (without recompression) with several multi-platform tools:

    1) pdfcpu

    > pdfcpu extract -m image in.pdf .
    

    2) pdfimages

    > pdfimages -all in.pdf page
    

    3) mutool

    > mutool extract in.pdf
    

    4) img2pdf, which has already been mentioned for Linux, also exists for Windows and Mac (it's a Python script).
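
    If you are not sure which of the three extractors is installed, a small script can pick the first one it finds and build the matching command from the list above. The names `EXTRACTORS`, `pick_extractor` and `extraction_command` are hypothetical. Note that `-all` is an option of the Poppler build of pdfimages, so this assumes that build.

```python
import shutil

# argv builders for the three commands listed above.
EXTRACTORS = {
    "pdfcpu": lambda pdf, out: ["pdfcpu", "extract", "-m", "image", pdf, out],
    "pdfimages": lambda pdf, out: ["pdfimages", "-all", pdf, f"{out}/page"],
    # mutool writes into the current directory, so run it with cwd=out.
    "mutool": lambda pdf, out: ["mutool", "extract", pdf],
}


def pick_extractor(which=shutil.which):
    """Return the name of the first extractor found on PATH, or None."""
    for name in EXTRACTORS:
        if which(name):
            return name
    return None


def extraction_command(tool, pdf, out="."):
    """Build the argv for the chosen tool."""
    return EXTRACTORS[tool](pdf, out)


tool = pick_extractor()
if tool is None:
    print("no extractor found on PATH")
else:
    print(" ".join(extraction_command(tool, "in.pdf")))
```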

     

    Last edit: Kpym 2021-02-16
