NAPS2 - Not Another PDF Scanner / Discussion / General Discussion: How are imported/exported images handled?

Tony Jones - 2018-11-14

I'm using the following:

NAPS2 6.0.3 (I don't think it's 6.X related)

pdfimages v0.51

The help for pdfimages is rather confusing. It states:
-png : change the default output format to PNG
-tiff : change the default output format to TIFF
-j : write JPEG images as JPEG files
-jp2 : write JPEG2000 images as JP2 files
-jbig2 : write JBIG2 images as JBIG2 files
-ccitt : write CCITT images as CCITT files
-all : equivalent to -png -tiff -j -jp2 -jbig2 -ccitt

My understanding (I would welcome education on this) was that there are only two possible encodings for an XObject: raw data or JPEG data. In the source PDF, it appears to me that the XObject is raw image data.

Anyways, if I specify -all, pdfimages exports the data as png.

The issue is that if I import a PDF into NAPS2, then export a page as PNG, the file size in bytes is far larger than if I use pdfimages to extract the image from the original PDF.

$ ls -l naps2/base-000.png pdfimages/base-000.png
-rwxrwx---+ 1 tony None 126515 Nov 14 10:17 naps2/base-000.png
-rwxrwx---+ 1 tony None 76391 Nov 14 09:27 pdfimages/base-000.png

$ magick identify naps2/base-000.png pdfimages/base-000.png
naps2/base-000.png PNG 2500x3250 2500x3250+0+0 8-bit sRGB 126515B 0.000u 0:00.000
pdfimages/base-000.png PNG 2500x3250 2500x3250+0+0 8-bit sRGB 76391B 0.000u 0:00.000

Similarly if I simply import the PDF into NAPS and save it again, the PDF is significantly larger.

$ ls -l naps2export.pdf orig.pdf
-rwxrwx---+ 1 tony None 119512540 Nov 14 10:32 naps2export.pdf
-rwxrwx---+ 1 tony None 85687738 Nov 14 10:28 orig.pdf

I'm not sure whether this is occurring at the import or the export stage. A useful feature would be the ability to click on a page and have an identify option that displays info about the image (similar to magick identify).

Tony Jones - 2018-11-14

Also, if I specify no options to pdfimages, it defaults to .ppm output.

Ben Olden-Cooligan - 2018-11-14

If the original PDF does not come from NAPS2, the issue may be that NAPS2 renders the PDF page rather than extracting the raw image data, and the rendered result differs slightly from the original image, which makes it harder to compress. pdfimages might be smarter about how it extracts image data from PDFs. Can you attach a sample original PDF that has this issue?

Tony Jones - 2018-11-14

It doesn't appear to be an issue with the PDF import.

I think it's just an intrinsic issue with PNG files and how NAPS2 converts to/from its internal format.

Using a png previously exported by NAPS2, if I compress it using pngcrunch the resulting file is reduced in size from 323,672 to 162,659 bytes (attached).

If I import this into NAPS2 and save it as a PNG, the resulting file is 326,819 bytes.

I do have the very original PDF file. I can certainly make it available, but I'm not sure it's useful, as it's not clear that the Ximage data is imported in the same manner for a) NAPS2 PDF import (gs?) vs b) pdfimages pdf->png + NAPS2 PNG import.

This original PDF I began with was 85,687,738 bytes. I've combined about 34 11x17 images into 17 11x24" images using Microsoft image compositor, rearranged a bunch of pages, deskewed pages and added 6 missing 8.5x11" pages (new scan). So a lot of work but not much new content. The problem is that the resulting PDF out of NAPS is 232,270,694 bytes, close to triple the original size.

The only option I can see is to convert the PNGs that have been shrunk by pngcrunch to a different format that NAPS2 can import and that is more stable sizewise inside NAPS2.

base.001.png

Tony Jones - 2018-11-14

If I take the original PDF (85,687,738 bytes), import it and save it (no changes), the resulting PDF is 119,512,540 bytes.

If I take the original PDF (85,687,738 bytes), import it, run all the pages through deskew, and save it, then a) the save takes ~30x longer [I assume there is an optimization for unaltered pages] and b) the resulting PDF is 221,698,700 bytes, which explains the situation I've got myself into.

So there are two issues:
1) The above. I can make the original pdf available if it will help.
2) Previous post. I don't want to redo all the changes. Do you have a suggestion for what I can do with the PNGs that I've reduced in size through pngcrunch so that I can get them into NAPS2 for OCRing and saved back out as a reasonably sized PDF?

Last edit: Tony Jones 2018-11-14

Tony Jones - 2018-11-14

I emailed you a link to PDFs. They are large.

Tony Jones - 2019-04-13

Ben.

Any more suggestions here?

I just started trying to clean up another PDF file, 502 pages.

31,401,008 bytes initially (http://vintagedirtbiker.com/photos/Motorcycles%20and%20Projects/Manuals/Honda/NSR250%20(English)%20Manual.PDF)

31,495,944 bytes if I import it into NAPS and save it out again (default PDF settings)

332,370,526 bytes if I import it into NAPS, run deskew on all pages and save it out again (default PDF settings)

I was actually able to cause it to save out as 1,042,226,204 bytes if I import it into NAPS, run deskew on all pages, manually rotate by degree about 70 pages that deskew did not do correctly, and save it out again (default PDF settings)

This is all using version 6.0.4.22698

Thanks!
- Ben Olden-Cooligan - 2019-04-13
  
  That file specifically does some weird things with compression, with parts of the image being grayscale and parts being pure black/white. That's probably not something NAPS2 is ever going to do.
  
  If you're willing to accept some loss of fidelity you could use the Black+White image option.
  
- Ben Olden-Cooligan - 2019-04-13
  
  The other issue is that NAPS2 can't tell the DPI of imported PDFs, so whenever you do some image editing it renders it at 300dpi, which is probably a lot higher resolution than that doc actually is. At some point I'd like to add a way to change this, but at the moment it's not configurable.
  
  - Tony Jones - 2019-04-13
    
    Hmmn.
    
    orig directory is the images exported out of NAPS2 after importing the PDF, with no editing.
    deskew.fixed directory is after deskew and hand rotating and export images
    
    $ stat --printf="%s " orig/img.006.png deskew.fixed/img.006.png
    533326 1602573
    
    So it's 3x the size after deskew/rotation.
    
    ImageMagick reports that the resolution of the two files is nearly identical.
    
    $ magick identify -format "%w x %h %x x %y" orig/img.006.png deskew.fixed/img.006.png
    2550 x 3300 118.11 x 118.11
    2550 x 3300 118.09999999999999 x 118.09999999999999
    
    Attached is the original image.
    
    Last edit: Tony Jones 2019-04-13
    
    img.006.png
    
Tony Jones - 2019-04-13

Here is the 3x size deskewed image.

img.006.png

Tony Jones - 2019-04-13

Looking at the previous two files (img.006) using ImageMagick's identify feature, I'm not seeing a lot of differences to account for the 3x size increase. The modified version has a PNG color type of RGBA compared to TrueColor in the original. pngcrunch didn't achieve any meaningful reduction of the deskewed img.006.png; it's still 3x the size of the original.
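
For anyone wanting to check the color type without ImageMagick, here is a minimal sketch that reads the IHDR chunk at the start of a PNG and reports the dimensions, bit depth, color type, and uncompressed pixel-data size. The function names `read_ihdr` and `make_header` are made up for this illustration, and `make_header` only fabricates a signature plus IHDR chunk for the demo; it is not a complete PNG. The chunk layout follows the PNG specification.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

# PNG color type code -> (description, channels per pixel)
COLOR_TYPES = {
    0: ("Grayscale", 1),
    2: ("Truecolor (RGB)", 3),
    3: ("Indexed", 1),
    4: ("Grayscale + alpha", 2),
    6: ("Truecolor + alpha (RGBA)", 4),
}


def read_ihdr(data):
    """Parse the IHDR chunk from the first bytes of a PNG file."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    length, chunk_type = struct.unpack(">I4s", data[8:16])
    if chunk_type != b"IHDR" or length != 13:
        raise ValueError("IHDR chunk missing or malformed")
    width, height, bit_depth, color_code = struct.unpack(">IIBB", data[16:26])
    name, channels = COLOR_TYPES[color_code]
    # Rows are padded to whole bytes; the per-row filter byte is ignored here.
    row_bytes = (width * channels * bit_depth + 7) // 8
    return {
        "width": width,
        "height": height,
        "bit_depth": bit_depth,
        "color_type": name,
        "raw_bytes": row_bytes * height,
    }


def make_header(width, height, bit_depth, color_code):
    """Hypothetical helper: build a PNG signature + IHDR chunk for the demo."""
    body = struct.pack(">IIBBBBB", width, height, bit_depth, color_code, 0, 0, 0)
    crc = zlib.crc32(b"IHDR" + body)
    return PNG_SIGNATURE + struct.pack(">I", len(body)) + b"IHDR" + body + struct.pack(">I", crc)


if __name__ == "__main__":
    # For a real file: read_ihdr(open("img.006.png", "rb").read(33))
    for code in (2, 6):
        print(read_ihdr(make_header(2550, 3300, 8, code)))
```

At 8 bits per channel an RGBA page carries 4 bytes per pixel against 3 for RGB, so the raw data is a third larger before compression. A fully opaque alpha channel usually deflates well, so this alone may not account for a 3x difference.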

The comment on greyscale is interesting. The manual has lots of pages containing greyscale photos, but none of those pages show any noticeable increase in file size. Some of the largest size increases are on pages where you would not expect them, such as the attached, which went from 58k to 283k when the skew was removed.

Last edit: Tony Jones 2019-04-13

img.111-deskew.png

img.111.orig.png

Ben Olden-Cooligan - 2019-04-13

I had another look and it seems like the native DPI might actually be 300 for this doc, so that's not the issue. The thing to understand though, is that when you import a PDF and then save it as PNG (or do any kind of image editing), NAPS2 has to guess at the resolution. But if you don't make any changes to the page, there's a special case that copies the raw PDF page from the source file.

Specifically the issue with the PNG above (and probably a lot of the pages) is that you start with a two-tone image (black and white only). But when you deskew it, the pixels blur together, so you get all kinds of grey pixels as well. PNG has a much harder time compressing that. And if you save it as a PDF, it can't use the super-efficient two-tone encoding.
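
The effect described above can be reproduced on a synthetic page. The following sketch builds a black-and-white image, applies a small shear with linear interpolation as a stand-in for deskewing, and compares how well zlib's deflate (the compression method PNG uses) handles each. The shear is an assumption made for illustration and is not NAPS2's actual deskew code, and real PNG encoders also apply per-row filters that this skips.

```python
import zlib

WIDTH, HEIGHT = 400, 400


def two_tone_page():
    """White page (255) with black (0) dashes standing in for lines of text."""
    rows = []
    for y in range(HEIGHT):
        row = bytearray([255] * WIDTH)
        if y % 20 < 4:  # a 4 px tall "text line" every 20 rows
            for x in range(40, WIDTH - 40):
                if (x // 6) % 2 == 0:
                    row[x] = 0
        rows.append(row)
    return rows


def shear_with_interpolation(rows, slope=0.01):
    """Shift each column vertically by x * slope, blending adjacent rows."""
    height, width = len(rows), len(rows[0])
    out = [bytearray(width) for _ in range(height)]
    for x in range(width):
        shift = x * slope
        whole = int(shift)
        frac = shift - whole
        for y in range(height):
            y0 = min(max(y + whole, 0), height - 1)
            y1 = min(y0 + 1, height - 1)
            # Fractional shifts mix black and white into intermediate greys.
            out[y][x] = round(rows[y0][x] * (1 - frac) + rows[y1][x] * frac)
    return out


def deflated_size(rows):
    """Size in bytes of the raw pixel rows after zlib compression."""
    return len(zlib.compress(b"".join(rows), 9))


if __name__ == "__main__":
    original = two_tone_page()
    skewed = shear_with_interpolation(original)
    for label, image in (("two-tone", original), ("after shear", skewed)):
        levels = len(set(b"".join(image)))
        print(label, "distinct grey levels:", levels, "deflated bytes:", deflated_size(image))
```

The interpolated image contains many intermediate grey values along every edge, which gives deflate far less repetition to exploit than the original two-level data.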

You can improve it by using the "Black and White" option under the image menu first. But of course, for any parts of the image that are grayscale (the pictures) that won't look great.
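
As a rough illustration of why a black-and-white conversion helps, this sketch thresholds 8-bit grayscale pixels to two levels and packs them at one bit per pixel. The fixed cutoff of 128 and the function names are assumptions for the example; how NAPS2 implements its option is not stated in this thread.

```python
import zlib


def threshold(pixels, cutoff=128):
    """Map each 8-bit grey value to pure black (0) or pure white (255)."""
    return bytes(255 if p >= cutoff else 0 for p in pixels)


def pack_bits(pixels):
    """Pack two-tone pixels 8 per byte (1 = white), zero-padding the tail."""
    out = bytearray()
    for i in range(0, len(pixels), 8):
        chunk = pixels[i:i + 8]
        byte = 0
        for p in chunk:
            byte = (byte << 1) | (1 if p else 0)
        byte <<= 8 - len(chunk)  # left-align a short final chunk
        out.append(byte)
    return bytes(out)


if __name__ == "__main__":
    # One scanline with soft grey edges around a black stroke, repeated 300 times.
    row = bytes([255] * 20 + [200, 120, 40] + [0] * 10 + [60, 140, 220] + [255] * 20)
    grey = row * 300
    packed = pack_bits(threshold(grey))

    print("8-bit grey, raw bytes:", len(grey))
    print("1-bit packed, raw bytes:", len(packed))
    print("8-bit grey, deflated bytes:", len(zlib.compress(grey, 9)))
    print("1-bit packed, deflated bytes:", len(zlib.compress(packed, 9)))
```

The same thresholding is what destroys the photos: every grey value collapses to black or white, so any tonal detail in a picture is lost.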

Last edit: Ben Olden-Cooligan 2019-04-13

- Tony Jones - 2019-04-13
  
  I was aware of the special case (pass thru on non edit). I'm fairly sure I actually hand edited many of the picture pages but of course since they are greyscale to begin with, they don't grow in size much. I understand what you're saying about the black and white pages. However, I need the images to be readable and they won't be if the whole document is converted to black and white.
  
  This would seem to be something that would be fairly common, no? Or (in the input doc) is having the simple pages optimized to black/white fairly unusual?
  
  This seems basically the same as what was happening in the first post on this thread (an entirely different document). That document didn't have any greyscale images needing protecting so the b/w option might be viable for it, I'd have to check but it's not viable for this second document as I need those images to be interpretable.
  
  When you say "That's probably not something NAPS2 is ever going to do", what is the reason for this? This is an open source project and I'm a software engineer, so I should chip in. Adding another personal project to my queue when I'm not making any progress on the other half dozen doesn't seem smart, but I'm curious about the basis for the comment, what the challenges are, etc.
  
  Are there any other workarounds or tools that might be able to deskew and OCR this original document without ballooning its size? Going from the 31MB original to 1042MB after the deskewed version went through OCR is a big increase.
  
- Tony Jones - 2019-04-13
  
  Also, I'm not sure the comment "is that you start with a two-tone image (black and white only)." is true. If you look closely at 006 and 111 the hole punches for example are greyscale in both images. Obviously preserving the nature of the holes in paper isn't anything I much care about, I was just pointing out that these are not b/w images. OTOH maybe I'm not understanding something.
  
  Last edit: Tony Jones 2019-04-13
  
Ben Olden-Cooligan - 2019-04-13

The complex thing here is that whatever software originally generated the PDF had some algorithm to encode parts of the image as grayscale and parts as BW. If you zoom in enough you should be able to tell the regions apart.

That's very non-trivial to do, and I'm unlikely to prioritize it. It's not impossible that someone might contribute some code to do that, but that's the source of the "probably not something NAPS2 is going to do".

- Tony Jones - 2019-04-13
  
  I see. Thanks for the clarification.
  
  - Tony Jones - 2019-04-15
    
    I tried using scantailor. I noticed it's not really maintained. Last update was several years ago.
    
    a) it's much harder to use than NAPS b) the image quality is badly affected by the transformations as you can see in the attached.
    
    For pages that contain a lot of greyscale already, there isn't much size expansion after NAPS deskews and the quality is very good compared to the original.
    
    It's just the simpler pages containing b/w elements that suffer a size expansion.
    
    When I get some time I'll try digging into the code, as NAPS seems to be doing a really good job in every other regard. I have also noticed some cases where deskew gets things badly wrong, so I'll look at that too.
    
    Thanks for such a great program. Please don't take the above as any kind of complaint.
    
    naps deskew 3.49mb.png
    
    orig 3.34mb.png
    
    scantailor 6.02mb.png
    