Hi!
I noticed that a b/w image is turned into a grayscale image when running unpaper on it.
This is what I did:
1. Import a b/w pdf
2. Export it as DjVu: The use of cjb2 indicates that gscan2pdf treats it as a b/w image (I probably could have found this out more easily, but I didn't know how)
3. Run unpaper on it
4. Export it again as DjVu: The use of c44 indicates that gscan2pdf treats it as a grayscale or color image.
This is what I suppose to be the reason for it:
By default, unpaper outputs what it (thinks it) gets: if you pass it a b/w image, it outputs b/w; if you pass it a grayscale image, it outputs grayscale. This seems to be determined by the file extension: pbm is b/w, pgm is grayscale.
gscan2pdf, on the other hand, stores its intermediate files as pnms and only keeps the image depth in its internal metadata (at least that is what I think I found out). unpaper therefore doesn't know the image depth and treats the image as grayscale.
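Regardless of the extension, the first two bytes of a PNM file say what it really contains (P1/P4 is a bitmap, P2/P5 grayscale, P3/P6 color). A minimal sketch for checking a temporary file by hand; the helper name pnm_kind is made up for this example and is not part of gscan2pdf or unpaper:

```shell
# pnm_kind FILE: print pbm, pgm or ppm according to the file's magic number.
# Hypothetical helper for manual checks; it only inspects the first two bytes.
pnm_kind() {
    magic=$(head -c 2 "$1")
    case "$magic" in
        P1|P4) echo pbm ;;   # bitmap, 1 bit per pixel
        P2|P5) echo pgm ;;   # grayscale
        P3|P6) echo ppm ;;   # color
        *)     echo unknown; return 1 ;;
    esac
}
```

Running it on a temp file named something.pnm shows whether the data inside is a bitmap or not, whatever the extension suggests.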
How this could be solved:
I think the issue could be solved by one or more of the following:
1. use .pbm as temp file extension for b/w images
2. pass unpaper an option "--type pbm" to get out a b/w image in pbm format
3. pass unpaper an option "--depth 1" to get out a b/w image
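Option 1 could be sketched roughly like this. It is only an illustration in shell, not the attached patch (which is Perl); the function name suffix_for_depth is made up, and it assumes the depth is already known from gscan2pdf's metadata:

```shell
# suffix_for_depth DEPTH: choose the temp file suffix from the stored image depth.
# Hypothetical helper: 1-bit images get .pbm, everything else stays .pnm.
suffix_for_depth() {
    if [ "$1" -eq 1 ]; then
        echo .pbm
    else
        echo .pnm
    fi
}

# Example: build the temp file name for a 1-bit page.
tmpname="page$(suffix_for_depth 1)"
echo "$tmpname"
```

The idea is simply that a b/w page never reaches unpaper under a generic .pnm name.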
Keep up the good work!
A preliminary patch that should fix the issue. These are my very first lines of Perl, so please don't be too harsh if I made a silly mistake. :-)
I don't know if notifications were sent when I added the patch. Could anyone have a look at it and improve and/or commit it? I did use gscan2pdf with this patch applied to clean up some of my scanned documents, but I don't know if I got all corner cases right.
Thanks for the patch.
This would be a bug in unpaper, but I can't reproduce it. Can you attach a PDF which demonstrates it?
Which version of unpaper are you using?
A DjVu test file
I added a .djvu file that shows the problem. You can reproduce it by doing the following:
1) Import it. Export it.
2) Unpaper it. Export it.
The file resulting from the second export is much bigger because it's now grayscale, not b/w.
I'm using unpaper 0.2.
The problem was not unpaper but ImageMagick not keeping the depth when converting from TIFF to PNM. Your patch, however, was nonetheless the correct approach and I applied it almost unaltered. There are no corner cases, as the resulting PBM is in any case deleted as soon as unpaper has processed it.
Thanks once again.
Updated patch respecting the depth of pnm files, too
Hm, I found a case where the fix fails: When the temporary image gs2p uses is a pnm instead of a tif, the depth is not tested, and unpaper still converts it to grayscale. So it seems that even b/w pnms have to be explicitly converted to pbms for unpaper to output a b/w image.
I updated the patch and hope it will provide a more robust solution.
Can you please also provide a .pnm that reproduces the problem?
Example files: a pnm (before running unpaper) and a log from the gs2p run
I added an archive containing a pnm example file and the log from the gs2p (unpatched) run.
The pnm was generated by importing a pdf file. I noticed one oddity with a couple of pdf files: after importing, they are inverted, so I had to run the invert filter to get usable output. Should I file a new bug for this?
Thanks for the info. However -
$ identify before_unpaper.pnm
before_unpaper.pnm PNM 3508x2481 3508x2481+0+0 DirectClass 8-bit 24.9005mb 0.640u 0:02
i.e. the image was 8-bit before unpaper saw it - and was imported as such from the PDF.
That some PDFs are imported inverted is a bug in poppler and xpdf
https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/134313
Ok, I did a bit more research: The image was imported as 1-bit:
$ identify VNq1NeRMn8.pnm
VNq1NeRMn8.pnm PNM 3508x2481 3508x2481+0+0 PseudoClass 2c 1-bit 1.03872mb
Then, inverting turned it into 8-bit:
$ identify T1kkHNPLo6.pnm
T1kkHNPLo6.pnm PNM 3508x2481 3508x2481+0+0 DirectClass 8-bit 24.9005mb 0.770u 0:02
This would explain why unpaper also outputs it as 8-bit. But it does not explain why
* exporting the inverted image as DjVu uses cjb2, but exporting the unpapered image uses c44
* my updated patch works: Checking for depth with Image::Magick returns 1 for the inverted image
So maybe at this point the real problem is that inverting doesn't preserve the image depth? If I understand the negate function correctly, it should ensure that the image depth is preserved, so I have no clue why it doesn't work. I attach my pdf test file, so maybe you can find out something relevant.
PDF test file
I get the same warning from pdfimages (v3.00) that the PDF is corrupt, but here, it is not inverted.
Inverting it, I still get a 1-bit image.
Hm, that's strange. For me, negating the image turns it into 8-bit. I used this small test script on the image extracted by pdfimages (3.00):
use Image::Magick;
my $image = Image::Magick->new;
$image->Read("x-000.pbm");
$image->Negate;
$image->Write(depth => 1, filename => "test.pnm");
The images are identified as follows:
$ identify x-000.pbm
x-000.pbm PNM 3508x2481 3508x2481+0+0 PseudoClass 2c 1-bit 1.03872mb 0.590u 0:02
$ identify test.pnm
test.pnm PNM 3508x2481 3508x2481+0+0 DirectClass 8-bit 24.9005mb 1.360u 0:02
Maybe one should explicitly set the file name extension to ".pbm" for b/w images? ImageMagick doesn't seem to respect the depth when writing .pnm files.
I've just tried this on a stock Intrepid machine. The PDF is imported inverted (pdfimages 3.00). After negating (ImageMagick 6.3.7) it is still 1 bit. unpaper, with double format and output-pages=2, gives me 1 page 1-bit, and the other 8-bit.
I will investigate.
I can no longer reproduce this. Can you?