Glitches when saving PDF in LZW mode
I found a problem when saving scans (which were scanned correctly!) as PDF using LZW compression.
ZIP and AUTOMATIC appear to work without problems, but when using LZW, which I suppose is the standard for lossless compression in PDFs, the created PDFs have glitches on the first page; see http://i.imgur.com/hX3Y9EH.png .
Please attach an image that, when imported and saved as PDF with LZW compression, reproduces the problem. Please also post a log file created with the --log=log option, as I suspect that there is a combination of factors at play.
I will investigate this issue later with non-confidential original images which show the problem, and add log output as requested.
Same here (unfortunately I don't have any non-confidential scans at hand either). Version is 1.3.4. I have the impression that this happened more frequently in previous versions.
I have managed to "rescue" these scans by re-importing the PDF files, then scanning the faulty pages again (though this is most likely just taking advantage of the sporadic nature of the issue).
It seems to me that gscan2pdf mixes up the encoding in such a way that a sequence of pixels is mapped to a code that decodes to a shorter sequence. As a result, pixels are missing in the output and the rest of the image is shifted.
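To make that hypothesis concrete, here is a minimal textbook LZW sketch in Python. It is only an illustration, not the variant used inside TIFF or PDF (which adds variable-width codes and clear/end codes), and it is not a diagnosis of where the bug is. The function names are made up for this example. It shows that a correct round trip is lossless, and that swapping one code for a code with a shorter table entry makes the decoded output come out shorter and different from that point on.

```python
def lzw_encode(data: bytes) -> list[int]:
    """Textbook LZW: emit one integer code per longest known prefix."""
    table = {bytes([i]): i for i in range(256)}
    codes = []
    current = b""
    for value in data:
        candidate = current + bytes([value])
        if candidate in table:
            current = candidate
        else:
            codes.append(table[current])
            table[candidate] = len(table)  # learn the new sequence
            current = bytes([value])
    if current:
        codes.append(table[current])
    return codes


def lzw_decode(codes: list[int]) -> bytes:
    """Rebuild the same table on the fly and expand each code."""
    if not codes:
        return b""
    table = {i: bytes([i]) for i in range(256)}
    previous = table[codes[0]]
    out = bytearray(previous)
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == len(table):
            # Code refers to the entry that is about to be created.
            entry = previous + previous[:1]
        else:
            raise ValueError(f"invalid LZW code {code}")
        out += entry
        table[len(table)] = previous + entry[:1]
        previous = entry
    return bytes(out)


if __name__ == "__main__":
    data = b"ABABABAB"
    codes = lzw_encode(data)
    print("round trip ok:", lzw_decode(codes) == data)

    # Hypothetical damage: replace the code for "AB" with the code for "A".
    damaged = list(codes)
    damaged[2] = ord("A")
    broken = lzw_decode(damaged)
    print("original length:", len(data), "damaged length:", len(broken))
```

Because the decoder rebuilds its table from the codes it reads, one wrong code also corrupts later table entries, so the damage is not limited to a single missing run of pixels.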
If I come across this issue in a non-confidential scan, I will post the data here.
Amendment: Just realized that the default mode is "automatic", which I've never changed, but had the issue described above nonetheless. As far as I can remember, all scans were lineart (1 bit per pixel).
Automatic defaults to LZW for 1bit images. If you use PNG compression for PDFs with 1bit images, the problem should go away. The bug in LZW compression is not in gscan2pdf itself, but in the PDF::API2 module. However, as I cannot reproduce the problem, I cannot report it upstream.
Last edit: Jeffrey Ratcliffe 2016-06-06
(I reported this bug.) I do not understand how a lossless compression (LZW) can introduce such errors; perhaps someone can explain.
Here is a file that, when opened with gscan2pdf and saved as PDF using LZW compression (or "Automatic"), causes the same glitch. In this case the glitch is so severe that the resulting PDF document is completely useless.
gscan2pdf version: 1.5.2
Other compression settings work fine (PNG, ZIP and Packbits tested). ZIP seems like the better option, and the file is even 6 bytes smaller than the distorted LZW one. Maybe that should be the default?
Assuming that you can reproduce the error, will you report it upstream? I have no idea about these things, just had a lot of trouble yesterday with this image that I scanned, until I finally came across this thread.
Here's the distorted PDF in case anyone wants it.
I don't think that the problem has anything to do with whether the compression is lossless or lossy. LZW compression is implemented via TIFF, i.e. the images are first converted to TIFF with LZW compression and then embedded in the PDF. TIFF stores images in strips, if I am not mistaken, and I assume that the bug is in the way PDF::API2 writes the TIFF object.
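For anyone who wants to check how a given TIFF is laid out, the strip structure can be read straight from the file header. The following is a rough Python sketch, assuming a classic (non-BigTIFF) file and looking only at the first image directory; the tag numbers come from the TIFF 6.0 specification, while `describe_strips` and the helper names are invented for this example.

```python
import struct
import sys

TAG_COMPRESSION = 259     # 1 = uncompressed, 5 = LZW (TIFF 6.0)
TAG_STRIP_OFFSETS = 273   # one offset per strip
TAG_ROWS_PER_STRIP = 278


def read_first_ifd(data: bytes):
    """Return (byte_order, {tag: (field_type, count, raw_value_field)})."""
    order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if order is None:
        raise ValueError("not a TIFF file")
    magic, ifd_offset = struct.unpack(order + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF (BigTIFF is not handled)")
    (entry_count,) = struct.unpack_from(order + "H", data, ifd_offset)
    entries = {}
    for i in range(entry_count):
        start = ifd_offset + 2 + 12 * i
        tag, field_type, count = struct.unpack_from(order + "HHI", data, start)
        entries[tag] = (field_type, count, data[start + 8:start + 12])
    return order, entries


def inline_value(order: str, field_type: int, raw: bytes) -> int:
    """Decode a single SHORT (3) or LONG (4) stored in the value field."""
    if field_type == 3:
        return struct.unpack(order + "H", raw[:2])[0]
    if field_type == 4:
        return struct.unpack(order + "I", raw)[0]
    raise ValueError(f"unexpected field type {field_type}")


def describe_strips(data: bytes) -> dict:
    """Summarise compression and strip layout of the first image."""
    order, entries = read_first_ifd(data)
    summary = {"compression": 1, "strip_count": 0, "rows_per_strip": None}
    if TAG_COMPRESSION in entries:
        field_type, _, raw = entries[TAG_COMPRESSION]
        summary["compression"] = inline_value(order, field_type, raw)
    if TAG_STRIP_OFFSETS in entries:
        # The entry's count is the number of strips in the image.
        summary["strip_count"] = entries[TAG_STRIP_OFFSETS][1]
    if TAG_ROWS_PER_STRIP in entries:
        field_type, _, raw = entries[TAG_ROWS_PER_STRIP]
        summary["rows_per_strip"] = inline_value(order, field_type, raw)
    return summary


if __name__ == "__main__":
    for path in sys.argv[1:]:
        with open(path, "rb") as handle:
            print(path, describe_strips(handle.read()))
```

Running it on the imported file and on the LZW copy would show whether the two differ in strip count or rows per strip, which might help narrow down when the glitch appears.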
Thanks for providing the TIFF, Sebastian.
Unfortunately, I cannot reproduce the problem on my machine, which tells me that the bug is either in libtiff, or PDF::API2.
Please start gscan2pdf from the command line:
import page3.tif, save it as PDF with LZW compression, quit, and post the log file.
Thanks for your reply. I have attached the log file. Looking at the log file, gscan2pdf appears to use PDF::API2 version 2.020. As far as I can see in Synaptic, my libtiff (libtiff5) is version "4.0.3-7ubuntu0.4". I can confirm that I still get the glitch. I open the TIFF, and everything looks fine, but the PDF looks torn when using LZW.
Looking at the log file, you can see the following lines which are written whilst the PDF is being saved. The temporary filenames will be different every time, but the process will stay the same. Please post the two files that appear on the first line. In your example, a5XBbdzCfQ.tif should be byte-identical to the page3.tif that was imported. K57w3auZI8.tif, in theory, should be the same, but LZW-compressed. However, according to imagemagick identify, page3.tif was compressed with LZW, so all three should be identical.
INFO - tiffcp -c lzw /tmp/gscan2pdf-oSHh/a5XBbdzCfQ.tif /tmp/gscan2pdf-oSHh/K57w3auZI8.tif
INFO - Defining page at 593.28pt x 824.76pt
INFO - Added /tmp/gscan2pdf-oSHh/K57w3auZI8.tif at 200 PPI
INFO - Closing PDF
DEBUG - Finished saving /home/sebastian/PDF/new_test.pdf
What do the identify and file tools return for the 3 TIFFs? You'll have to check whilst gscan2pdf is running, as it deletes the temporary files before quitting.
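In addition to identify and file, a checksum comparison answers the "byte-identical" question directly. Below is a small Python sketch, assuming the three TIFF paths are passed on the command line while gscan2pdf is still running; the function names are made up for this example.

```python
import hashlib
import sys


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large scans do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_identical(paths: list[str]) -> list[list[str]]:
    """Group paths whose contents are byte-identical (same SHA-256)."""
    groups: dict[str, list[str]] = {}
    for path in paths:
        groups.setdefault(sha256_of(path), []).append(path)
    return list(groups.values())


if __name__ == "__main__":
    # Each output line lists files with identical contents.
    for group in group_identical(sys.argv[1:]):
        print(" == ".join(group))
```

If all three files end up on one output line, they are identical; if the LZW copy is on its own line, tiffcp rewrote the data even though the input was already LZW-compressed.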
I see only one TIFF file in the temp folder. It is created when "page 3.tif" is loaded and they are identical. The log lists another TIFF, as you said, but it's not in the temp folder. Perhaps it's deleted immediately after the PDF is created? I suppose it doesn't need it anymore, whereas the first TIFF seems to be a working copy of the original.
Excerpt from the log:
DEBUG - save filename dialog returned ok
DEBUG - Started saving /home/sebastian/PDF/new_test3.pdf
INFO - Using /usr/share/fonts/truetype/msttcorefonts/Times_New_Roman.ttf for non-ASCII text
INFO - tiffcp -c lzw /tmp/gscan2pdf-tN_O/xlR3SMg2GK.tif /tmp/gscan2pdf-tN_O/DbQCROeQiq.tif
INFO - Defining page at 593.28pt x 824.76pt
INFO - Added /tmp/gscan2pdf-tN_O/DbQCROeQiq.tif at 200 PPI
INFO - Closing PDF
DEBUG - Finished saving /home/sebastian/PDF/new_test3.pdf
The file DbQCROeQiq.tif is nowhere to be found.
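If the second TIFF really is deleted right after the PDF is written, one way to catch it is to poll the temporary directory and copy anything new while the save is running. Here is a rough Python sketch, assuming the directory name is taken from the current log (it changes on every run) and that polling is fast enough to win the race, which is not guaranteed. `snapshot_new_files` and `watch` are made-up names.

```python
import shutil
import sys
import time
from pathlib import Path


def snapshot_new_files(watch_dir, keep_dir, seen, pattern="*.tif"):
    """Copy new or changed files matching pattern; return the copies made.

    `seen` maps file name -> (size, mtime_ns) of the last copy, so a file
    that was still being written is copied again once it changes.
    """
    keep = Path(keep_dir)
    keep.mkdir(parents=True, exist_ok=True)
    copied = []
    for source in sorted(Path(watch_dir).glob(pattern)):
        try:
            info = source.stat()
            stamp = (info.st_size, info.st_mtime_ns)
            if seen.get(source.name) == stamp:
                continue
            target = keep / source.name
            shutil.copy2(source, target)
        except FileNotFoundError:
            continue  # deleted between listing and copying
        seen[source.name] = stamp
        copied.append(target)
    return copied


def watch(watch_dir, keep_dir, interval=0.05):
    """Poll until interrupted with Ctrl+C, reporting each copy."""
    seen = {}
    while True:
        for target in snapshot_new_files(watch_dir, keep_dir, seen):
            print("saved", target)
        time.sleep(interval)


if __name__ == "__main__":
    # Usage: python script.py <temp dir from the log> <folder for copies>
    watch(sys.argv[1], sys.argv[2])
```

A less racy alternative may be to copy the first TIFF out of the temp folder and rerun the tiffcp command from the log on it by hand, which should produce the same second file.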
Complete log attached.
THANK YOU!
The bug is definitely in PDF::API2, as expected. The following Perl code reproduces it:
Now I've got something I can report to the author of PDF::API2, and this has finally got a chance of getting fixed.
Note that running the above code, apart from creating the distorted PDF, also produces the following error messages (several times):
These two patches (for PDF::API2) fix things for me.