#201 Glitches when saving PDF in LZW mode

Milestone: v1.0_(example)
Status: open
Owner: nobody
Labels: None
Priority: 5
Updated: 2016-10-04
Created: 2015-07-10
Creator: Wikinaut
Private: No

I found a problem when saving (correctly!! scanned) scans as PDF when using LZW.

ZIP and AUTOMATIC appear to work without problems, but when using LZW - which I suppose is the standard for lossless compression in PDFs - the created PDFs have glitches on the first page; see http://i.imgur.com/hX3Y9EH.png (glitch).

Discussion

  • Jeffrey Ratcliffe

    Please attach an image that reproduces the problem when imported and saved as PDF with LZW compression. Please also post a log file created with the --log=log option, as I suspect that there is a combination of factors at play.

     
  • Wikinaut

    Wikinaut - 2015-07-13

    I will investigate this issue later with non-confidential original images which show the problem, and add log output as requested.

     
  • Michael von Glasow

    Same here (unfortunately I don't have any non-confidential scans at hand either). Version is 1.3.4. I have the impression that this happened more frequently in previous versions.

    I have managed to "rescue" these scans by re-importing the PDF files, then scanning the faulty pages again (though this is most likely just taking advantage of the sporadic nature of the issue).

    It seems to me that gscan2pdf mixes up the encoding in such a way that a sequence of pixels is mapped to a code that decodes to a shorter sequence. As a result, pixels are missing in the output and the rest of the image is shifted.

    If I come across this issue in a non-confidential scan, I will post the data here.

     
  • Michael von Glasow

    Amendment: Just realized that the default mode is "automatic", which I've never changed, but had the issue described above nonetheless. As far as I can remember, all scans were lineart (1 bit per pixel).

     
  • Jeffrey Ratcliffe

    Automatic defaults to LZW for 1bit images. If you use PNG compression for PDFs with 1bit images, the problem should go away. The bug in LZW compression is not in gscan2pdf itself but in the PDF::API2 module. However, as I cannot reproduce the problem, I cannot report it upstream.

     

    Last edit: Jeffrey Ratcliffe 2016-06-06
    • Wikinaut

      Wikinaut - 2016-06-06

      (I reported this bug) I do not understand how a lossless compression (LZW) can introduce such errors, perhaps someone can explain.

       
    • Sebastian Hatt

      Sebastian Hatt - 2016-09-07

      Here is a file that, when opened with gscan2pdf and saved as PDF using LZW compression (or "Automatic"), causes the same glitch. In this case the glitch is so severe that the resulting PDF document is completely useless.

      gscan2pdf version: 1.5.2
      Other compression settings work fine (PNG, ZIP and Packbits tested. ZIP seems like the better option and the file is even 6 bytes smaller than the distorted LZW one. Maybe that should be the default?)

      Assuming that you can reproduce the error, will you report it upstream? I have no idea about these things, just had a lot of trouble yesterday with this image that I scanned, until I finally came across this thread.

       
      • Sebastian Hatt

        Sebastian Hatt - 2016-09-07

        Here's the distorted PDF in case anyone wants it.

         
  • Jeffrey Ratcliffe

    I don't think that the problem has anything to do with whether the compression is lossless or lossy. LZW compression is implemented via TIFF - i.e. the images are first converted to TIFF with LZW compression and then embedded in the PDF. TIFF stores images in strips, if I am not mistaken, and I assume that the bug is in the way PDF::API2 writes the TIFF object.

     
  • Jeffrey Ratcliffe

    Thanks for providing the TIFF, Sebastian.

    Unfortunately, I cannot reproduce the problem on my machine, which tells me that the bug is either in libtiff, or PDF::API2.

    Please start gscan2pdf from the command line:

    gscan2pdf --log=log
    

    import page3.tif, save it as PDF with LZW compression, quit, and post the log file.

     
  • Sebastian Hatt

    Sebastian Hatt - 2016-09-27

    Thanks for your reply. I have attached the log file. Looking at the log file, gscan2pdf appears to use PDF::API2 version 2.020. As far as I can see in Synaptic, my libtiff (libtiff5) is version "4.0.3-7ubuntu0.4". I can confirm that I still get the glitch. I open the TIFF, and everything looks fine, but the PDF looks torn when using LZW.

     
  • Jeffrey Ratcliffe

    Looking at the log file, you can see the following lines, which are written whilst the PDF is being saved. The temporary filenames will be different every time, but the process will stay the same. Please post the two files that appear on the first line. In your example, a5XBbdzCfQ.tif should be byte-identical to the page3.tif that was imported. K57w3auZI8.tif, in theory, should be the same, but LZW-compressed. However, according to ImageMagick's identify, page3.tif was compressed with LZW, so all three should be identical.

    INFO - tiffcp -c lzw /tmp/gscan2pdf-oSHh/a5XBbdzCfQ.tif /tmp/gscan2pdf-oSHh/K57w3auZI8.tif
    INFO - Defining page at 593.28pt x 824.76pt
    INFO - Added /tmp/gscan2pdf-oSHh/K57w3auZI8.tif at 200 PPI
    INFO - Closing PDF
    DEBUG - Finished saving /home/sebastian/PDF/new_test.pdf

    What do the identify and file tools return for the 3 TIFFs? You'll have to check whilst gscan2pdf is running, as it deletes the temporary files before quitting.
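One way to run that check, sketched as a small shell helper to use while gscan2pdf is still open. `compare_tiffs` and `describe_tiff` are hypothetical names, not part of gscan2pdf. The sketch assumes `cmp` is available, and it calls `file` and ImageMagick's `identify` only if they are installed.

```shell
#!/bin/sh
# compare_tiffs REF FILE...
# Hypothetical helper (not part of gscan2pdf): for each FILE, say whether
# it is byte-identical to REF, then show what `file` and `identify`
# report about it, if those tools are installed.

describe_tiff() {
    if command -v file >/dev/null 2>&1; then
        file "$1"
    fi
    if command -v identify >/dev/null 2>&1; then
        identify "$1"
    fi
    return 0
}

compare_tiffs() {
    ref=$1
    shift
    describe_tiff "$ref"
    for f in "$@"; do
        # The temporary files may already have been deleted.
        if [ ! -e "$f" ]; then
            echo "$f: missing"
            continue
        fi
        # cmp -s is silent and only sets the exit status.
        if cmp -s "$ref" "$f"; then
            echo "$f: byte-identical to $ref"
        else
            echo "$f: differs from $ref"
        fi
        describe_tiff "$f"
    done
}
```

With the filenames from the log above, the call would be `compare_tiffs page3.tif /tmp/gscan2pdf-oSHh/a5XBbdzCfQ.tif /tmp/gscan2pdf-oSHh/K57w3auZI8.tif`; the temporary names change on every run.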

     
  • Sebastian Hatt

    Sebastian Hatt - 2016-09-28

    I see only one TIFF file in the temp folder. It is created when "page 3.tif" is loaded and they are identical. The log lists another TIFF, as you said, but it's not in the temp folder. Perhaps it's deleted immediately after the PDF is created? I suppose it doesn't need it anymore, whereas the first TIFF seems to be a working copy of the original.

    1. I ran "gscan2pdf --log=log"
    2. I checked the log file and found the temporary folder: "INFO - Using /tmp/gscan2pdf-tN_O for temporary files"
    3. I looked in the folder and found the files "session" and "lockfile"
    4. I loaded "page 3.tif" in gscan2pdf and a new file appeared in the temp folder: "xlR3SMg2GK.tif".
    5. I saved the PDF, but no new files appeared in the temp folder.
    6. I compared xlR3SMg2GK.tif to "page 3.tif" using "cmp -l page\ 3.tif /tmp/gscan2pdf-tN_O/xlR3SMg2GK.tif" and it found no differences.

    Excerpt from the log:

    DEBUG - save filename dialog returned ok
    DEBUG - Started saving /home/sebastian/PDF/new_test3.pdf
    INFO - Using /usr/share/fonts/truetype/msttcorefonts/Times_New_Roman.ttf for non-ASCII text
    INFO - tiffcp -c lzw /tmp/gscan2pdf-tN_O/xlR3SMg2GK.tif /tmp/gscan2pdf-tN_O/DbQCROeQiq.tif
    INFO - Defining page at 593.28pt x 824.76pt
    INFO - Added /tmp/gscan2pdf-tN_O/DbQCROeQiq.tif at 200 PPI
    INFO - Closing PDF
    DEBUG - Finished saving /home/sebastian/PDF/new_test3.pdf

    The file DbQCROeQiq.tif is nowhere to be found.
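Since the second TIFF seems to exist only while the PDF is being written, one way to try to catch it is to poll the temporary folder and copy out any .tif that shows up. This is only a sketch: `keep_tiffs` is a hypothetical helper name, the 0.1 s polling interval is an arbitrary assumption (fractional `sleep` works with GNU coreutils but is not guaranteed everywhere), and a file that lives for less than one interval can still be missed.

```shell
#!/bin/sh
# keep_tiffs SRC DST [ROUNDS]
# Hypothetical helper (not part of gscan2pdf): poll SRC and copy every
# *.tif found there into DST, so the copies survive after gscan2pdf
# deletes its temporary files.
keep_tiffs() {
    src=$1
    dst=$2
    rounds=${3:-600}        # number of polls; 600 x 0.1 s is about a minute
    mkdir -p "$dst" || return 1
    i=0
    while [ "$i" -lt "$rounds" ]; do
        for f in "$src"/*.tif; do
            [ -e "$f" ] || continue    # the glob matched nothing
            # Later polls overwrite earlier copies, so DST ends up
            # holding the last state that was seen for each file.
            cp -p "$f" "$dst"/ 2>/dev/null
        done
        i=$((i + 1))
        sleep 0.1
    done
}
```

Started in a second terminal as `keep_tiffs /tmp/gscan2pdf-tN_O ./kept` just before saving, this should leave copies in `./kept` if the file is caught. Alternatively, the `tiffcp -c lzw` command shown in the log can be run by hand on the imported TIFF to recreate the second file.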

    Complete log attached.

     
  • Jeffrey Ratcliffe

    THANK YOU!

    The bug is definitely in PDF::API2, as expected. The following Perl code reproduces it:

    #!/usr/bin/perl -w
    use strict;
    use PDF::API2;
    use Image::Magick;
    
    my $pdf = PDF::API2->new(-file => 'test.pdf');
    my $tiff = 'page3.tif';
    
    my $image    = Image::Magick->new;
    $image->Read($tiff);
    
    # Get the size and resolution. Resolution is dots per inch, width
    # and height are in inches.
    my $resolution = $image->Get('x-resolution');
    my $w          = $image->Get('width') / $resolution;
    my $h          = $image->Get('height') / $resolution;
    
    my $page = $pdf->page;
    $page->mediabox( $w * 72, $h * 72 );
    
    my $gfx = $page->gfx;
    my $img = $pdf->image_tiff($tiff);
    $gfx->image( $img, 0, 0, $w * 72, $h * 72 );
    
    $pdf->save;
    $pdf->end;
    

    Now I've got something I can report to the author of PDF::API2, and this has finally got a chance of getting fixed.

     
  • Jeffrey Ratcliffe

    Note that running the above code, apart from creating the distorted PDF, also produces the following error messages (several times):

    Use of uninitialized value $_[1] in read at /usr/lib/perl5/5.20.2/x86_64-linux-thread-multi/IO/Handle.pm line 463.
    
     
