
#14 Superfluous convert command and missing resolution units

v1.0 (example)
closed
None
5
2017-01-16
2016-10-29
No

I've noticed that the recent version 0.1.5 runs two convert commands. First, it converts a PDF page to a PPM file, then it converts that PPM to TIF. Why doesn't it convert to TIF directly?

I also noticed that the resolution units are missing:

Image: test.tif
  Format: TIFF (Tagged Image File Format)
  Mime type: image/tiff
  Class: DirectClass
  Geometry: 2479x3500+0+0
  Resolution: 300x300
  Print size: 8.26333x11.6667
  Units: Undefined
  Type: Grayscale
  Endianess: LSB
  Colorspace: Gray
  Depth: 8-bit
  Channel depth:
    gray: 8-bit

With TIF, tesseract guesses correctly that PixelsPerInch (dpi) is meant. That doesn't work with PNG as an intermediate format, though: there, tesseract guesses wrong and produces a 90x120cm PDF file.
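To make the problem concrete, here is a minimal Python sketch that reads the key/value listing quoted above, checks whether the units are defined, and recomputes the print size. The function names (`parse_image_info`, `print_size`) are made up for illustration, and the sketch assumes the listing keeps the simple `Key: value` layout shown here.

```python
import re

# Listing copied from the report above.
SAMPLE = """\
Image: test.tif
  Format: TIFF (Tagged Image File Format)
  Mime type: image/tiff
  Class: DirectClass
  Geometry: 2479x3500+0+0
  Resolution: 300x300
  Print size: 8.26333x11.6667
  Units: Undefined
  Type: Grayscale
"""

def parse_image_info(text):
    """Collect 'Key: value' pairs; lines without a value are skipped."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields

def print_size(fields):
    """Return (width, height) as pixels divided by resolution.

    The result is only in inches if the resolution really is in
    pixels per inch, which is exactly what 'Units: Undefined'
    leaves open.
    """
    width, height = map(int, re.match(r"(\d+)x(\d+)", fields["Geometry"]).groups())
    xres, yres = map(float, fields["Resolution"].split("x"))
    return width / xres, height / yres

fields = parse_image_info(SAMPLE)
if fields.get("Units", "Undefined") == "Undefined":
    print("resolution units are not set; readers have to guess")
print("print size: %.5f x %.5f" % print_size(fields))
```

Read as inches, 2479x3500 pixels at 300x300 gives roughly an A4 page, which is the interpretation a consumer of the file has to guess when no units are stored.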

(And IMHO PNG would be a better intermediate format - not only to save needed temporary disk space, e.g. on small systems like a Raspberry Pi.)

I also opened a ticket for that with tesseract: https://github.com/tesseract-ocr/tesseract/issues/453#issuecomment-257067292

You need to add -units PixelsPerInch to the convert command to set the units correctly.

(On a side-note: The parameter order for convert changed in recent versions. Now, you're supposed to use SETTINGS INPUT OPERATION(S) OUTPUT. So, while it still works, convert -density 300x300 input.ppm output.tif might break one day. See http://stackoverflow.com/questions/26579299/imagemagick-command-line-option-order-and-categories-of-command-line-parameters for more.)
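Both suggestions can be combined in one call. The following Python sketch builds the convert argument list in the SETTINGS INPUT OUTPUT order described above and includes the units setting, which ImageMagick documents as `-units`. The helper names `build_convert_args` and `run_convert` are hypothetical, this is not how the tool itself invokes convert, and running it assumes ImageMagick's `convert` is on the PATH.

```python
import subprocess

def build_convert_args(src, dst, density=300, units="PixelsPerInch"):
    """Return the convert argument list: settings first, then input, then output."""
    return [
        "convert",
        "-density", "%dx%d" % (density, density),  # setting
        "-units", units,                           # setting
        src,                                       # input
        dst,                                       # output
    ]

def run_convert(src, dst):
    """Run convert; raises CalledProcessError if it exits with an error."""
    subprocess.run(build_convert_args(src, dst), check=True)

print(" ".join(build_convert_args("input.ppm", "output.tif")))
```

Calling `run_convert("input.ppm", "output.tif")` would then execute the printed command. Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.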

Discussion

  • Eponymous Archon

    Because of unpaper in the middle, though it's not clear to me why unpaper can't handle png or tiff, since libav does. The developer of unpaper hasn't been working on it for a while, but I've asked him to check in again.

     
  • Tobias Elze

    Tobias Elze - 2017-01-16

    Good points. The seemingly superfluous second call of convert is indeed due to unpaper, which cannot deal with tif, and because the unpaper output was not properly processed by tesseract.

    In version 0.1.6, -units PixelsPerInch has been added to the convert options, and the order of the convert command line arguments has been changed according to the recommendations.

     
  • Tobias Elze

    Tobias Elze - 2017-01-16
    • status: open --> closed
     
