I've noticed that the recent version 0.1.5 runs two convert commands: first it converts a PDF page to a PPM file, then it converts that PPM to TIF. Why doesn't it convert to TIF directly?
I also noticed that the resulting TIFF is missing resolution units:
Image: test.tif
  Format: TIFF (Tagged Image File Format)
  Mime type: image/tiff
  Class: DirectClass
  Geometry: 2479x3500+0+0
  Resolution: 300x300
  Print size: 8.26333x11.6667
  Units: Undefined
  Type: Grayscale
  Endianess: LSB
  Colorspace: Gray
  Depth: 8-bit
  Channel depth:
    gray: 8-bit
For the TIFF, tesseract correctly guesses that PixelsPerInch (dpi) is meant. That does not work with PNG as an intermediate format, though: there, tesseract guesses wrong and produces a 90x120 cm PDF file.
(And IMHO PNG would be a better intermediate format, not least because it saves temporary disk space, e.g. on small systems like a Raspberry Pi.)
I also opened a ticket for that with tesseract: https://github.com/tesseract-ocr/tesseract/issues/453#issuecomment-257067292
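To make the size problem concrete, here is a rough sketch of the arithmetic that links pixel dimensions, density and physical page size. The helper name `print_size_cm` is made up for illustration, and the 70 dpi figure is only an assumed fallback value chosen to show the effect; the thread does not say which density tesseract actually assumed.

```shell
#!/bin/sh
# Hypothetical helper: physical size in centimetres of one image dimension.
# $1 = size in pixels, $2 = density in pixels per inch (1 inch = 2.54 cm).
print_size_cm() {
    awk -v px="$1" -v dpi="$2" 'BEGIN { printf "%.2f\n", px / dpi * 2.54 }'
}

# Geometry taken from the identify output above: 2479x3500 pixels.
# With the intended 300 dpi the page is roughly A4.
echo "300 dpi: $(print_size_cm 2479 300) x $(print_size_cm 3500 300) cm"

# With an assumed low fallback density (70 dpi here, purely illustrative)
# the same pixels turn into a poster-sized page.
echo " 70 dpi: $(print_size_cm 2479 70) x $(print_size_cm 3500 70) cm"
```

At 300 dpi this gives about 21 x 29.6 cm, while the assumed 70 dpi gives about 90 x 127 cm, which is in the same range as the oversized PDF described above.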
You need to add -units PixelsPerInch to the convert command to set the units correctly.
(On a side note: the parameter order for convert changed in recent versions. Now you're supposed to use SETTINGS INPUT OPERATION(S) OUTPUT. So, while it still works, convert -density 300x300 input.ppm output.tif might break one day. See http://stackoverflow.com/questions/26579299/imagemagick-command-line-option-order-and-categories-of-command-line-parameters for more.)
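A minimal sketch of what such a call could look like, assuming the settings-first order quoted above and assuming `-units` is the ImageMagick option name for the resolution unit. The function name `ppm_to_tiff`, the `CONVERT` override and the file names are hypothetical; this is not the actual OCRmyPDF code.

```shell
#!/bin/sh
# Hypothetical wrapper: convert a PPM to a TIFF with explicit density and units.
# $1 = input file, $2 = output file, $3 = density (e.g. 300x300).
# CONVERT can be overridden (e.g. CONVERT=echo) to do a dry run without ImageMagick.
ppm_to_tiff() {
    # Settings first, then the input, then the output.
    "${CONVERT:-convert}" -density "$3" -units PixelsPerInch "$1" "$2"
}

# Dry run in a subshell: prints the arguments instead of running convert.
( CONVERT=echo; ppm_to_tiff page.ppm page.tif 300x300 )
```

With ImageMagick installed, calling `ppm_to_tiff page.ppm page.tif 300x300` without the override and then running `identify -verbose page.tif` should be enough to check whether the Units field is still Undefined.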
Because of unpaper in the middle, though it's not clear to me why unpaper can't handle png or tiff, since libav does. The developer of unpaper hasn't been working on it for a while, but I've asked him to check in again.
Good points. The seemingly superfluous second call of convert is indeed due to unpaper, which cannot deal with TIFF, and because the unpaper output was not properly processed by tesseract.
In version 0.1.6, -units PixelsPerInch has been added to the convert options, and the order of the convert command-line arguments has been changed according to the recommendations.