If you are using Linux, have tesseract installed with TESSDATA_PREFIX set, and have recent NetPBM tools, I would appreciate it if you could do me a favor and tell me whether this works for you.
Basically, I "blocked" the tesseract patent (into 42 blocks), and this script tests your tesseract and netpbm tools, then recognizes all the blocks into a text file as a proof of concept. For a copy of what I got when I ran this (keep in mind that block b37.tif crashed v1.03 and has been posted in the Tracker->Bugs :-) see:
(When I have had a chance to review dwdiff, I will modify it to compare THAT output with something that has been transcribed.)
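For anyone who wants to try the idea without the script itself, here is a rough sketch of the "recognize every block into one text file" step. It is not the script mentioned above, only an illustration under assumptions: the function name ocr_blocks, the OCR_CMD override, and the b*.tif naming pattern are all hypothetical, and it assumes the OCR command is called as `tesseract image outputbase` and writes outputbase.txt. A block that makes the OCR command fail (as b37.tif did for me) is reported and skipped rather than stopping the run.

```shell
#!/bin/sh
# Hypothetical sketch: OCR every block image in a directory into one text file.
# Assumptions: blocks are named b*.tif, and the OCR command is invoked as
#   <ocr> <image> <outputbase>   and writes <outputbase>.txt
# OCR_CMD is an illustrative override; it defaults to "tesseract".
ocr_blocks() {
    dir=$1      # directory holding the block images
    out=$2      # combined text file to write
    : > "$out"  # start with an empty output file
    for img in "$dir"/b*.tif; do
        [ -e "$img" ] || continue          # no matching files at all
        base=${img%.tif}
        if "${OCR_CMD:-tesseract}" "$img" "$base" >/dev/null 2>&1; then
            cat "$base.txt" >> "$out"      # append this block's text
        else
            echo "FAILED: $img" >&2        # e.g. a block that crashes the OCR
        fi
    done
}
```

Usage would be something like `ocr_blocks ./blocks patent.txt`, run from wherever the block images live.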
P.S. Does anyone know of a DECENT tool that will read in a PDF file and spit out something at a controllable size/scale? pdfimage and pdftopnm are not for OCR... just for printing. I'm trying to stick with something scriptable - not using gimp, etc.
Your spec for a decent tool is a tad on the vague side, but have you tried ghostscript?
gs -dNOPAUSE -dBATCH -r600 -sDEVICE=pnggray -sOutputFile="test%d.png" foo.pdf
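Since you want something scriptable, the same command can be wrapped so that the resolution is a parameter. This is only a sketch: the function name pdf_to_png, its argument order, and the DRY_RUN switch are made up for illustration, and the gs options are exactly the ones in the command above. The -r option sets the rendering resolution in dpi, which is what controls the output size/scale.

```shell
#!/bin/sh
# Sketch: render each page of a PDF to a grayscale PNG at a chosen resolution.
# The function name, argument order, and DRY_RUN switch are illustrative only;
# the gs options are the same ones used in the one-line command above.
pdf_to_png() {
    pdf=$1
    dpi=${2:-600}        # resolution in dots per inch (default 600)
    prefix=${3:-test}    # output files become <prefix>1.png, <prefix>2.png, ...
    # Build the command as positional parameters so quoting is preserved.
    set -- gs -dNOPAUSE -dBATCH "-r$dpi" -sDEVICE=pnggray \
        "-sOutputFile=${prefix}%d.png" "$pdf"
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"   # print the command instead of running it
    else
        "$@"
    fi
}
```

For example, `pdf_to_png foo.pdf 300 page` would ask gs for 300 dpi pages named page1.png, page2.png, and so on. Setting DRY_RUN=1 first just prints the gs command so you can check it.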