Request for Linux users of tesseract

2007-02-27
2013-04-25
  • Filip Gieszczykiewicz

    If you are using Linux, have tesseract installed with TESSDATA_PREFIX set, and have recent NetPBM tools, I would appreciate it if you could do me a favor and tell me whether this works for you:

    http://tesseract-ocr.repairfaq.org/downloads/Pat5237627_blocks.tar.bz2 (~1.4MB)

    Basically, I "blocked" the tesseract patent (split it into 42 blocks), and this script tests your tesseract and netpbm tools, then recognizes all the blocks into a text file as a proof of concept. For a copy of what I got when I ran it (keep in mind that block b37.tif crashed v1.03, which has been posted in Tracker->Bugs :-), see:

    http://tesseract-ocr.repairfaq.org/downloads/Pat5237627.TXT
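For readers without the tarball, the kind of script described above might look roughly like the following. This is only a sketch, not the script from the download: the function names `check_tools` and `ocr_blocks`, the `b1.tif` ... `b42.tif` naming (inferred from `b37.tif`), and the failure marker line are all assumptions.

```shell
#!/bin/sh
# Sketch only: check that the needed tools exist, then OCR every block in order.
# Block names (b1.tif ... bN.tif) and the failure marker are assumptions.

# Print each required command that is not on PATH; return non-zero if any is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# OCR blocks 1..count found in dir and collect the text in one output file.
ocr_blocks() {
    dir=$1; count=$2; out=$3
    : > "$out"                      # start with an empty output file
    i=1
    while [ "$i" -le "$count" ]; do
        # tesseract writes its result to <outputbase>.txt
        if tesseract "$dir/b$i.tif" "$dir/b$i" >/dev/null 2>&1; then
            cat "$dir/b$i.txt" >> "$out"
        else
            echo "[block b$i.tif failed]" >> "$out"
        fi
        i=$((i + 1))
    done
}
```

A run over the 42 blocks in the current directory could then be `check_tools tesseract tifftopnm && ocr_blocks . 42 out.txt`, where `out.txt` is a placeholder name. Recording a failed block and moving on means a crash like the one on b37.tif does not stop the whole run.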

    (When I have had a chance to review dwdiff, I will modify the script to compare THAT output with something that has been transcribed.)

    Cheers,
    Fil

    P.S. Does anyone know of a DECENT tool that will read in a PDF file and spit out an image at a controllable size/scale? pdfimages and pdftoppm are not for OCR... just for printing. I'm trying to stick with something scriptable, so no gimp, etc.

     
    • Roger Luethi

      Roger Luethi - 2007-02-27

      Your spec for a decent tool is a tad on the vague side, but have you tried ghostscript?

      Sample call:

      gs -dNOPAUSE -dBATCH -r600 -sDEVICE=pnggray -sOutputFile="test%d.png" foo.pdf
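Since the blocks in this thread are TIFF files, a scripted variant of the call above might render to TIFF and feed each page to tesseract. This is a rough sketch under assumptions: the function names `pdf_to_tiffs` and `ocr_pages`, the `-NNN.tif` page naming, and the 300 dpi default are made up for illustration; `tiffgray` is Ghostscript's grayscale TIFF device.

```shell
#!/bin/sh
# Sketch only: rasterize a PDF with Ghostscript, then OCR each page.
# Function names, page naming and the 300 dpi default are assumptions.

# Render every page of a PDF to a grayscale TIFF at the given resolution.
pdf_to_tiffs() {
    pdf=$1; prefix=$2; dpi=${3:-300}
    gs -q -dNOPAUSE -dBATCH -r"$dpi" -sDEVICE=tiffgray \
       -sOutputFile="${prefix}-%03d.tif" "$pdf"
}

# OCR each rendered page; tesseract writes <outputbase>.txt next to the image.
ocr_pages() {
    prefix=$1
    for page in "$prefix"-*.tif; do
        [ -e "$page" ] || continue   # glob matched nothing: no pages rendered
        tesseract "$page" "${page%.tif}"
    done
}
```

For example, `pdf_to_tiffs foo.pdf page 600 && ocr_pages page` would render at 600 dpi and leave one `.txt` file per page. The `-r` value is the knob that controls output size/scale.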

       
