Tesseract wrapper script for almost any image

Jon Snell
2006-09-08
2013-04-25
  • Jon Snell
    2006-09-08

    Here is a wrapper script that converts just about any image to an appropriate input format for tesseract, runs tesseract, cleans up, and dumps the output to stdout.

    ImageMagick is required.
    Script follows:

    #!/bin/bash

    tmpid=$$
    # echo TID ${tmpid}
    # input image first, then the operations to apply to it
    convert "$1" -density 150x150 -resize 200% -compress none -monochrome /tmp/img.${tmpid}.tif

    # cd into the proper directory, since that's where the data dir will be
    whichfile=`which "$0"`
    tdir=`dirname "$whichfile"`
    pushd "$tdir" > /dev/null    # keep pushd's directory listing out of the OCR output

    ./tesseract /tmp/img.${tmpid}.tif /tmp/tout.${tmpid} batch  > /dev/null 2>&1
    cat /tmp/tout.${tmpid}.txt | perl -e 'while(<>) { $_ =~ s/\|//g; $_ =~ s/\^~R//g; print $_  } '
    rm /tmp/tout.${tmpid}.*
    rm /tmp/img.${tmpid}.tif
    popd > /dev/null
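
    A note on the temporary files: the script builds their names from $$ under /tmp and removes them one by one at the end, so they are left behind if it is interrupted partway. A minimal sketch of an alternative, assuming mktemp -d is available (it is on most Linux and BSD systems); with_scratch_dir is a made-up helper name, not part of the script above:

    ```shell
    # Run a command with a private scratch directory, then remove the directory.
    # The directory path is passed to the command as its last argument.
    with_scratch_dir() {
        scratch=$(mktemp -d) || return 1
        "$@" "$scratch"
        status=$?
        rm -rf "$scratch"
        return "$status"
    }
    ```

    It could be called as with_scratch_dir ocr_image photo.jpg, where ocr_image is a hypothetical function that writes its .tif and .txt files into the directory it receives as its last argument.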

     
    • John Boxall
      2006-09-13

      'Tis a sweet script indeed.

      Howwwwwwwwwwwwwever,

      The convert command leaves something to be desired. I'm working with JPG images taken straight from a camera; they are slightly blurry and have many imperfections. Tesseract is able to parse about 50% of the words, but I'm trying to improve on that.

      So far, commands like
      convert test.jpg -contrast-stretch 15% test.jpg
      convert test.jpg -fill white -tint 70 test.jpg
      convert test.jpg -level 20%,80%,1.0 test.jpg
      convert test.jpg -sigmoidal-contrast 10,50% test.jpg
      convert test.jpg -sharpen 0x2 test.jpg

      followed by conversion to TIFF have been working for me with limited success.
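
      Each of these commands reads and rewrites test.jpg, so the image goes through JPEG compression again at every step. A sketch that applies the same operations in a single convert call and writes the TIFF directly; enhance_for_ocr is a made-up name, and the values are simply the ones listed above, not tuned settings:

      ```shell
      # Apply all enhancement steps in one pass: $1 is the source image,
      # $2 is the uncompressed TIFF to write for tesseract.
      enhance_for_ocr() {
          src=$1
          dst=$2
          convert "$src" \
              -contrast-stretch 15% \
              -fill white -tint 70 \
              -level 20%,80%,1.0 \
              -sigmoidal-contrast 10,50% \
              -sharpen 0x2 \
              -compress none "$dst"
      }
      ```

      For example, enhance_for_ocr test.jpg test.tif leaves the original JPEG untouched, which makes it easier to compare different settings against the same source.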

      Does anyone know a set of convert commands that will sharpen text taken from a camera to the point where it can be successfully read by Tesseract?

      Thanks,

      John

       
    • Jon Snell
      2006-09-29

      John's changes (which help a ton) added into the wrapper:

      #!/bin/bash

      # check our args
      ARGNO=1         # Number of arguments the script expects.
      E_WRONGARGS=65  # It's nice to enumerate the error codes

      if [ $# -ne $ARGNO ]
      then
        echo "Usage: `basename $0` [arg]" >&2  # Error message >stderr.
        exit $E_WRONGARGS
      fi

      if ! [ -f "$1" ]
      then
          echo "File not found: $1" >&2
          exit 1
      fi

      # now process and convert our image a bit
      tmpid=$$
      # echo TID ${tmpid}
      # keep the input's extension (the text after the last dot) on the temp copy
      fname=`basename "$1"`
      TMPNAME=/tmp/tout.${tmpid}.${fname##*.}
      cp "$1" "$TMPNAME"
      # mogrify ${TMPNAME} -contrast-stretch 15% ${TMPNAME}
      convert "${TMPNAME}"  -density 150x150 -resize 200% -fill white -tint 50  -level 20%,80%,1.0 -sigmoidal-contrast 30,50%  -sharpen 0x2  -compress none -monochrome  /tmp/img.${tmpid}.tif

      # set up our environ for tesseract, which is very picky
      # cd into the proper directory, since that's where the data dir will be
      whichfile=`which $0`
      tdir=`dirname $whichfile`
      pushd $tdir > /dev/null

      ./tesseract /tmp/img.${tmpid}.tif /tmp/tout.${tmpid} batch  > /dev/null 2>&1
      cat /tmp/tout.${tmpid}.txt | perl -e 'while(<>) { chomp(); $_ =~ s/\|//g; $_ =~ s/\^~R//g; if ($_ =~ /\w/) { print $_."\n" ; }} '
      rm /tmp/tout.${tmpid}.*
      rm /tmp/img.${tmpid}.tif
      popd > /dev/null
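
      For anyone who does not read perl: the filter above deletes every "|" and every literal "^~R" sequence, then prints only the lines that contain at least one word character. Roughly the same thing with sed and grep, as a sketch (clean_ocr_text is a made-up name, and [[:alnum:]_] is only an approximation of perl's \w):

      ```shell
      # Read OCR text on stdin, strip "|" and "^~R", and keep only lines
      # that contain at least one letter, digit, or underscore.
      clean_ocr_text() {
          sed -e 's/|//g' -e 's/\^~R//g' | grep '[[:alnum:]_]'
      }
      ```

      It would be used as clean_ocr_text < /tmp/tout.${tmpid}.txt. Note that grep returns a non-zero status when nothing is left to print, which matters if the caller checks exit codes.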

       
    • teolemon
      2006-10-01

      How can I install this script on Linux?

       
    • ith
      2006-10-28

      This one converts one or many files.
      However, I did not implement the density options and the like ...

      ****
      #!/bin/bash

      DELT=TRUE
      DELG=TRUE
      TARGETDIR=/tmp/
      OUTPUT=/tmp/

      if [[ -z "$1"  ]]
              then
              echo "Usage: $0 [-s] [-t TIF-DIR] [-o TEXT-DIR] [-f FILE] files_to_OCR
              scans all files with tesseract and outputs the text on STDOUT

              OPTIONS:
                      -t TIF-DIR      saves converted TIF images in target-directory TIF-DIR
                      -o TEXT-DIR     saves files created by tesseract in TEXT-DIR
                      -f FILE         saves all text output in file FILE
                      -s              suppress output of text on STDOUT (only useful with -t or -o;
                                      do not use with -f)

              Examples:
              $0 file1 file2 ...
                      scans the files and displays all the text.

              When STDOUT is used for the scanned text, progress and error
              messages are redirected to tes.error.log. With -f, the text goes
              to FILE and STDOUT shows progress and error messages (no log file).
              Scanning takes time... be patient :)
      "

              exit
      fi
      # redirect errors to the log file and fd 5 to stdout
      #       (both are changed by -f or -s)
      exec 2>tes.error.log
      exec 5>&1

      # every option handled in the case below must be listed here, including -s
      while getopts ":t:o:f:s" Option
              do
              case $Option in
                      t )     # save converted tif files in directory
                      if [[ -d "$OPTARG" ]]
                      then
                              TARGETDIR="$OPTARG"/
                              DELG=FALSE
                      else
                              echo "-t: $OPTARG  : no such folder"
                              exit
                      fi
                      ;;
                      o )     # save tesseract files in directory
                       if [[ -d "$OPTARG" ]]
                       then
                              OUTPUT="$OPTARG"/
                              DELT=FALSE
                      else
                              echo "-o: $OPTARG  : no such folder"
                              exit
                      fi
                      ;;
                      f )     # no text on STDOUT; save it in a file
                              FILE="$OPTARG"
                              exec 5>"$FILE"
                              exec 2>&1
                      ;;
                      s )     # suppress text on STDOUT; show messages
                              exec 2>&1
                              exec 5>/dev/null
                      ;;
                      ;;
              esac
      done

      echo "TARGET-DIR=$TARGETDIR" >&2
      echo "OUTPUT-DIR=$OUTPUT" >&2

      shift $(($OPTIND - 1))

      for i in "$@"
              do
              if [[ -f "$i" ]]
              then
              if !  identify "$i" >/dev/null 2>&1    # check if graphics file can be converted
              then
                      echo "$i: not convertible" >&2
              else
                      echo "processing: $i ..." >&2
                      # converting the graphic file
                      I_FILE="${i##*/}"       # file name without its directory
                      NEWTIF="$TARGETDIR""${I_FILE%\.*}".tif
                      convert "$i" "$NEWTIF"

                      # scanning the newly created tif
                      TEXT="$OUTPUT""${I_FILE%\.*}"
                      tesseract "$NEWTIF" "$TEXT" 1>&2

                      # output scanned text on fd 5 (STDOUT unless -f or -s changed it)
                      #  fd #5 is set in getopts
                      cat "$TEXT".txt >&5

                      #  delete graphic file after use
                      if [[ $DELG == "TRUE" ]]
                              then
                              rm "$NEWTIF"
                      fi

                      # delete tesseract output
                      if [[ $DELT == "TRUE" ]]
                              then
                              rm "$TEXT".map "$TEXT".raw "$TEXT".txt
                      fi

                      # concatenate all text outputs
                      # (CAT and TES are never set in this script, so this branch never runs)
                      if [[ $CAT == "TRUE" ]]
                              then
                              cat  "$TEXT".txt >> "$TES"
                      fi

              fi
              fi
      done
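
      A note on getopts: only letters named in the option string are recognized, and with a leading colon an unknown option sets the variable to ? while a missing argument sets it to :, with the offending letter in OPTARG. A minimal sketch of that pattern, using hypothetical variable names rather than the ones in the script above:

      ```shell
      # Parse -t DIR, -o DIR, -f FILE and -s from the arguments passed in.
      # Results are left in tif_dir, text_dir, out_file, quiet and n_options.
      parse_options() {
          OPTIND=1
          tif_dir=/tmp/
          text_dir=/tmp/
          out_file=
          quiet=FALSE
          while getopts ":t:o:f:s" opt; do
              case $opt in
                  t) tif_dir=$OPTARG/ ;;
                  o) text_dir=$OPTARG/ ;;
                  f) out_file=$OPTARG ;;
                  s) quiet=TRUE ;;
                  :) echo "option -$OPTARG needs an argument" >&2; return 2 ;;
                  \?) echo "unknown option -$OPTARG" >&2; return 2 ;;
              esac
          done
          n_options=$((OPTIND - 1))   # number of arguments consumed as options
      }
      ```

      A caller would run parse_options "$@" followed by shift "$n_options", after which the remaining arguments are the files to scan.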

       
    • ith
      2006-10-29

      Updated script for tesseract.
      It converts and scans one or more files.
      ***
      #!/bin/bash

      DELT=TRUE
      DELG=TRUE
      TARGETDIR=/tmp/
      OUTPUT=/tmp/

      if [[ -z "$1"  ]]
              then
              echo "Usage: $0 [-s] [-c] [-t TIF-DIR] [-o TESS-DIR] [-f FILE] files_to_OCR
              scans all files with tesseract and outputs the text on STDOUT

              OPTIONS:
                      -t TIF-DIR      saves converted TIF images in target-directory TIF-DIR
                      -o TESS-DIR     saves files created by tesseract in TESS-DIR
                      -f FILE         saves all text output in file FILE
                      -s              suppress output of text on STDOUT (only useful with -t or -o;
                                      do not use with -f)
                      -c              convert with fill white, resize, sigmoidal-contrast, etc.

              Examples:
              $0 file1 file2 ...
                      scans the files and displays all the text.

              When STDOUT is used for the scanned text, progress and error
              messages are redirected to tes.error.log. With -f, the text goes
              to FILE and STDOUT shows progress and error messages (no log file).
              Scanning takes time... be patient :)
      "

              exit
      fi
      # redirect errors to the log file and fd 5 to stdout
      #       (both are changed by -f or -s)
      exec 2>tes.error.log
      exec 5>&1

      # every option handled in the case below must be listed here, including -s
      while getopts ":t:o:f:sc" Option
              do
              case $Option in
                      t )     # save converted tif files in directory
                      if [[ -d "$OPTARG" ]]
                      then
                              TARGETDIR="$OPTARG"/
                              DELG=FALSE
                      else
                              echo "-t: $OPTARG  : no such folder"
                              exit
                      fi
                      ;;
                      o )     # save tesseract files in directory
                       if [[ -d "$OPTARG" ]]
                       then
                              OUTPUT="$OPTARG"/
                              DELT=FALSE
                      else
                              echo "-o: $OPTARG  : no such folder"
                              exit
                      fi
                      ;;
                      f )     # no text on STDOUT; save it in a file
                              FILE="$OPTARG"
                              exec 5>"$FILE"
                              exec 2>&1
                      ;;
                      s )     # suppress text on STDOUT; show messages
                              exec 2>&1
                              exec 5>/dev/null
                      ;;
                      ;;
                      c )     # convert picture extensively
                              CONV=TRUE
                      ;;
              esac
      done

      echo "TIF-output directory=$TARGETDIR" >&2
      echo "tesseract output directory=$OUTPUT" >&2

      shift $(($OPTIND - 1))

      for i in "$@"
              do
              I_FILE=${i##*/}         # just the filename w/o directory
              if [[ ! -f "$i" ]]
              then
                      echo "$i: file not found" >&2
              else
              if !  identify "$i" 2>/dev/null 1>&2    # check if graphics file can be converted
              then
                      echo "$i: not convertible" >&2
              else
                      echo "processing: $i ..." >&2
                      # converting the graphic file
                      NEWTIF="$TARGETDIR""${I_FILE%\.*}".tif
                      #convert "$i" "$NEWTIF"
                      if [[ $CONV == "TRUE" ]]
                      then
                              convert "$i" -density 150x150 -resize 200% -fill white -tint 50 -level 20%,80%,1.0 -sigmoidal-contrast 30,50% -sharpen 0x2 -compress none -monochrome "$NEWTIF" 1>&2
                      else
                              convert "$i" -density 150x150 -compress none  "$NEWTIF" 1>&2
                      fi

                      # scanning the newly created tif
                      T_FILE="$OUTPUT""${I_FILE%\.*}"
                      tesseract "$NEWTIF" "$T_FILE" 1>&2

                      # output scanned text on fd 5 (STDOUT unless -f or -s changed it)
                      #  fd #5 is set in getopts
                      cat "$T_FILE".txt >&5

                      #  delete graphic file after use
                      if [[ $DELG == "TRUE" ]]
                              then
                              rm "$NEWTIF"
                      fi

                      # delete tesseract output
                      if [[ $DELT == "TRUE" ]]
                              then
                              rm "$T_FILE".map "$T_FILE".raw "$T_FILE".txt
                      fi

                      # concatenate all text outputs
                      # (CAT and TES are never set in this script, so this branch never runs)
                      if [[ $CAT == "TRUE" ]]
                              then
                              cat  "$T_FILE".txt >> "$TES"
                      fi

              fi

              fi
      done
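
      The script keeps the recognized text apart from progress messages by writing the text to file descriptor 5 and pointing that descriptor at stdout, a file, or /dev/null. A stripped-down sketch of the idea, with a placeholder line standing in for the real convert and tesseract calls (process_page is a made-up name):

      ```shell
      # Progress goes to stderr, recognized text goes to fd 5.
      process_page() {
          echo "processing: $1 ..." >&2
          printf 'text of %s\n' "$1" >&5    # placeholder for: cat "$T_FILE".txt >&5
      }

      exec 5>&1                  # default: fd 5 is a copy of stdout
      # exec 5>all_text.txt      # what -f does: collect the text in a file
      # exec 5>/dev/null         # what -s does: discard the text
      ```

      Because fd 5 is copied from stdout at the moment exec runs, a later redirection of stdout alone does not move the text; a single call can still be redirected with process_page page1.jpg 5>page1.txt.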

       
    • ith
      2006-10-29

      The scripts above are buggy; do not use them.
      I will try to create a final, usable version.

       
    • ith
      2006-10-30