Tesseract OCR

Tesseract wrapper script for almost any image

  1. 2006-09-08 04:25:07 UTC
    Here is a wrapper script that converts just about any image to an input format suitable for tesseract, runs tesseract, cleans up, and writes the output to stdout.

    ImageMagick is required.
    Script follows:

    #!/bin/bash

    tmpid=$$
    # echo TID ${tmpid}
    convert -compress none -monochrome -resize 200% -density 150x150 "$1" "/tmp/img.${tmpid}.tif"

    # cd into the proper directory, since that's where the data dir will be
    whichfile=$(which "$0")
    tdir=$(dirname "$whichfile")
    pushd "$tdir" > /dev/null   # keep the directory stack off stdout

    ./tesseract "/tmp/img.${tmpid}.tif" "/tmp/tout.${tmpid}" batch > /dev/null 2>&1
    perl -e 'while(<>) { $_ =~ s/\|//g; $_ =~ s/\^~R//g; print $_ } ' "/tmp/tout.${tmpid}.txt"
    rm /tmp/tout.${tmpid}.*
    rm "/tmp/img.${tmpid}.tif"
    popd > /dev/null
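
The script above builds its temporary file names from the PID under /tmp and deletes them in its last lines, so a failure partway through can leave files behind. A minimal sketch of one alternative, assuming `mktemp -d` is available (it is on common Linux systems). The names `with_workdir` and `show_workdir` are hypothetical and not part of the original script:

```shell
#!/bin/bash
# Sketch only: run a command with a private scratch directory and remove
# that directory afterwards, whether or not the command succeeded.
# with_workdir is a hypothetical helper name.
with_workdir() {
    local workdir status
    workdir=$(mktemp -d) || return 1
    # The scratch directory is passed to the command as its last argument.
    "$@" "$workdir"
    status=$?
    rm -rf -- "$workdir"
    return "$status"
}

# Example callback; the convert and tesseract steps would go here.
show_workdir() {
    echo "working in: $1"
    touch "$1/img.tif"
}
```

Calling `with_workdir show_workdir` prints the scratch directory name, and the directory is gone again once the call returns.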
  2. 2006-09-13 09:08:52 UTC
    'Tis a sweet script indeed.

    Howwwwwwwwwwwwwever,

    The convert command leaves something to be desired ... I'm working with jpg images taken straight from a camera - they are slightly blurry and have many imperfections - Tesseract is able to parse about 50% of the words, but I'm trying to make it better.

    So far things like
    convert test.jpg -contrast-stretch 15% test.jpg
    convert test.jpg -fill white -tint 70 test.jpg
    convert test.jpg -level 20%,80%,1.0 test.jpg
    convert test.jpg -sigmoidal-contrast 10,50% test.jpg
    convert test.jpg -sharpen 0x2 test.jpg

    Followed by conversion to tif have been working for me with limited success.
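
Each command above rewrites test.jpg in place, so the image is re-encoded as JPEG after every step. A sketch of the same operators chained into one call that writes a separate TIFF, assuming ImageMagick's `convert` is on the PATH. The function name `preprocess_photo` is hypothetical, and the operator values are simply the ones from the list above, not tuned settings:

```shell
#!/bin/bash
# Sketch only: apply the five operators from the list in one convert call
# and write the result to a new file instead of overwriting the JPEG.
# preprocess_photo is a hypothetical helper name.
preprocess_photo() {
    local src=$1 dst=$2
    convert "$src" \
        -contrast-stretch 15% \
        -fill white -tint 70 \
        -level 20%,80%,1.0 \
        -sigmoidal-contrast 10,50% \
        -sharpen 0x2 \
        "$dst"
}
```

For example, `preprocess_photo test.jpg test.tif` leaves test.jpg untouched and produces test.tif for tesseract.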

    Does anyone know a set of convert commands that will sharpen text taken from a camera to the point where it can be successfully read by Tesseract?


    Thanks,

    John
  3. 2006-09-29 21:30:14 UTC
    John's changes (which help a ton) added into the wrapper:

    #!/bin/bash

    # check our args
    ARGNO=1 # Number of arguments the script expects.
    E_WRONGARGS=65 # It's nice to enumerate the error codes

    if [ $# -ne $ARGNO ]
    then
        echo "Usage: $(basename "$0") [arg]" >&2 # Error message >stderr.
        exit $E_WRONGARGS
    fi

    if ! [ -f "$1" ]
    then
        echo "File not found: $1" >&2
        exit 1
    fi

    # now process and convert our image a bit
    tmpid=$$
    # echo TID ${tmpid}
    TMPNAME="/tmp/tout.${tmpid}.$(basename "$1" | cut -f2 -d'.')"
    cp "$1" "$TMPNAME"
    # mogrify ${TMPNAME} -contrast-stretch 15% ${TMPNAME}
    convert "${TMPNAME}" -density 150x150 -resize 200% -fill white -tint 50 -level 20%,80%,1.0 -sigmoidal-contrast 30,50% -sharpen 0x2 -compress none -monochrome "/tmp/img.${tmpid}.tif"

    # set up our environ for tesseract, which is very picky
    # cd into the proper directory, since that's where the data dir will be
    whichfile=$(which "$0")
    tdir=$(dirname "$whichfile")
    pushd "$tdir" > /dev/null

    ./tesseract "/tmp/img.${tmpid}.tif" "/tmp/tout.${tmpid}" batch > /dev/null 2>&1
    perl -e 'while(<>) { chomp(); $_ =~ s/\|//g; $_ =~ s/\^~R//g; if ($_ =~ /\w/) { print $_."\n" ; }} ' "/tmp/tout.${tmpid}.txt"
    rm /tmp/tout.${tmpid}.*
    rm "/tmp/img.${tmpid}.tif"
    popd > /dev/null
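
The Perl one-liner near the end of the script does three things: it deletes `|` characters, deletes the literal sequence `^~R`, and prints only lines that contain at least one word character. A rough equivalent with sed and grep, as a sketch. The function name `clean_ocr_text` is hypothetical, and `[[:alnum:]_]` only approximates Perl's `\w` for ASCII text:

```shell
#!/bin/bash
# Sketch only: filter OCR text read from stdin.
# clean_ocr_text is a hypothetical helper name.
clean_ocr_text() {
    # 1) drop pipe characters, 2) drop the literal sequence ^~R,
    # 3) keep only lines that still contain a letter, digit or underscore.
    sed -e 's/|//g' -e 's/\^~R//g' | grep '[[:alnum:]_]'
}
```

It reads from stdin, so it could replace the Perl step as `clean_ocr_text < /tmp/tout.${tmpid}.txt`.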
  4. 2006-10-01 13:53:32 UTC
    How can I install this script on Linux?
  5. 2006-10-28 03:09:52 UTC
    This version converts one or many files.
    However, I did not implement the density settings and the like ...

    ****
    #!/bin/bash

    DELT=TRUE
    DELG=TRUE
    TARGETDIR=/tmp/
    OUTPUT=/tmp/

    if [[ -z "$1" ]]
    then
        echo "Usage: $0 [-p] [-t TIF-DIR -o TEXT-DIR -f file ] files_to_OCR
    scans all files with tesseract and outputs the text on STDOUT

    OPTIONS:
    -t TIF-DIR saves converted TIF images in target-directory TIF-DIR
    -o TEXT-DIR saves files created by tesseract in TEXT-DIR
    -f FILE saves all text output in file FILE
    -s suppress output of text on STDOUT (only useful with -t or -o;
    do not use with -f)

    Examples:
    $0 file1 file2 ...
    scans the files and displays all the text.

    When STDOUT is used for the scanned text, progress and error
    messages are redirected to tes.error.log. With -f the text goes
    to FILE and STDOUT shows progress and error messages (no log file).
    Scanning takes time... be patient :)
    "

        exit
    fi
    # redirect errors to the log file and fd 5 to stdout;
    # -f and -s change this
    exec 2>tes.error.log
    exec 5>&1


    # "s" added to the option string so that the -s branch below is reachable
    while getopts ":t:o:f:ps" Option
    do
        case $Option in
            t ) # save converted tif files in directory
                if [[ -d "$OPTARG" ]]
                then
                    TARGETDIR="$OPTARG"/
                    DELG=FALSE
                else
                    echo "-t: $OPTARG : no such folder"
                    exit 1
                fi
                ;;
            o ) # save tesseract files in directory
                if [[ -d "$OPTARG" ]]
                then
                    OUTPUT="$OPTARG"/
                    DELT=FALSE
                else
                    echo "-o: $OPTARG : no such folder"
                    exit 1
                fi
                ;;
            f ) # no STDOUT; save in file
                FILE="$OPTARG"
                exec 5>"$FILE"
                exec 2>&1
                ;;
            s ) # suppress STDOUT; show messages
                exec 2>&1
                exec 5>/dev/null
                ;;
        esac
    done


    echo "TARGET-DIR=$TARGETDIR" >&2
    echo "OUTPUT-DIR=$OUTPUT" >&2


    shift $(($OPTIND - 1))


    for i in "$@"
    do
        if [[ -f "$i" ]]
        then
            # check if graphics file can be converted; identify's own output is discarded
            if ! identify "$i" >/dev/null 2>&1
            then
                echo "$i: not convertible" >&2
            else
                echo "processing: $i ..." >&2
                # converting the graphic file
                NEWTIF="$TARGETDIR""${i%\.*}".tif
                convert "$i" "$NEWTIF"

                # scanning the newly created tif
                TEXT="$OUTPUT""${i%\.*}"
                tesseract "$NEWTIF" "$TEXT"

                # -p: output scanned text on STDOUT
                # fd #5 is set in getopts
                cat "$TEXT".txt >&5

                # delete graphic file after use
                if [[ $DELG == "TRUE" ]]
                then
                    rm "$NEWTIF"
                fi

                # delete tesseract output
                if [[ $DELT == "TRUE" ]]
                then
                    rm "$TEXT".map "$TEXT".raw "$TEXT".txt
                fi

                # concatenate all text outputs
                if [[ $CAT == "TRUE" ]]
                then
                    cat "$TEXT" >> "$TES"
                fi

            fi
        fi
    done
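
The script keeps OCR text and progress messages apart by opening file descriptor 5 as a copy of stdout and sending the text there, while messages go to stderr. A stripped-down sketch of that pattern; the function names and page names are hypothetical, and the echoed lines only stand in for real tesseract output:

```shell
#!/bin/bash
# Sketch only: text goes to fd 5, progress messages go to stderr (fd 2).
emit() {
    echo "processing: $1 ..." >&2   # progress message
    echo "text from $1" >&5         # stands in for the OCR text
}

run_demo() {
    local outfile=$1
    if [ -n "$outfile" ]; then
        exec 5>"$outfile"           # like -f: collect the text in a file
    else
        exec 5>&1                   # default: text goes to stdout
    fi
    emit page1.tif
    emit page2.tif
    exec 5>&-                       # close fd 5 again
}
```

With `run_demo "" 2>/dev/null` only the text lines reach stdout; with `run_demo out.txt` the text lands in out.txt and only the progress messages remain on stderr.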
  6. 2006-10-29 15:59:09 UTC
    Updated script for tesseract.
    It converts and scans one or more files.
    ***
    #!/bin/bash

    DELT=TRUE
    DELG=TRUE
    TARGETDIR=/tmp/
    OUTPUT=/tmp/

    if [[ -z "$1" ]]
    then
        echo "Usage: $0 [-p] [-t TIF-DIR -o TESS-DIR -f file ] files_to_OCR
    scans all files with tesseract and outputs the text on STDOUT

    OPTIONS:
    -t TIF-DIR saves converted TIF images in target-directory TIF-DIR
    -o TESS-DIR saves files created by tesseract in TESS-DIR
    -f FILE saves all text output in file FILE
    -s suppress output of text on STDOUT (only useful with -t or -o;
    do not use with -f)
    -c convert with fill white, resize, sigmoidal-contrast, etc..

    Examples:
    $0 file1 file2 ...
    scans the files and displays all the text.

    When STDOUT is used for the scanned text, progress and error
    messages are redirected to tes.error.log. With -f the text goes
    to FILE and STDOUT shows progress and error messages (no log file).
    Scanning takes time... be patient :)
    "

        exit
    fi
    # redirect errors to the log file and fd 5 to stdout;
    # -f and -s change this
    exec 2>tes.error.log
    exec 5>&1


    # "s" added to the option string so that the -s branch below is reachable
    while getopts ":t:o:f:psc" Option
    do
        case $Option in
            t ) # save converted tif files in directory
                if [[ -d "$OPTARG" ]]
                then
                    TARGETDIR="$OPTARG"/
                    DELG=FALSE
                else
                    echo "-t: $OPTARG : no such folder"
                    exit 1
                fi
                ;;
            o ) # save tesseract files in directory
                if [[ -d "$OPTARG" ]]
                then
                    OUTPUT="$OPTARG"/
                    DELT=FALSE
                else
                    echo "-o: $OPTARG : no such folder"
                    exit 1
                fi
                ;;
            f ) # no STDOUT; save in file
                FILE="$OPTARG"
                exec 5>"$FILE"
                exec 2>&1
                ;;
            s ) # suppress STDOUT; show messages
                exec 2>&1
                exec 5>/dev/null
                ;;
            c ) # convert picture extensively
                CONV=TRUE
                ;;
        esac
    done


    echo "TIF-output directory=$TARGETDIR" >&2
    echo "tesseract output directory=$OUTPUT" >&2


    shift $(($OPTIND - 1))


    for i in "$@"
    do
        I_FILE=${i##*/} # just the filename w/o directory
        if [[ ! -f "$i" ]]
        then
            echo "$i: file not found" >&2
        else
            # check if graphics file can be converted; identify's own output is discarded
            if ! identify "$i" >/dev/null 2>&1
            then
                echo "$i: not convertible" >&2
            else
                echo "processing: $i ..." >&2
                # converting the graphic file
                NEWTIF="$TARGETDIR""${I_FILE%\.*}".tif
                #convert "$i" "$NEWTIF"
                if [[ $CONV == "TRUE" ]]
                then
                    convert "$i" -density 150x150 -resize 200% -fill white -tint 50 -level 20%,80%,1.0 -sigmoidal-contrast 30,50% -sharpen 0x2 -compress none -monochrome "$NEWTIF" 1>&2
                else
                    convert "$i" -density 150x150 -compress none "$NEWTIF" 1>&2
                fi

                # scanning the newly created tif
                T_FILE="$OUTPUT""${I_FILE%\.*}"
                tesseract "$NEWTIF" "$T_FILE" 1>&2

                # -p: output scanned text on STDOUT
                # fd #5 is set in getopts
                cat "$T_FILE".txt >&5

                # delete graphic file after use
                if [[ $DELG == "TRUE" ]]
                then
                    rm "$NEWTIF"
                fi

                # delete tesseract output
                if [[ $DELT == "TRUE" ]]
                then
                    rm "$T_FILE".map "$T_FILE".raw "$T_FILE".txt
                fi

                # concatenate all text outputs
                if [[ $CAT == "TRUE" ]]
                then
                    cat "$T_FILE" >> "$TES"
                fi

            fi

        fi
    done
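
The loop derives its output names with shell parameter expansion. As a sketch of that idea: `${path##*/}` removes everything up to the last slash, and `${name%.*}` removes the last extension. The function name `tif_path` and the sample paths are hypothetical:

```shell
#!/bin/bash
# Sketch only: map an input image path to a TIFF path in a target directory.
# tif_path is a hypothetical helper name.
tif_path() {
    local path=$1 targetdir=$2
    local name=${path##*/}      # strip leading directories
    local stem=${name%.*}       # strip the last extension, if any
    # ${targetdir%/} drops one trailing slash so the result has exactly one.
    echo "${targetdir%/}/${stem}.tif"
}
```

For example, `tif_path scans/page.one.jpg /tmp/` prints `/tmp/page.one.tif`.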

  7. 2006-10-29 23:06:09 UTC
    The above scripts are buggy; do not use them.
    I will try to create a final, usable version.
  8. 2006-10-30 16:07:06 UTC
    All further versions are here: http://www.geocities.com/thierryguy/tess_wrap.html