#17 Tesseract 1.03 cannot recognize phototest.tif

open
nobody
None
5
2007-03-23
2007-03-23
Alex Deiter
No

Hi,

Solaris 9, sparc64, gcc 4.1.2
FreeBSD 5.3 i386, gcc 3.4.2

tesseract-1.03 with libtiff support cannot recognize phototest.tif:

$ tesseract phototest.tif text
Tesseract Open Source OCR Engine

$ cat text.txt
pmorvxu qo6 jnwbeq oAeL we gas?` ;ox~
]F1LUbGq OAGL QJG {SEA {OX` j_}.IG dF1!C}(
OAGL [{16 {SEA J`OX~ j_}JG ClI'1!C}( pLOMU qo6
gas?` ;ox~ ipe dngcg pkorvxu qod jnuabeq
j_}JG ClI'1!C}( pLOMU qo6 ]f1!JJbGq OAGL HJG
0% HIS J=OHiJ9I~
OCL COqG *3Uq 266 QJG ![ MOLK2 OU *3}} []xbG2
J.!J!e !e 9 lot 0% JS bO!U{ IGXI to [Gel {IJG

I found a workaround for this bug. Please review the attached file. Tested on FreeBSD i386/gcc-3.4.2 and Solaris sparc64/gcc-4.1.2:

$ tesseract phototest.tif text
Tesseract Open Source OCR Engine

$ cat text.txt
This is a lot of 12 point text to test the
ocr code and see if it works on all types
of file format.
The quick brown dog jumped over the
lazy fox. The quick brown dog jumped
over the lazy fox. The quick brown dog
jumped over the lazy fox. The quick
brown dog jumped over the lazy fox.

Thanks a lot!

Discussion

  • Alex Deiter
    2007-03-23

    Patch for libtiff support

     
  • Logged In: NO

    I tried compiling it on Cygwin after applying the patch. It doesn't seem to work.
    It still gives me exactly the same output.

    $ cat text.txt
    pmorvxu qo6 jnwbeq oAeL we gas?` ;ox~
    ]F1LUbGq OAGL QJG {SEA {OX` j_}.IG dF1!C}(
    OAGL [{16 {SEA J`OX~ j_}JG ClI'1!C}( pLOMU qo6
    gas?` ;ox~ ipe dngcg pkorvxu qod jnuabeq
    j_}JG ClI'1!C}( pLOMU qo6 ]f1!JJbGq OAGL HJG
    0% HIS J=OHiJ9I~
    OCL COqG *3Uq 266 QJG ![ MOLK2 OU *3}} []xbG2
    J.!J!e !e 9 lot 0% JS bO!U{ IGXI to [Gel {IJG
    sandeepgiri [ at ] gmail.com

     
  • Logged In: NO

    I get a segmentation fault with this patch.

    gcc version 2.95.4 20011002 (Debian prerelease)
    Linux i686

     
  • Peter Fales
    2007-06-15

    Logged In: YES
    user_id=150101
    Originator: NO

    That patch didn't quite work for me (it caused tesseract to crash), but it gave me the clue I needed to find a fix that did. The original code read lines from the TIFF file one at a time and filled the buffer from the "bottom" to the "top." As a result, tesseract tried to recognize upside-down characters, which is the reason for the garbage in the scan results. The patch reverses the order in which the buffer is filled, but because it increments the pointer before storing the first line, the last line overruns the end of the buffer.
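
To illustrate the two failure modes described above, here is a minimal standalone sketch. It is not tesseract or libtiff code: `Rows`, `fill_top_down`, `fill_bottom_up`, and `offset_if_advanced_first` are hypothetical names, and the decoded image is assumed to be a simple vector of rows in file order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for decoded scanlines, stored in file order (row 0 first).
using Rows = std::vector<std::vector<std::uint8_t>>;

// Copies rows into one contiguous buffer in file order: store, then advance.
std::vector<std::uint8_t> fill_top_down(const Rows& rows, std::size_t bytes_per_line) {
  std::vector<std::uint8_t> dest(rows.size() * bytes_per_line);
  std::uint8_t* p = dest.data();
  for (std::size_t y = 0; y < rows.size(); ++y) {
    assert(rows[y].size() >= bytes_per_line);
    std::copy_n(rows[y].begin(), bytes_per_line, p);  // store at the current slot
    p += bytes_per_line;                              // then move to the next slot
  }
  return dest;
}

// Fills from the end backwards: row 0 lands in the last slot, so the image
// comes out vertically flipped (the "upside down" case).
std::vector<std::uint8_t> fill_bottom_up(const Rows& rows, std::size_t bytes_per_line) {
  std::vector<std::uint8_t> dest(rows.size() * bytes_per_line);
  std::uint8_t* p = dest.data() + dest.size();  // one past the end
  for (std::size_t y = 0; y < rows.size(); ++y) {
    assert(rows[y].size() >= bytes_per_line);
    p -= bytes_per_line;                              // step back first
    std::copy_n(rows[y].begin(), bytes_per_line, p);  // then store
  }
  return dest;
}

// Offset where row y would be written by a forward loop that advances the
// pointer *before* storing. For the last row this equals the buffer size,
// i.e. the write starts one full line past the end of the buffer.
std::size_t offset_if_advanced_first(std::size_t y, std::size_t bytes_per_line) {
  return (y + 1) * bytes_per_line;
}
```

With three 2-byte rows, the advance-first offset for the last row is 6, which is the size of the whole buffer. That is consistent with the crash reported above.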

     
  • Logged In: NO

    Yes, psfales' suggestion works perfectly.
    I made the following change in tessedit.cpp:
    /* Code starts here */
    #if 0
    int bytes_per_line = (image_width*bpp + 7)/8;
    UINT8* dest_buf = image->get_buffer() + bytes_per_line*image_height;
    #else
    uint32 bytes_per_line = (image_width*bpp + 7)/8;
    UINT8* dest_buf = image->get_buffer();
    #endif

    // This will go badly wrong with one of the more exotic tiff formats,
    // but the majority will work OK.
    for (uint32 y = 0; y < image_height; ++y) {
      TIFFReadScanline(tif, buf, y);
      memcpy(dest_buf, buf, bytes_per_line);
      dest_buf += bytes_per_line;
    }

    /* Ends here */

    And here is the output

    [root@PCDEVLIN tesseract-1.03]# ./tesseract phototest.tif new
    Tesseract Open Source OCR Engine

    [root@PCDEVLIN tesseract-1.03]# cat new.txt
    This is a lot of 12 point text to test the
    ocr code and see if it works on all types
    of file format.
    The quick brown dog jumped over the
    lazy fox. The quick brown dog jumped
    over the lazy fox. The quick brown dog
    jumped over the lazy fox. The quick
    brown dog jumped over the lazy fox.
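
The row-size expression used in the change above, `(image_width*bpp + 7)/8`, rounds the bit count of one scanline up to a whole number of bytes. The sketch below isolates that arithmetic. The function name `bytes_per_line` and the sample dimensions are illustrative only, and it assumes each row is packed and padded to a byte boundary.

```cpp
#include <cstdint>

// Bytes needed for one packed scanline, rounded up to a whole byte.
// Adding 7 before the integer division by 8 turns truncation into a ceiling.
constexpr std::uint32_t bytes_per_line(std::uint32_t image_width, std::uint32_t bpp) {
  return (image_width * bpp + 7) / 8;
}

// 1 bit per pixel: 8 pixels fit in one byte, a 9th pixel needs a second byte.
static_assert(bytes_per_line(8, 1) == 1, "8 one-bit pixels fit in 1 byte");
static_assert(bytes_per_line(9, 1) == 2, "9 one-bit pixels need 2 bytes");
// 8 bits per pixel: one byte per pixel, no rounding needed.
static_assert(bytes_per_line(640, 8) == 640, "8 bpp is one byte per pixel");
```

Without the `+ 7`, a 9-pixel bilevel row would be computed as 1 byte and the last pixel of every row would be dropped from the copy.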