Hi,
Solaris 9, sparc64, gcc 4.1.2
FreeBSD 5.3 i386, gcc 3.4.2
tesseract-1.03 with libtiff support cannot recognize phototest.tif:
$ tesseract phototest.tif text
Tesseract Open Source OCR Engine
$ cat text.txt
pmorvxu qo6 jnwbeq oAeL we gas?` ;ox~
]F1LUbGq OAGL QJG {SEA {OX` j_}.IG dF1!C}(
OAGL [{16 {SEA J`OX~ j_}JG ClI'1!C}( pLOMU qo6
gas?` ;ox~ ipe dngcg pkorvxu qod jnuabeq
j_}JG ClI'1!C}( pLOMU qo6 ]f1!JJbGq OAGL HJG
0% HIS J=OHiJ9I~
OCL COqG *3Uq 266 QJG ![ MOLK2 OU *3}} []xbG2
J.!J!e !e 9 lot 0% JS bO!U{ IGXI to [Gel {IJG
I found a workaround for this bug. Please review the attached file. Tested on FreeBSD i386/gcc-3.4.2 and Solaris sparc64/gcc-4.1.2:
$ tesseract phototest.tif text
Tesseract Open Source OCR Engine
$ cat text.txt
This is a lot of 12 point text to test the
ocr code and see if it works on all types
of file format.
The quick brown dog jumped over the
lazy fox. The quick brown dog jumped
over the lazy fox. The quick brown dog
jumped over the lazy fox. The quick
brown dog jumped over the lazy fox.
Thanks a lot!
Patch for libtiff support
Logged In: NO
I tried compiling it on Cygwin after applying the patch. It doesn't seem to work.
It still gives me exactly the same output.
$ cat text.txt
pmorvxu qo6 jnwbeq oAeL we gas?` ;ox~
]F1LUbGq OAGL QJG {SEA {OX` j_}.IG dF1!C}(
OAGL [{16 {SEA J`OX~ j_}JG ClI'1!C}( pLOMU qo6
gas?` ;ox~ ipe dngcg pkorvxu qod jnuabeq
j_}JG ClI'1!C}( pLOMU qo6 ]f1!JJbGq OAGL HJG
0% HIS J=OHiJ9I~
OCL COqG *3Uq 266 QJG ![ MOLK2 OU *3}} []xbG2
J.!J!e !e 9 lot 0% JS bO!U{ IGXI to [Gel {IJG
sandeepgiri [ at ] gmail.com
Logged In: NO
I get a segmentation fault with this patch.
gcc version 2.95.4 20011002 (Debian prerelease)
Linux i686
Logged In: YES
user_id=150101
Originator: NO
That patch didn't quite work for me (it caused tesseract to crash), but it gave me the clue I needed to find a fix that did. The original code reads lines from the TIFF file one at a time and fills the buffer from the "bottom" to the "top." As a result, tesseract tries to recognize characters that are upside down, which is the reason for the garbage in the scan results. The patch reverses the order of filling the buffer, but because it increments the pointer before storing the first line, the last line overruns the end of the buffer.
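The pointer ordering described above can be sketched without libtiff. This is a minimal illustration, not the tesseract code: read_scanline is a hypothetical stand-in for TIFFReadScanline that fills row y with the byte value y, and load_top_down is a hypothetical helper name. It stores each row first and advances the destination pointer afterwards, so the last row ends exactly at the end of the buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for TIFFReadScanline: every byte of row y is set to y.
static void read_scanline(std::vector<std::uint8_t>& row, std::size_t y) {
  std::memset(row.data(), static_cast<int>(y & 0xFF), row.size());
}

// Fills a buffer row by row, first scanline at the start of the buffer.
std::vector<std::uint8_t> load_top_down(std::size_t width, std::size_t height,
                                        std::size_t bpp) {
  // Round the row size up to a whole number of bytes.
  const std::size_t bytes_per_line = (width * bpp + 7) / 8;
  std::vector<std::uint8_t> dest(bytes_per_line * height);
  if (dest.empty()) return dest;  // nothing to copy

  std::vector<std::uint8_t> row(bytes_per_line);
  std::uint8_t* out = dest.data();
  for (std::size_t y = 0; y < height; ++y) {
    read_scanline(row, y);
    // The row about to be written must fit inside the buffer.
    assert(out + bytes_per_line <= dest.data() + dest.size());
    std::memcpy(out, row.data(), bytes_per_line);
    // Advance only after storing. Advancing before the memcpy would shift
    // every row down by one and push the last row past the end of dest.
    out += bytes_per_line;
  }
  return dest;
}
```

For a 10 pixel wide, 3 row, 1 bit per pixel image, bytes_per_line is 2 and the buffer holds 6 bytes, with row 0 in bytes 0-1 and row 2 in bytes 4-5.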
Logged In: NO
Yes, psfales' suggestion works perfectly.
I made the following change in tessedit.cpp:
/* Code starts here */
#if 0
  // Disabled: destination pointer starts at the end of the buffer.
  int bytes_per_line = (image_width*bpp + 7)/8;
  UINT8* dest_buf = image->get_buffer() + bytes_per_line*image_height;
#else
  // Fix: destination pointer starts at the beginning of the buffer.
  uint32 bytes_per_line = (image_width*bpp + 7)/8;
  UINT8* dest_buf = image->get_buffer();
#endif
  // This will go badly wrong with one of the more exotic tiff formats,
  // but the majority will work OK.
  for (uint32 y = 0; y < image_height; ++y) {
    TIFFReadScanline(tif, buf, y);
    memcpy(dest_buf, buf, bytes_per_line);
    dest_buf += bytes_per_line;  // advance only after the line is stored
  }
/* Ends here */
And here is the output:
[root@PCDEVLIN tesseract-1.03]# ./tesseract phototest.tif new
Tesseract Open Source OCR Engine
[root@PCDEVLIN tesseract-1.03]# cat new.txt
This is a lot of 12 point text to test the
ocr code and see if it works on all types
of file format.
The quick brown dog jumped over the
lazy fox. The quick brown dog jumped
over the lazy fox. The quick brown dog
jumped over the lazy fox. The quick
brown dog jumped over the lazy fox.