Trying gocr and ocrad on the trivial case of OCR, namely black/white
printed digital text, I was not quite happy with the results.
The following stumbling blocks occurred. For each one, the observed outcomes
of different versions are listed, sorted by subjective severity of the error:
a) Ligature "ft" b) Character "ß"
a2) "ft" via db mode b2) "ß" via db mode
a3) "_" b3) "_"
a4) "R" in spite of db mode b4) "R" in spite of db mode
c) Ligature "rw" d) Ligature "rt"
c1) "rw" d1) "rt"
c2) "rw" via db mode d2) "rt" via db mode
c3) "rrw" in spite of db mode d3) "rrt" in spite of db mode
e) Character "x" f) Character "g"
e1) "x" f1) "g"
e2) "x" via db mode f2) "g" via db mode
f3) "9" (nine)
g) Character "w" h) Character "W"
g1) "w" h1) "W"
g2) "w" via db mode h2) "W" via db mode
g3) "N"
i) Character "l" vs. "I" j) Character ":"
j1) ":"
i2) all "l" except in capitals
i3) some "I" as "l", capitals correct j3) "="
j4) ".." after asking for "."
k) Character "?" l) Character "!"
k1) "?" l1) "!"
k3) "7." l3) "!."
m) The sequence "..." n) Punctuation order
m1) "..." n1) ok
m2) "...", leading space missing n4) one line too high
m3) sometimes "."/"..", sometimes "..."
m4) "...", sometimes ".. ."
General note: gocr sometimes produces different results depending on whether
a db entry was just made or was already in the db from a previous run.
Also note that "l" and "I" were not optically distinguishable, so no useful
result can be expected there. In the combinations "rt", "rw", and "ft" the
first letter touched the second.
gocr was always run with the options -m 130 -f UTF8 -s 10 -p somepath/ and
a separate database for each binary. The last binary is current cvs plus
a patch that fixes the broken db (but re-introduces another bug).
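To make the runs easier to repeat, the invocation can be wrapped in a small helper. This is only a sketch: the function name, the file names and the per-binary database directories are made up for illustration, and passing the image with -i is my assumption about how the input file was given; only the options -m 130 -f UTF8 -s 10 -p are the ones actually used above.

```python
def gocr_command(image, db_path, binary="gocr"):
    """Build the argument list for one gocr run (nothing is executed here).

    image   -- hypothetical input file name
    db_path -- hypothetical per-binary database directory
    binary  -- which gocr executable to call
    """
    # The path above was written with a trailing slash ("somepath/"),
    # so mirror that here.
    if not db_path.endswith("/"):
        db_path += "/"
    return [binary,
            "-m", "130",
            "-f", "UTF8",
            "-s", "10",
            "-p", db_path,
            "-i", image]  # -i for the input file is an assumption


# Example: one command per binary, each with its own database directory.
for name in ("gocr-0.39", "gocr-0.40"):
    print(" ".join(gocr_command("page.pnm", "db-" + name, binary=name)))
```

The list can be handed to subprocess.run() as is. Keeping one database directory per binary means that entries made by one version do not influence another.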
Listed after each executable are the characters it asked for (the "." was
always bogus), followed by the outcomes for the problems listed above.
abcdefghijklmn
gocr 0.39: ß rt rw ft g . x 22332211341124
gocr @2004-02-13: ß rt rw ft g . x 22332211341124
gocr @2004-02-14: ß rt rw ft g . x 22332211341124
gocr @2004-02-20: ß rt rw ft g . x 22332211341124
gocr @2004-02-24: ß rt rw ft g . x 22332211341124
gocr @2004-03-17: ß rt rw ft g . x 22332211341124
gocr @2004-09-08: ß rt rw ft g . x 22332211341124
gocr @2004-10-09: ß rt rw ft g . x 22332211341124
gocr @2004-10-12: ß rt rw ft g . x 22332211341124
gocr @2004-11-26: ß rt rw ft g . x w W 22332222341124
gocr @2005-01-11: ß rt rw ft g . x w W 22332222341124
gocr @2005-01-26: ß rt rw ft g . x w W 22332222341124
gocr @2005-02-24: ß rt rw ft g . x w W 22332222341124
gocr @2005-03-01: ß rt rw ft g . x w W 44222222341124
gocr @2005-03-08: ß rt rw ft g . x w W 44222222341124
gocr @2005-03-09: ß rt rw ft g . x w W 44222222341114
gocr @2005-03-11: ß rt rw ft g . w W 44221222341114
gocr @2005-03-12: ß rt rw ft g . w W 44221222343314
gocr 0.40: ß rt rw ft g . w W 44221222333311
gocr @2005-03-14: ß rt rw ft g . w W 44221222343314
gocr @2005-03-15: ß rt rw ft g . w W 44221222343314
gocr @2005-03-21: ß rt rw ft g . w W 44221222343314
gocr @2005-03-23: ß rt rw ft . w W 44221322343314
gocr @2005-03-31: ß rt rw ft . w W 44221322343314
gocr @2005-04-03: ß rt rw ft . w W 44221322343314
gocr @2005-04-19: ß rt rw ft . w W 44221322343314
gocr @2005-05-19: ß rt rw ft W 44221332311141
gocr @2005-05-20: ß rt rw ft W 44221332311141
gocr @2005-05-24: ß rt rw ft W 44221332331141
gocr @2006-01-08-fix: ß rt rw ft W 22331332331141
The versions are the dates on which changes were made in cvs. Since the sf
cvs is pretty broken currently, I may have missed some. Between compiles I
did make clean. However, release 0.40 does not fit into the scheme, as it
behaves differently from the versions around it.
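For reading the table, a small script helps to see what changed between two versions. It is a sketch under the assumption that the k-th digit of an outcome string is the outcome number for the k-th problem letter in the header "abcdefghijklmn" (so a 3 in the third position means c3). The function names are made up for illustration.

```python
import string

PROBLEMS = string.ascii_lowercase[:14]  # "abcdefghijklmn", as in the table header


def decode(outcome):
    """Map each problem letter to its outcome number, e.g. {'a': 2, ...}."""
    if len(outcome) != len(PROBLEMS) or not outcome.isdigit():
        raise ValueError("expected %d digits, got %r" % (len(PROBLEMS), outcome))
    return {p: int(d) for p, d in zip(PROBLEMS, outcome)}


def changed(old, new):
    """Return {problem: (old_outcome, new_outcome)} for problems that differ."""
    a, b = decode(old), decode(new)
    return {p: (a[p], b[p]) for p in PROBLEMS if a[p] != b[p]}


# Outcome strings copied from the table above.
gocr_0_39 = "22332211341124"
gocr_2004_11_26 = "22332222341124"

print(changed(gocr_0_39, gocr_2004_11_26))
# {'g': (1, 2), 'h': (1, 2)}
```

Under that reading, the only difference between 0.39 and the 2004-11-26 snapshot is that "w" and "W" are no longer recognized directly but only via db mode, which matches the two extra characters asked for in that row.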
Additionally, database-only mode was tried. Problems: "..." was detected
as ". . .", bug c3) / d3), '' detected as ll, other spacing problems.
For comparison, the current version of ocrad (output with s/"/''/g and
empty lines dropped for comparison):
Ocrad 0.13: 33111111211131
* "ist ja" detected as "istja"
* one s capitalized wrong
Interesting that it doesn't stumble over the "rw" and "rt" combinations.
If ocrad had a database mode, it would certainly give the best results.
Here's the patch to fix the database problems (note the tarball you
mentioned on this list does _not_ fix the problem):
ruediger@...> cvs di lines.c | tee diff
Index: lines.c
===================================================================
RCS file: /cvsroot/jocr/jocr/src/lines.c,v
retrieving revision 1.32
diff -u -r1.32 lines.c
--- lines.c 18 May 2005 08:49:07 -0000 1.32
+++ lines.c 8 Jan 2006 18:28:43 -0000
@@ -278,6 +278,12 @@
if (box2->num_ac>1) { /* output alist */
}
}
+ if (box2->obj && (JOB->cfg.out_format!=XML || box2->obj[0]!='<')) {
+ buffer=append_to_line(buffer,box2->obj,&len);
+ j+=strlen(box2->obj);
+ /* should we free box2->obj here and reset? not very clean */
+ free(box2->obj); box2->obj=NULL;
+ } else
if (box2->c != UNKNOWN && box2->c!=0) {
buffer=
append_to_line(buffer,decode(box2->c,JOB->cfg.out_format),&len);
@@ -285,16 +291,9 @@
box2->c <= 'z') i2++; /* count non-space chars */
} else {
wchar_t cc; cc=box2->c;
- if (box2->obj && (JOB->cfg.out_format!=XML || box2->obj[0]!='<')) {
- buffer=append_to_line(buffer,box2->obj,&len);
- j+=strlen(box2->obj);
- /* should we free box2->obj here and reset? not very clean */
- free(box2->obj); box2->obj=NULL;
- } else {
if (cc==UNKNOWN && !(mo & 8)) { cc=box2->ac; } /* take alternate char */
buffer=
append_to_line(buffer,decode(cc,JOB->cfg.out_format),&len);
- }
}
if (JOB->cfg.out_format==XML) {
if (box2->num_ac>0) { /* output alist */
Additionally, there are a few weird things in cvs. Generated Makefiles are
checked in, although they'll be overwritten by configure anyway, and there
are no .cvsignore files hiding the generated files from cvs.
./bin/.cvsignore:.cvsignore
./bin/.cvsignore:gocr
./doc/.cvsignore:Makefile
./doc/.cvsignore:.cvsignore
./man/.cvsignore:Makefile
./man/.cvsignore:.cvsignore
./src/api/.cvsignore:Makefile
./src/api/.cvsignore:.cvsignore
./src/.cvsignore:Makefile
./src/.cvsignore:.cvsignore
./src/.cvsignore:gocr
./.cvsignore:Makefile
./.cvsignore:.cvsignore
./include/.cvsignore:config.h
./include/.cvsignore:.cvsignore
Also, the file confdefs.h is deleted on make clean, but is in cvs.
--
"See, free nations are peaceful nations. Free nations don't attack
each other. Free nations don't develop weapons of mass destruction."
- George W. Bush, Milwaukee, Wis., Oct. 3, 2003