
#5 fixes mixup where gocr sees d as a

Status: closed

Owner: nobody

Labels: None

Priority: 5

Updated: 2009-03-29

Created: 2006-09-11

Creator: Dennis Sheil

Private: No

Hello,

I noticed with gocr, lowercase d's are sometimes
mistaken for a's. Recently I downloaded some of the
scanned pages from Project Gutenberg's Distributed
Proofreaders ( http://www.pgdp.net ). The pages were
from different books with different fonts, scanned by
different people using different methods - some were
black and white, some black and slightly yellow etc.
For the majority of these pages, a "d" was mistaken
for an "a" at least once.

The reasons for this are two-fold. For one, there is a
check in the ocr0_dD() function which is a little bit
too strict. The one I mean is:

for(i=dx/8+1,x=0;x<dx && i;x++){
if( num_cross(x ,x ,0 ,dy-1, bp,cs) == 2 ) i--;
} if( i ) Break;

The main problem is that there is often a serif at
the top of the d. In tests I ran on very different
pages, i often got down to 1 and stayed there while x
climbed up to dx. This is usually because a serif at
the top of the d prevents num_cross from seeing
enough 2's.

I also had a scanned page where there was a serif on
top of the d, but it was not so large. The problem was
that, in that scan and font, there was a slight bulge in
the d's circle at the bottom. If the circle at the bottom
of the d were a clock, you could say the bulge was at 7
o'clock. Thus, when the num_cross check started, it would
count 3 when it went through the bulge, and would again
count 3 when it reached the small serif. Thus, once
again, i was 1 while x became equal to dx. So if a
bulge appears in the circle at the bottom of d, this
can also contribute to the check failing. Although in
most cases, the serif on top juts out just a little too
far.

So this is why the character is not seen as a d. From
what I understand of gocr's philosophy, this is not a
big deal. From what I've read of your comments, you're more
concerned about false positives than unknown letters.
The question now is, why is it seen as an a?

The reason once again is due to the serif at top. With
a serif jutting leftward, the shape of d is similar to
that of a. Usually to my eyes, the serif is not
jutting out far enough to look like an a. Perhaps a
stricter num_cross check is in order. But there's
something even more apparent.

With gocr's m1 being the upper bound of characters like
E, m2 being the upper bound of characters like e, m3
being the lower bound of characters like e (the
baseline), and m4 being the lower bound of characters
like g, there is no reason d should ever be mistaken
for a. But it seems that the check for a never tests this.

My patch does two things. First, it removes the +1 from
i in the aforementioned for loop in ocr0_dD(). The new
loop is:

for(i=dx/8,x=0;x<dx && i;x++){
if( num_cross(x ,x ,0 ,dy-1, bp,cs) == 2 ) i--;
} if( i ) Break;

This allows for the kind of serif you quite often find
on top of a d.

The other change I make is in ocr0_aA(). The function
currently looks into the box struct for x0, x1, y0 and
y1. I look for m1 as well. If m1-y0 is greater than
or equal to 0, I break.

This may not be the best way to fix this for a, and if
you have a better method of measuring the character up
against m1, your way would probably be better. I tried
it this way and it has worked well.

I chose the value 0 because I noticed that when an a was
correctly identified, its m1-y0 was usually between -9
and -14. When a d was mistakenly identified as an a,
its m1-y0 was usually between 0 and 3. If I ran even
more tests I'm sure I would see a wider range. But it
seems to me there is no reason that m1-y0 on an "a" would
get all the way up to 0. So I set it at that.

So I applied my patch and ran it on most of the pbm's
and pcx's in the examples directory. For all the tests
I ran, my patch had no effect on the outputs. Then I
ran my patch on a number of different pages from books
from Distributed Proofreaders. My patch had no
negative effects on any of my tests, only positive
ones. For pages where d's were mistaken for a's (which
is quite a number of them), many of the d's which had
been seen as a's were correctly seen as d's.

So to reiterate, when testing my patch against included
examples, I saw no changes. When testing against
real-world scans from different books with different
fonts from the Distributed Proofreaders site, I saw a
number of times that "d" was correctly recognized as
"d", and not "a" like in the current CVS of gocr.

Discussion

Dennis Sheil - 2006-09-11

patch to ocr0.c

ocr0.c.patch

Dennis Sheil - 2006-09-18

In terms of possible changes, one thing to note is I did not
test this on very small and large characters. Perhaps a
setac() would be better than a break if the y lengths are
not in tune.

Joerg Schulenburg - 2009-03-29

accepted with modifications, but a small (cropped) sample would have been helpful for tests

Joerg Schulenburg - 2009-03-29

status: open --> closed