[Jocr-devels] Request for an overview

I've thought OCR would be interesting for a long time and I now also have 
a need to read some scanned documents, so it seems a good idea to put the
motivation to use.

My aims would be:
a) to link gocr to my favorite language (Pike, http://pike.roxen.com/), both
for easy passing of images to gocr and for writing modules for image
preprocessing and character recognition;
b) to get it to deal better with the texts I have. Some combinations of
letters are not recognised, e.g. "Th", which is probably due to the gap
between the letters being too small. This suggests to me that these
combinations need to be added as if they were a single character. I have
quite a lot of texts containing blank spaces with a line for people to write
answers on, and it would be nice to get a series of underscores for these.
I would also like underlined text to be recognised.

The recognition technique described works from the topology of various letters.
I wonder whether a new letter image could simply be XORed against a cache of
already recognised letter images to find a match. Obviously, in the extreme
case of the bitmaps being identical, the result must logically be the same as
it was previously. This should certainly work well with texts grabbed from the
screen, which I'm also interested in. With a connection to a scripting
language which can control the scanner, I could scan or rescan at high
definition to find a predictable origin for the image of a letter and then
resample down, which may reduce the differences between images of the same
letter caused by shifts of less than one pixel. A few small, unclustered
differences would lower the confidence only slightly, but a cluster of
differences may indicate letters with accents etc. If I get no exact match,
but it looks like "h" with something at the bottom, "o" with something at the
bottom, then "w" with something at the bottom, it is probably "how"
underlined. So I may be able to recognise underlined text like this without
needing to add underlined versions of each character. Maybe there is a big
problem with this technique in practice; I'd be pleased if anyone who knows
of one could tell me.
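
To make the XOR idea a little more concrete, here is a rough sketch in Python
(not Pike, and not based on anything in gocr). It assumes bitmaps are tuples
of equal-length strings with "#" for ink, that the glyph and the cached image
are already the same size and aligned, and that "confidence" is simply the
fraction of pixels that agree. All the names (xor_diff, largest_cluster,
match, CACHE) are made up for illustration.

```python
def same_shape(a, b):
    """True if both bitmaps have the same number of rows and columns."""
    return len(a) == len(b) and all(len(x) == len(y) for x, y in zip(a, b))


def xor_diff(a, b):
    """Set of (row, col) positions where two same-sized bitmaps differ."""
    return {
        (r, c)
        for r, (row_a, row_b) in enumerate(zip(a, b))
        for c, (pa, pb) in enumerate(zip(row_a, row_b))
        if pa != pb
    }


def largest_cluster(diff):
    """Size of the largest 8-connected group of differing pixels."""
    remaining = set(diff)
    best = 0
    while remaining:
        stack = [remaining.pop()]
        size = 0
        while stack:
            r, c = stack.pop()
            size += 1
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    n = (r + dr, c + dc)
                    if n in remaining:
                        remaining.remove(n)
                        stack.append(n)
        best = max(best, size)
    return best


def match(glyph, cache):
    """Return (char, confidence, largest cluster) for the closest cached
    bitmap of the same size, or None if no cached bitmap has that size."""
    best = None
    for char, ref in cache.items():
        if not same_shape(glyph, ref):
            continue
        diff = xor_diff(glyph, ref)
        if best is None or len(diff) < len(best[1]):
            best = (char, diff)
    if best is None:
        return None
    char, diff = best
    area = sum(len(row) for row in glyph) or 1
    return char, 1.0 - len(diff) / area, largest_cluster(diff)


# A tiny hand-drawn "h" with an empty bottom row, as the only cache entry.
H = ("#...", "#...", "###.", "#..#", "#..#", "....")
CACHE = {"h": H}
H_UNDERLINED = H[:-1] + ("####",)
H_SPECKS = ("#..#",) + H[1:-1] + ("#...",)
```

Here match(H_UNDERLINED, CACHE) still picks "h" but reports one cluster of
four differing pixels along the bottom row, whereas H_SPECKS gives two
isolated differences with a largest cluster of one. Whether the size and
position of the cluster is enough to tell an underline from an accent is
exactly the part I am unsure about.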

Clearly, to do this I need a module which can learn: it must receive the
image, so that it can make its own attempt at it, and also the final
character decided upon by the other techniques.
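
As a sketch of the interface I have in mind (again in Python, with a
hypothetical class name and methods that are not part of gocr or libgocr),
such a module only needs two entry points: one to make a guess from the
image and one to be told the final answer. This version handles only the
exact-match case and assumes bitmaps arrive as sequences of row strings.

```python
from collections import Counter, defaultdict


class GlyphCache:
    """Hypothetical learning module: remembers which character each exact
    bitmap was finally recognised as."""

    def __init__(self):
        # bitmap (tuple of row strings) -> Counter of final characters
        self._seen = defaultdict(Counter)

    def guess(self, bitmap):
        """Return (char, share of past votes) for an identical bitmap,
        or None if this bitmap has never been learned."""
        votes = self._seen.get(tuple(bitmap))
        if not votes:
            return None
        char, count = votes.most_common(1)[0]
        return char, count / sum(votes.values())

    def learn(self, bitmap, char):
        """Record the character the other techniques finally decided on."""
        self._seen[tuple(bitmap)][char] += 1


cache = GlyphCache()
o_bitmap = (".##.", "#..#", "#..#", ".##.")
first_guess = cache.guess(o_bitmap)   # nothing learned yet
cache.learn(o_bitmap, "o")
cache.learn(o_bitmap, "o")
cache.learn(o_bitmap, "0")            # the other techniques disagreed once
later_guess = cache.guess(o_bitmap)
```

Keeping a count per character rather than a single answer means that an
occasional wrong decision by the other techniques lowers the reported share
instead of overwriting what was learned before.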

I see that the declared aim is to move to a gocr built around libgocr, but
I also see that new gocr versions are released quite often and there is no
new libgocr. I'm not sure whether I should look at the separately downloadable
libgocr or at the files in gocr-0.41/api. I also see references in
gocr-0.41/api/doc/api.txt to an MDK, but can't see one to download on
SourceForge.

Could someone please point me in the direction of the source code and
documentation I should start with?

Yours
Ian