I had some thoughts about OCR engines and layout detection lately and finally had some time to write it all down. I'm not a skilled programmer myself, so most of it is written from a user-viewpoint and may be harder to implement than it sounds/reads.
This is not directly related to Tesseract OCR, but aims to be a general guideline for how things _should_ work, IMHO. It may even become completely new software.
The attached document describes a framework for how I imagine an extensible/modular scan and recognition workflow.
Any part of this workflow should be usable (as plugins or libraries) with or within the existing scan software such as Kooka, Tesseract, Ocrad and more...
More info inside the attached PDF.
So long,
Werner
Related feature requests:
Modularity
http://sourceforge.net/tracker/index.php?func=detail&aid=1556077&group_id=158586&atid=808427
form recognition
http://sourceforge.net/tracker/index.php?func=detail&aid=1554343&group_id=158586&atid=808427
Support PNG images
http://sourceforge.net/tracker/index.php?func=detail&aid=1552481&group_id=158586&atid=808427
Barcode recognition
http://sourceforge.net/tracker/index.php?func=detail&aid=1550978&group_id=158586&atid=808427
Version 1.0 (2006-12-30)
Logged In: YES
user_id=37894
Originator: NO
Well, you don't have to be a skilled C++ programmer to make a stab at it - if that disqualified one from helping out I should leave :-)
Get libtiff and look through it. Get libpng and look through it. Look at how tesseractmain.cpp (in ccmain/) uses libtiff and add an #ifdef blah, #else, #endif around that and try to get it to read PNG. I don't think that would be hard.
Form recognition is NOT something someone can just "hack". That requires a decent review of literature and decent programming skills - say, a grad student looking for a nice Masters, using tess as a building block :-) Zoning that WORKS takes brains.
Modularity you can forget about. The best you can hope for is for tess to remain an "engine" and not tied to some all-inclusive GUI system... ;-) [inside joke for now]
Barcode recognition has to wait until zoning... OR until someone takes gocr and "extracts" its barcode support and integrates it into tess.
Cheers,
Fil
P.S. Also, please realize that for most sourceforgers, these projects are a spare-time hobby. I hope that when the tess website comes on-line, you can convert the PDF file into more of a road-map with an analysis of existing GNU tools "plotted" on it. That would give folks looking for something to hack on an excellent "overview" of what works, what does not, and what needs more work/rework to improve.
Logged In: YES
user_id=1434318
Originator: YES
I actually know quite well how sourceforge projects work, since I'm already involved in some in my spare time next to work ... which is also taking up most of what is left of it. So I just summed up my thoughts on this (and I'm sure most of it has been suggested a hundred times before). Use it or not, it's up to whomever it might help. :)
> Well, you don't have to be a skilled C++ programmer to make a stab at it -
> if that disqualified one from helping out I should leave :-)
> [SNIP]
> Form recognition is NOT something someone can just "hack". That requires a
> decent review of literature and decent programming skills - say, a grad
> student looking for a nice Masters, using tess as a building block :-)
> Zoning that WORKS takes brains.
What now? "Not a skilled programmer" or "decent programming skills" ;) But I know what you mean, don't worry :)
Re on modular approach (and general):
I specifically mentioned that this is not centered on Tesseract (but where else to post this?).
It's actually as you said: Tesseract (or whatever OCR software) would only be used as a (text) engine inside the (new?) framework; that's exactly what I'm talking about. Modifying an engine to turn it into a modular framework is not really a good approach anyway.
Each stage feeds the next: ImageReader -> LayoutRecognition -> text/table/whatever recognition
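A minimal sketch of that chain, just to make the idea concrete. Every name in it (`read_image`, `detect_layout`, `Region`, `register`, `run_pipeline`) is hypothetical and not taken from Tesseract or any other existing tool, and the registered recognizer is only a stub standing in for whichever engine gets plugged in:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# All names below are hypothetical; they only illustrate the stage chain.


@dataclass
class Region:
    kind: str                 # e.g. "text", "table", "barcode"
    pixels: List[List[int]]   # grayscale rows belonging to this region


def read_image(rows: List[List[int]]) -> List[List[int]]:
    """ImageReader stage: a real one would decode a file; this passes rows through."""
    return rows


def detect_layout(image: List[List[int]]) -> List[Region]:
    """LayoutRecognition stage: a stub that treats the whole page as one text zone."""
    return [Region(kind="text", pixels=image)]


# Recognizers are looked up by region kind, so engines can be swapped or added
# without touching the other stages.
Recognizer = Callable[[Region], str]
RECOGNIZERS: Dict[str, Recognizer] = {}


def register(kind: str, recognizer: Recognizer) -> None:
    RECOGNIZERS[kind] = recognizer


def run_pipeline(rows: List[List[int]]) -> List[str]:
    """Feed each stage's output into the next and collect one result per region."""
    image = read_image(rows)
    results = []
    for region in detect_layout(image):
        recognizer = RECOGNIZERS.get(region.kind)
        if recognizer is None:
            results.append(f"<no engine for {region.kind}>")
        else:
            results.append(recognizer(region))
    return results


# Stub standing in for a real text engine.
register("text", lambda region: f"text zone with {len(region.pixels)} rows")
```

The point of the lookup table is that a barcode or table recognizer could be registered the same way, without the reader or the layout stage knowing about it.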
Image conversion:
Hacking different image formats into Tesseract is not really a good idea. Using a generic framework to feed Tesseract (the engine) a unified image format would be the better approach.
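As a rough sketch of what "unified image format" could mean in practice: the framework sniffs the file type from its leading bytes and hands the data to a matching decoder, and every decoder returns the same in-memory structure. The PNG and TIFF signatures are the standard ones; `UnifiedImage`, `sniff_format`, `load_unified` and the decoder table are assumptions made up for this illustration, and the actual decoders (which would wrap libpng, libtiff and so on) are left out:

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

PNG_MAGIC = b"\x89PNG\r\n\x1a\n"          # 8-byte PNG signature
TIFF_MAGICS = (b"II*\x00", b"MM\x00*")    # little-endian and big-endian TIFF


@dataclass
class UnifiedImage:
    """Hypothetical common format handed to the engine."""
    width: int
    height: int
    gray: bytes  # one byte per pixel, row-major


Decoder = Callable[[bytes], UnifiedImage]


def sniff_format(data: bytes) -> Optional[str]:
    """Guess the container format from the first bytes of the file."""
    if data.startswith(PNG_MAGIC):
        return "png"
    if data.startswith(TIFF_MAGICS):
        return "tiff"
    return None


def load_unified(data: bytes, decoders: Dict[str, Decoder]) -> UnifiedImage:
    """Dispatch to whichever decoder is registered for the detected format."""
    fmt = sniff_format(data)
    if fmt is None:
        raise ValueError("unrecognized image format")
    if fmt not in decoders:
        raise ValueError(f"no decoder registered for {fmt}")
    return decoders[fmt](data)
```

Supporting a new format then means adding one decoder to the table; the engine itself never has to change.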
barcodes:
Same is true here - extracting barcode support from gocr is only a good idea if it doesn't work out in gocr (so why extract it and integrate it into another OCR program in the first place?) - I rather think it should be used as a lib as well. If it can be used in such a way already - even better.
GUI:
GUI? What a joke ;) That would be the last thing on my list... If there is such a library/framework as in my summary, it is (should be) easily integrated into whatever UI you like (e.g. the GNOME and KDE guys are quite good at this).
Werner
PS: If somebody wants the source of the PDF (OpenOffice Draw) some time, just ask.