We are very interested in co-ordinate output, but at the character level rather than the word level. We are willing to pay for someone to help us get co-ordinate output from Tesseract OCR, but have not been able to do it yet. What engine are you currently using?
At the Internet Archive we have been using the djvu.xml format to represent the words on a page along with their coordinates. We have a lot of books in this format (over 100k), so it might be useful to others for training etc. Most of our books have been OCR'ed with ABBYY. For example, this book:
http://www.archive.org/details/owlandpussycat00leariala
http://www.archive.org/download/owlandpussycat00leariala/owlandpussycat00leariala_djvu.xml
If you are working on coordinate output from this program, I hope you will consider using this format.
If folks want help with this, please let us know as we are interested.
-brewster
brewster@archive.org
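For anyone who wants to pull the word boxes out of one of these djvu.xml files, here is a minimal Python sketch. It assumes each word is a WORD element carrying a comma-separated coords attribute, and the sample string is made up for illustration rather than copied from a real book. The order of the numbers (I believe left, bottom, right, top in pixels, sometimes followed by a fifth baseline value) is also an assumption, so check it against a page image before relying on it.

```python
import xml.etree.ElementTree as ET


def word_boxes(xml_text):
    """Return a list of (text, [numbers]) for every WORD element with coords."""
    root = ET.fromstring(xml_text)
    boxes = []
    for word in root.iter("WORD"):
        coords = word.get("coords")
        if not coords:
            continue  # skip words that carry no coordinate attribute
        numbers = [int(n) for n in coords.split(",")]
        boxes.append((word.text or "", numbers))
    return boxes


# Made-up sample in the assumed layout, not taken from a real book.
sample = """<DjVuXML><BODY><OBJECT><HIDDENTEXT><PAGECOLUMN><REGION><PARAGRAPH>
<LINE><WORD coords="100,250,180,200">The</WORD> <WORD coords="200,250,290,200">Owl</WORD></LINE>
</PARAGRAPH></REGION></PAGECOLUMN></HIDDENTEXT></OBJECT></BODY></DjVuXML>"""

for text, numbers in word_boxes(sample):
    print(text, numbers)
```

For a real file, read the downloaded _djvu.xml into a string and pass it to word_boxes in the same way.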
We are very interested in co-ordinate output, but at the character level rather than the word level. We are willing to pay for someone to help us get co-ordinate output from Tesseract OCR, but have not been able to do it yet. What engine are you currently using?
We have been using ABBYY's character-based OCR format, but it is so verbose that it is crippling.
If you look in our directories, such as http://www.archive.org/download/owlandpussycat00leariala/ you will see a compressed ABBYY file.
We will let this forum know if we make any progress on this piece.
-brewster
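Since the ABBYY output is so verbose, one way to get at character-level boxes without loading the whole file is to stream it. The Python sketch below assumes each character is a charParams element with l, t, r and b attributes for its left, top, right and bottom edges; that layout is my understanding of ABBYY's XML and should be checked against an actual file. The sample bytes and the namespace in them are invented for illustration.

```python
import io
import xml.etree.ElementTree as ET


def char_boxes(stream):
    """Yield (char, left, top, right, bottom) for each charParams element."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        # Drop any XML namespace prefix such as "{...}charParams".
        if elem.tag.rsplit("}", 1)[-1] == "charParams":
            box = tuple(int(elem.get(key)) for key in ("l", "t", "r", "b"))
            yield (elem.text or "",) + box
            elem.clear()  # free the element's contents once it has been read


# Invented sample in the assumed layout; the namespace is a placeholder.
sample = (
    b'<document xmlns="http://example.invalid/ns"><page><line>'
    b'<charParams l="10" t="20" r="30" b="60">O</charParams>'
    b'<charParams l="32" t="20" r="50" b="60">w</charParams>'
    b"</line></page></document>"
)

for row in char_boxes(io.BytesIO(sample)):
    print(row)
```

If the compressed file in the download directory is gzip, opening it with gzip.open(path, "rb") and passing the result to char_boxes should let you read it without unpacking it first.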
Here is a Windows DLL that will give you coordinates:
www.scanhelp.com/pfile/tessdll.zip