Tesseract OCR / Discussion / Open Discussion: possible format for coordinate information

Brewster Kahle - 2007-01-11

At the Internet Archive we have been using the djvu.xml format to represent the words on a page along with their coordinates. We have lots and lots of books in this format (over 100k), so this might be useful to others for training etc. (Most of our books have been OCR'ed with ABBYY.) For example, this book:
http://www.archive.org/details/owlandpussycat00leariala
http://www.archive.org/download/owlandpussycat00leariala/owlandpussycat00leariala_djvu.xml

If you are working on coordinate output from this program, I hope you will consider using this format.
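
For anyone who wants to try the format, here is a minimal sketch of pulling word-level coordinates out of a djvu.xml file with Python's standard library. It assumes each word is a WORD element carrying a comma-separated coords attribute; the inline sample is a hypothetical fragment, not copied from the book above, and the function name extract_words is made up for illustration. The order and meaning of the numbers in coords is not interpreted here and should be checked against a real file.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment modeled on a WORD/coords layout (an assumption);
# real files may nest elements differently or carry extra attributes.
SAMPLE = """<DjVuXML>
  <BODY>
    <OBJECT>
      <HIDDENTEXT>
        <PAGECOLUMN><REGION><PARAGRAPH>
          <LINE>
            <WORD coords="100,250,180,200">The</WORD>
            <WORD coords="200,250,290,200">Owl</WORD>
          </LINE>
        </PARAGRAPH></REGION></PAGECOLUMN>
      </HIDDENTEXT>
    </OBJECT>
  </BODY>
</DjVuXML>"""


def extract_words(xml_text):
    """Return a list of (word, [coordinate integers]) in document order."""
    root = ET.fromstring(xml_text)
    words = []
    # iter() walks the whole tree, so the nesting depth of WORD does not matter.
    for word in root.iter("WORD"):
        raw = word.get("coords", "")
        # Keep the numbers exactly as listed; no assumption about their order.
        coords = [int(v) for v in raw.split(",") if v.strip()]
        words.append(((word.text or "").strip(), coords))
    return words


if __name__ == "__main__":
    for text, coords in extract_words(SAMPLE):
        print(text, coords)
```

To run it on a downloaded file, read the file's contents and pass them to extract_words in place of SAMPLE.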

If folks want help with this, please let us know as we are interested.

-brewster
brewster@archive.org

- Nathan - 2007-01-11
  
  We are very interested in co-ordinate output, but at the character level rather than the word level. We are willing to pay for someone to help us get co-ordinate output out of Tesseract OCR, but have not been able to do it yet. What engine are you using currently?
  
- Brewster Kahle - 2007-01-11
  
  We have been using ABBYY's character-based OCR format, but it is so verbose that it is crippling.
  
  If you look in our directories, such as http://www.archive.org/download/owlandpussycat00leariala/, you will see a compressed ABBYY file.
  
  We will let this forum know if we make any progress on this piece.
  
  -brewster
  
- JetsoftDev.com - 2007-04-08
  
  Here is a Windows DLL that will give you coordinates:
  
  www.scanhelp.com/pfile/tessdll.zip
  