possible format for coordinate information

2007-01-11
2013-04-25
  • Brewster Kahle

    Brewster Kahle - 2007-01-11

    At the Internet Archive we have been using the djvu.xml format to represent the words in a document along with their coordinates. We have lots and lots of books in this format (over 100k), so it might be useful to others for training, etc. Most of our books have been OCR'ed with ABBYY. For example, this book:
    http://www.archive.org/details/owlandpussycat00leariala
    http://www.archive.org/download/owlandpussycat00leariala/owlandpussycat00leariala_djvu.xml

    If you are working on coordinate output from this program, I hope you will consider using this format.
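
    For readers who want to try the format, here is a minimal sketch of pulling words and their boxes out of a djvu.xml text layer. It assumes each word is a WORD element with a comma-separated "coords" attribute; the sample document, the function name extract_words, and the meaning of the four numbers are illustrative assumptions, so check them against a real file such as the one linked above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written sample in the general shape of a djvu.xml
# text layer. Real files are much larger; the values here are made up.
SAMPLE = """<DjVuXML>
  <BODY>
    <OBJECT>
      <HIDDENTEXT>
        <PAGECOLUMN><REGION><PARAGRAPH>
          <LINE>
            <WORD coords="100,250,180,200">The</WORD>
            <WORD coords="200,250,290,200">Owl</WORD>
          </LINE>
        </PARAGRAPH></REGION></PAGECOLUMN>
      </HIDDENTEXT>
    </OBJECT>
  </BODY>
</DjVuXML>"""


def extract_words(xml_text):
    """Return a list of (text, (a, b, c, d)) pairs, one per WORD element.

    The four integers are returned in file order. Which one is left, top,
    right, or bottom is an assumption to verify against real data.
    """
    root = ET.fromstring(xml_text)
    words = []
    for word in root.iter("WORD"):
        raw = word.get("coords", "")
        try:
            nums = [int(part) for part in raw.split(",") if part.strip()]
        except ValueError:
            continue  # skip words whose coords are not plain integers
        if len(nums) < 4:
            continue  # skip words without a full box
        # Keep only the first four values; any extra values are ignored.
        words.append(((word.text or "").strip(), tuple(nums[:4])))
    return words


if __name__ == "__main__":
    for text, box in extract_words(SAMPLE):
        print(text, box)
```

    To run it on a downloaded file, read the file's contents into a string and pass that to extract_words in place of SAMPLE.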

    If folks want help with this, please let us know as we are interested.

    -brewster
    brewster@archive.org

     
    • Nathan

      Nathan - 2007-01-11

      We are very interested in coordinate output, but at the character level rather than the word level. We are willing to pay for someone to help us get coordinate output out of Tesseract OCR, but we have not been able to do it yet. What engine are you using currently?

       
    • Brewster Kahle

      Brewster Kahle - 2007-01-11

      We have been using ABBYY's character-based OCR format, but it is so verbose that it is crippling.

      If you look in our directories, such as http://www.archive.org/download/owlandpussycat00leariala/, you will see a compressed ABBYY file.

      We will let this forum know if we make any progress on this piece.

      -brewster

       
    • JetsoftDev.com

      JetsoftDev.com - 2007-04-08

      Here is a Windows DLL that will give you coordinates:

      www.scanhelp.com/pfile/tessdll.zip

       
