Best way to read forms?

2007-02-21
2013-04-25
  • Brian Michalk
    2007-02-21

    I've been plowing through the Doxygen pages.
    I need to read forms in which the text I am interested in is completely bounded by a box.  The final result would be a file that reports the x/y coordinate of one corner of each box and the text it contained.  With that data I can search for rows and columns.  I googled and found an example:
    http://www.quasar.ualberta.ca/edit202/tutorial/spreadsheet/ExcelBasics/ExcelBasics.htm
    Assuming this was printed out, one would see a row entry of "02-May" with an x/y coordinate.  Any text with the same "y" belongs to the "02-May" line.  Searching for "Observed" gives the "x" address of that column.  Combining that x and y yields the "Observed" value for "02-May".
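A minimal sketch of that row/column lookup, assuming the OCR step has already produced a list of (x, y, text) entries, one per box. The names `find_cell` and `tol` and the sample coordinates are hypothetical; the tolerance is there because scanned coordinates rarely match exactly.

```python
from typing import List, Optional, Tuple

Box = Tuple[int, int, str]  # (x, y, text) for one corner of a box


def find_cell(boxes: List[Box], row_label: str, col_label: str,
              tol: int = 5) -> Optional[str]:
    """Return the text at the intersection of a labelled row and column."""
    # The row label supplies the y coordinate, the column header the x.
    row_y = next((y for _, y, text in boxes if text == row_label), None)
    col_x = next((x for x, _, text in boxes if text == col_label), None)
    if row_y is None or col_x is None:
        return None
    # Look for a box whose corner lies near (col_x, row_y).
    for x, y, text in boxes:
        if abs(x - col_x) <= tol and abs(y - row_y) <= tol:
            return text
    return None


# Made-up coordinates, for illustration only.
boxes = [
    (10, 10, "Date"), (200, 10, "Observed"),
    (10, 60, "02-May"), (201, 58, "17"),
]
print(find_cell(boxes, "02-May", "Observed"))
```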

    I was thinking about creating a list of blobs where each blob designates the inside of a box.  The x/y location of each blob is known.  Next, find all BOXes that fall within each blob and output the coordinates with the text.
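The containment step could look roughly like this. It is only a sketch: it assumes both the form boxes and the recognized words are already available as (left, top, right, bottom) rectangles, that each cell holds a single line of text, and the names `contains` and `group_words` are made up for illustration rather than taken from the Tesseract sources.

```python
from typing import Dict, List, Tuple

Rect = Tuple[int, int, int, int]  # (left, top, right, bottom)


def contains(outer: Rect, inner: Rect) -> bool:
    """True if inner lies completely inside outer."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def group_words(cells: List[Rect],
                words: List[Tuple[Rect, str]]) -> Dict[Tuple[int, int], str]:
    """Map each cell's top-left corner to the text found inside it."""
    result: Dict[Tuple[int, int], str] = {}
    for cell in cells:
        inside = [(rect, text) for rect, text in words if contains(cell, rect)]
        # Assumes single-line cells, so left-to-right order is enough.
        inside.sort(key=lambda item: item[0][0])
        if inside:
            result[(cell[0], cell[1])] = " ".join(text for _, text in inside)
    return result


# Made-up rectangles, for illustration only.
cells = [(0, 0, 100, 30), (100, 0, 200, 30)]
words = [((60, 5, 95, 25), "May"), ((5, 5, 55, 25), "02"),
         ((110, 5, 150, 25), "17")]
print(group_words(cells, words))
```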

    Am I on the right track?  Is there a better way to do this?  I know only a little about OCR.

     
    • Key question: Are the boxes in *FIXED* locations on the form? (mine are, on about 4 dozen forms :-)

      If so, you can use pamcut (from http://netpbm.sourceforge.net/) to extract just that box and feed that to tess. It's the same thing you're talking about; the only difference is the stage at which the boxing happens :-)

      I've been using this with GREAT results - the locations of the boxes are often a "signature" of the form so that automates a lot of other things.

      Cheers,
      Fil
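For fixed box locations, the pamcut-then-tess approach above could be scripted along these lines. This is a sketch under assumptions: `pamcut` and `tesseract` are on the PATH, the scan is already in a Netpbm format, and the tesseract build can read that cropped file directly (older builds may need a conversion to TIFF first). The function names and the example coordinates are hypothetical.

```python
import subprocess
from typing import List


def pamcut_command(src: str, left: int, top: int,
                   width: int, height: int) -> List[str]:
    """Build the pamcut call that crops one box; pamcut writes to stdout."""
    return ["pamcut", "-left", str(left), "-top", str(top),
            "-width", str(width), "-height", str(height), src]


def tesseract_command(image: str, output_base: str) -> List[str]:
    """Build the tesseract call; the text lands in <output_base>.txt."""
    return ["tesseract", image, output_base]


def ocr_box(src: str, left: int, top: int, width: int, height: int,
            output_base: str) -> str:
    """Crop one fixed-position box out of the form and OCR just that box."""
    crop = output_base + ".pnm"
    with open(crop, "wb") as out:
        subprocess.run(pamcut_command(src, left, top, width, height),
                       stdout=out, check=True)
    subprocess.run(tesseract_command(crop, output_base), check=True)
    with open(output_base + ".txt", encoding="utf-8") as result:
        return result.read().strip()


# Example with made-up coordinates:
# text = ocr_box("form.pnm", 120, 340, 200, 40, "lastname")
```

Keeping one table of box coordinates per form type would then let the same script handle each of the standardized forms.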

       
      • Brian Michalk
        2007-02-21

        Interesting idea.  Some of the forms are standardized.  One would think that pulling out the boxes would be fairly simple.

         
    • I should qualify that a bit: for very small boxes, tess has issues with
      'guessing' which context to choose for the final word... so I often save the 'debugging'
      output (via the config file) showing the INITIAL word it thinks it sees (i.e., before it
      digs into the DAWG) and then fix that up with a perl script that runs the patterns against
      the main database (think first & last name - tess invents some interesting words otherwise...)

      Cheers,
      Fil
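The fix-up step can be approximated in a few lines. Note that this is not the perl script mentioned above: it is a Python sketch that swaps in fuzzy matching from the standard `difflib` module, and the name list, the `correct_word` function, and the 0.6 cutoff are all assumptions chosen for illustration.

```python
import difflib
from typing import List


def correct_word(raw: str, known: List[str], cutoff: float = 0.6) -> str:
    """Replace a raw OCR word with the closest known entry, if any is close."""
    # Compare case-insensitively, but return the entry as stored.
    by_lower = {name.lower(): name for name in known}
    matches = difflib.get_close_matches(raw.lower(), list(by_lower),
                                        n=1, cutoff=cutoff)
    return by_lower[matches[0]] if matches else raw


# Hypothetical stand-in for the "main database" of first and last names.
known_names = ["Brian", "Michalk", "Fil"]
print(correct_word("Mlchalk", known_names))
```

A word with no close match is returned unchanged, so unknown values are left for manual review rather than forced onto a wrong name.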