Getting accurate word boundaries

2006-06-13
2013-04-29
  • Dear Tom,

    I'm using spatial indexing to organize text extraction with Multivalent, and the delineation of words within the PDF files is occasionally erratic. Whitespace is sometimes included in the word boundary (occasionally enough to carry over into the next column).

    I'm using the font information (obtained from your reply to the last post) to try to calculate this, but there are other variables involved.

    If f is the font information at the start of the word, I try to calculate the width of the string 's' using the following approach. Sometimes this approach adds a trailing space, sometimes not. Sometimes it cuts off the data. Is there additional data that I am missing?

    double wordWidth = f.getStringBounds(s).getWidth();

    Gully

     
    • Tom Phelps
      2006-06-14

      getBounds().getWidth() does what you think.  If the spot field in Context is set, then that should be used instead of the usual case.  Is the node bounding box accurate?  If you think there is a bug in getStringBounds(), can you give me a test case I can run to reproduce it?

       
    • Tom Phelps
      2006-06-14

      If all you want is the width, NFont.stringWidth() is faster.

       
    • It seems that I get errors only when the text is large and bold. I will submit a bug report.

      NFont.stringWidth() does not exist in my implementation (or it must be a private or protected method, since I can't use it from my code).

       
    • Tom Phelps
      2006-06-20

      I received the PDF via email.  Most PDFs display the text from the font information in the file, and in that case bounding boxes should be tight.  But that PDF was created by scanning paper, and the font information is only a guess by the OCR software.  If you look at page 1 with the View/Show OCR or the OCR lens, you can see that the text displayed with fonts usually matches well with the text in the image, but in the boldface in the title it is off -- in particular it is wider.  In other words, it appears that the bounding boxes computed from the embedded font information are accurate, but this is a scanned PDF with estimated fonts and so the bounding boxes don't match with the original.

       
    • Right. That makes sense.

      So, even trying to recompute the bounding box is likely to fail if the guessed Font is incorrect?

      Would one solution be to repeat the OCR on a high resolution scan? (I know that this is computationally expensive, but I'm trying to work out a good workflow strategy for solving this problem).

      Thanks

      Gully

       
    • Tom Phelps
      2006-06-21

      The problem is that the PDF starts as scanned paper, so the OCR software guesses at fonts.  Maybe some OCR software reports word bounding boxes directly.  Probably the lowest-effort fix is, where two boxes overlap, to trim the bounding box of the leftmost one.
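
The trimming idea in the last reply could be sketched roughly as follows. This is only an illustration, not Multivalent code: the class name BoxTrimmer and the method trimOverlaps are hypothetical, the boxes are plain java.awt.geom.Rectangle2D objects rather than Multivalent nodes, and the sketch assumes the word boxes belong to one line of text and are already sorted by their left edge.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Multivalent.
public class BoxTrimmer {

    /**
     * Returns copies of the given word boxes in which any box that runs past
     * the left edge of its right-hand neighbour is cut back to that edge.
     * Assumes the boxes are on one text line and sorted left to right.
     */
    public static List<Rectangle2D> trimOverlaps(List<Rectangle2D> boxes) {
        List<Rectangle2D> trimmed = new ArrayList<>();
        for (int i = 0; i < boxes.size(); i++) {
            Rectangle2D cur = boxes.get(i);
            double right = cur.getMaxX();
            if (i + 1 < boxes.size()) {
                double nextLeft = boxes.get(i + 1).getMinX();
                if (nextLeft < right) {
                    // Overlap: trim the left box, never to a negative width.
                    right = Math.max(cur.getMinX(), nextLeft);
                }
            }
            trimmed.add(new Rectangle2D.Double(
                    cur.getMinX(), cur.getMinY(),
                    right - cur.getMinX(), cur.getHeight()));
        }
        return trimmed;
    }
}
```

Only the right edge of the left-hand box changes, so the start position of every word is preserved. Boxes that do not overlap their neighbour are copied unchanged.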