#33 OCR text recognition

Status: closed
Owner: nobody
Labels: None
Priority: 1
Updated: 2023-07-26
Created: 2007-04-07
Private: No

Ok, I realize this is probably crazy, but who knows ...

One feature I really like about Acrobat is the OCR feature. I often have PDFs of scanned text. Acrobat can do OCR on them and add the text as an "invisible" layer to the original PDF. So you end up with a scanned PDF which is, at the same time, searchable and where you can copy text.

There are several open source OCR projects. I don't know anything about their quality, so I'll just list them here:

Tesseract http://sourceforge.net/projects/tesseract-ocr
Ocrad http://www.gnu.org/software/ocrad/ocrad.html
GOCR aka JOCR, http://www-e.uni-magdeburg.de/jschulen/ocr/
Clara http://www.claraocr.org/
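
For readers wondering what the "invisible" layer amounts to: in PDF terms, text drawn with text rendering mode 3 (`3 Tr`) is neither filled nor stroked, so it stays searchable and selectable without showing on top of the scan. The following is only a minimal sketch of that idea, not how Acrobat or any of the listed projects implement it. The `Word` type, the function names, and the font resource name `F1` are hypothetical, and it assumes the OCR engine has already produced words with positions in PDF points.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """One OCR result (hypothetical type): text plus its position in PDF points."""
    text: str
    x: float     # left edge in PDF user space (origin at bottom-left)
    y: float     # baseline
    size: float  # font size roughly matching the scanned glyph height


def escape_pdf_string(text: str) -> str:
    """Escape the characters that are special inside a PDF literal string."""
    return (text.replace("\\", "\\\\")
                .replace("(", "\\(")
                .replace(")", "\\)"))


def invisible_text_ops(words: List[Word], font: str = "F1") -> str:
    """Build content-stream operators that place each word invisibly.

    Assumes a font named `font` exists in the page's resource dictionary.
    """
    lines = ["BT", "3 Tr"]  # begin text object; render mode 3 = invisible
    for w in words:
        lines.append(f"/{font} {w.size:g} Tf")
        # Tm sets an absolute text matrix, so each word is placed independently
        lines.append(f"1 0 0 1 {w.x:g} {w.y:g} Tm")
        lines.append(f"({escape_pdf_string(w.text)}) Tj")
    lines.append("ET")  # end text object
    return "\n".join(lines)


if __name__ == "__main__":
    print(invisible_text_ops([Word("Skim (PDF)", 72, 700.5, 12)]))
```

The resulting operators would still have to be appended to the page's content stream by a PDF writing library, which is the part the discussion below is mostly about.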

Discussion

  • Christiaan Hofman

    • priority: 5 --> 3
     
  • Christiaan Hofman

    Logged In: YES
    user_id=1162009
    Originator: NO

    Ocrad & GOCR/JOCR are GPL, so incompatible with Skim's BSD licence.

    Clara seems to be just a list of OCR and related projects, many of them commercial. The quick look I took over it didn't give me much info (it's the worst site ever, it's not clear to me what they want to convey there).

    As for Tesseract, I saw several posts noting that it doesn't compile on Mac OS X. Couldn't find much documentation on their site. But it seems to be the best around (even though lacking).

    Anyway, this seems like a lot of work. One problem is that even if we were able to incorporate such a tool, we don't have too much control over the view layout, as PDFKit does that for us, and is far from transparent. So we cannot (easily) add views to overlay the PDF like Acrobat does.

     
  • Nobody/Anonymous

    Logged In: NO

    I readily support this feature, too. DevonThink Office Pro does this and it is a godsend for those of us that often get scanned docs.

     
  • Christiaan Hofman

    • priority: 3 --> 2
     
  • Christiaan Hofman

    • priority: 2 --> 1
     
  • Nobody/Anonymous

    Logged In: NO

    Ocropus seems to be a promising OSS OCR project. It is a layout analysis engine which supports multiple character recognition engines (right now, they only have the Tesseract plugin). The code is released under the Apache License 2.0 and the project is very active. In addition, it compiles on OS X Leopard (http://groups.google.com/group/ocropus/msg/1c398cbf593105a9).

     
  • Nobody/Anonymous

    Logged In: NO

    For more flexible PDF handling, have a look at libharu: http://libharu.sourceforge.net/ It is licensed under the ZLIB/LIBPNG license, which is compatible with the BSD license.

     
  • Peter

    Peter - 2008-06-11

    Logged In: YES
    user_id=2114295
    Originator: NO

    I fully second this comment. In fact, I came on these boards with the explicit purpose of requesting this. Currently, I have to use Acrobat to perform any sort of OCR. I get a LOT of PDFs from my library that they scan but don't bother to run OCR on, and since I take all my notes in Skim, it only makes sense to include an OCR option. Pretty please?

     
  • Christiaan Hofman

    Logged In: YES
    user_id=1162009
    Originator: NO

    For the record: it's almost certainly impossible to offer an OCR feature. Sorry.

     
  • bjbook

    bjbook - 2008-07-01

    Logged In: YES
    user_id=1217079
    Originator: NO

    One more "me too". Built-in OCR, even if less precise than dedicated OCR, is an incredibly powerful tool for PDF usage.

    Another potential codebase to operate with could be DigitEyeOCR.

    <http://www.digiteyeocr.free.fr/downloads.php?lang=en>

     
  • Christiaan Hofman

    • status: open --> closed
     
  • sf221sf221

    sf221sf221 - 2023-07-26

    Can the impossibility of this request be revisited in view of macOS' Live Text feature that was introduced in Monterey? A project mentioned in that link uses the VNRecognizeTextRequest API.

     

    Last edit: sf221sf221 2023-07-26
    • Christiaan Hofman

      No. This is a question for Apple to apply in PDFKit.

       
