OCR text recognition

A PDF Reader and Note-taker for OS X

Brought to you by: amaxwell, hofman, mmcc

#33 OCR text recognition

Status: closed

Owner: nobody

Labels: None

Priority: 1

Updated: 2023-07-26

Created: 2007-04-07

Creator: Simon Spiegel

Private: No

Ok, I realize this is probably crazy, but who knows ...

One feature I really like about Acrobat is the OCR feature. I often have PDFs of scanned text. Acrobat can do OCR on them and add the text as an "invisible" layer to the original PDF. So you end up with a scanned PDF which is, at the same time, searchable and where you can copy text.

There are several open source OCR projects. I don't know anything about their quality, so I'll just list them here:

Tesseract http://sourceforge.net/projects/tesseract-ocr
Ocrad http://www.gnu.org/software/ocrad/ocrad.html
GOCR aka JOCR, http://www-e.uni-magdeburg.de/jschulen/ocr/
Clara http://www.claraocr.org/

Discussion

Christiaan Hofman - 2007-05-18

priority: 5 --> 3

Christiaan Hofman - 2007-05-24

Logged In: YES
user_id=1162009
Originator: NO

Ocrad & GOCR/JOCR are GPL, so incompatible with Skim's BSD licence.

Clara seems to be just a list of OCR and related projects, many of them commercial. The quick look I took over it didn't give me much info (it's the worst site ever, it's not clear to me what they want to convey there).

As for Tesseract, I saw several posts noting that it doesn't compile on Mac OS X. Couldn't find much documentation on their site. But it seems to be the best around (even though lacking).

Anyway, this seems like a lot of work. One problem is that even if we were able to incorporate such a tool, we don't have too much control over the view layout, as PDFKit does that for us, and is far from transparent. So we cannot (easily) add views to overlay the PDF like Acrobat does.


Nobody/Anonymous - 2008-03-12

Logged In: NO

I readily support this feature, too. DevonThink Office Pro does this and it is a godsend for those of us that often get scanned docs.


Christiaan Hofman - 2008-03-12

priority: 3 --> 2

Christiaan Hofman - 2008-04-06

priority: 2 --> 1

Nobody/Anonymous - 2008-06-06

Logged In: NO

Ocropus seems to be a promising OSS OCR project. It is a layout analysis engine which supports multiple character recognition engines (right now, they only have the Tesseract plugin). The code is released under APL 2.0 and the project is very active. In addition, it compiles on OS X Leopard (http://groups.google.com/group/ocropus/msg/1c398cbf593105a9).


Nobody/Anonymous - 2008-06-06

Logged In: NO

For more flexible PDF handling, have a look at libharu: http://libharu.sourceforge.net/ It is licensed under the ZLIB/LIBPNG license, which is compatible with the BSD license.


Peter - 2008-06-11

Logged In: YES
user_id=2114295
Originator: NO

I fully second this comment. In fact, I came on these boards with the explicit purpose of requesting this. Currently, I have to use Acrobat to perform any sort of OCR. I get a LOT of PDFs from my library that they scan but don't bother to run OCR on, and since I take all my notes in Skim, it only makes sense to include an OCR option. Pretty please?


Christiaan Hofman - 2008-06-11

Logged In: YES
user_id=1162009
Originator: NO

For the record: it's almost certainly impossible to offer an OCR feature. Sorry.


bjbook - 2008-07-01

Logged In: YES
user_id=1217079
Originator: NO

One more "me too". Built-in OCR, even if less precise than dedicated OCR, is an incredibly powerful tool for PDF usage.

Another potential codebase to operate with could be DigitEyeOCR.

<http://www.digiteyeocr.free.fr/downloads.php?lang=en>


Christiaan Hofman - 2008-11-03

status: open --> closed

sf221sf221 - 2023-07-26

Can the impossibility of this request be revisited in view of macOS' Live Text feature that was introduced in Monterey? A project mentioned in that link uses the VNRecognizeTextRequest API.

Last edit: sf221sf221 2023-07-26

- Christiaan Hofman - 2023-07-26
  
  No. This is a question for Apple to apply in PDFKit.
  
