pdfsandwich generates "sandwich" OCR PDF files: PDF files that contain only images (and no editable text) are processed by optical character recognition (OCR), and the recognized text is added to each page invisibly "behind" the images.

pdfsandwich is a command-line tool intended for OCR of scanned books and journals. It can recognize the page layout, even for multicolumn text.

Essentially, pdfsandwich is a wrapper script that calls the following binaries: convert, unpaper, tesseract, gs, and hocr2pdf (the last only if the installed tesseract is older than 3.03). It is known to run on Unix systems and has been tested on Linux and Mac OS X. It supports parallel processing on multiprocessor systems.
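
Because pdfsandwich depends on several external binaries, it can help to confirm they are installed before a first run. The following is a minimal Python sketch, not part of pdfsandwich itself: the helper names `missing_tools` and `build_command` and the file name `scan.pdf` are hypothetical, and the `-nthreads` option is an assumption about the command-line interface that should be checked against the manual before use.

```python
import shutil

# Helper binaries named in the description above. hocr2pdf is only needed
# with tesseract versions older than 3.03, so it is left out of this list.
REQUIRED_TOOLS = ["convert", "unpaper", "tesseract", "gs"]


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [name for name in tools if which(name) is None]


def build_command(pdf_path, nthreads=None):
    """Build an argument list for running pdfsandwich on one file."""
    cmd = ["pdfsandwich"]
    if nthreads is not None:
        # Assumption: -nthreads sets the number of parallel threads;
        # verify the option name in the pdfsandwich manual.
        cmd += ["-nthreads", str(nthreads)]
    cmd.append(pdf_path)
    return cmd


if __name__ == "__main__":
    absent = missing_tools()
    if absent:
        print("Missing helper binaries:", ", ".join(absent))
    else:
        # The list could be passed to subprocess.run() to start the OCR job.
        print("Would run:", " ".join(build_command("scan.pdf", nthreads=2)))
```

The sketch only prints the command it would run; passing the returned list to `subprocess.run` would start the actual job.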

In contrast to most competing sandwich programs, it preprocesses the scanned images, for example by de-skewing them and removing dark edges.

For further information please read the manual: http://www.tobias-elze.de/pdfsandwich/index.html

License

GNU General Public License version 2.0 (GPLv2)

User Ratings

5 stars: 6
4 stars: 1
3 stars: 0
2 stars: 0
1 star: 1

Ease: 4 / 5
Features: 4 / 5
Design: 4 / 5
Support: 4 / 5

User Reviews

  • The program requires Tesseract too. There are a bunch of dependency files as well. A quick Google locates them easily. I was able to go from downloading PDFsandwich to running it on the first PDF in 2 hours. It works perfectly without a single hassle. I have it running on a Raspberry Pi 3 B+ with a 32GB micro SD chip. I have about 50 files of various sizes to process. I wrote a script to have PDFsandwich process each file and then upload the searchable file to a Dropbox directory. So far, no failures, just searchable PDFs. Thank you. This utility has saved me so much time and grief.
  • I have been looking for a long time for this exact utility. I very often have a need to convert pdf files to a searchable format. Thank you for putting this together.
  • Pdfsandwich does exactly what I was always missing in Tesseract. Great little piece of software with many good ideas!
  • I have been looking for something like this
  • Excellent tool. It did exactly what I wanted - performing OCR on a PDF that I had scanned and creating a new PDF with the original image but also text that could be searched and cut/pasted. It required absolutely no effort to configure or operate.

Additional Project Details

Operating Systems

Linux, BSD

Languages

English

Intended Audience

End Users/Desktop

User Interface

Command-line

Programming Language

OCaml (Objective Caml)

Related Categories

OCaml (Objective Caml) Business Software, OCaml (Objective Caml) Command Line Tools, OCaml (Objective Caml) OCR Software

Registered

2012-05-13