OCR_Manager

OCR Manager -- Convert Images to Searchable PDF

Written by

Barry Stanly

Printed on November 30, 2023

Copyright 2021 by Barry Stanly

Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License, Version 1.3 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is available at: https://www.gnu.org/copyleft/fdl.html.

STF (documentation system) is hosted at https://sourceforge.net/p/simpletextformatter

The PDF version of this document may be downloaded from SourceForge:
https://sourceforge.net/projects/windows-power-utilities/files/OCR_Manager/OCR_Manager.pdf/download

Because of the limitations of the Wiki formatter, the font size has been increased to make this document easier to read. The PDF version should be used if the font seems too big.


Preface

OCR_Manager originally had an installer that did everything: it installed all the scripts, set up the links, installed a help file, and installed an un-installer. It turned out that embedding the help file (in PDF format) and the un-installer (an exe file) in the installer triggered a typical PC's antivirus program.

There are two ways to mitigate this problem: 1) buy a software certificate and go through a certification process, or 2) use a less ambitious install program. The former costs money (and effort); the latter is much simpler, so I opted for the latter.

The installer now installs the scripts, the links, and a web reference to a wiki help file (not as nice as PDF, but adequate). The un-installer is left as a separate download. If you want to un-install OCR_Manager, download and run the un-installer.

Barry Stanly
Henderson NV, 2021


C O N T E N T S

1 Introduction

2 Installation
2.1 Recommended Installation Order

3 Usage
3.1 Examples
3.2 In Case Of Trouble
3.3 Installation Problems
3.4 Adjusting the Windows' Environment
3.5 References

4 License

T A B L E S

Table 3-I Supported Actions
Table 3.4-II Windows' OCR Environment Variables

F I G U R E S

Figure 3-1 W10:Converting the Clipboard to Searchable PDF
Figure 3-2 W11:Converting the Clipboard to Searchable PDF



1 Introduction

This document describes the OCR_Manager, which is a set of scripts for converting bit-mapped text images to searchable PDF. The utilities convert a clipboard image, an image PDF, or graphical image files to searchable PDF. Corresponding utilities convert images to plain text files.

Image PDF means that text characters are in graphical format and cannot be extracted as text; searchable PDF means that text characters are stored in character form and can be extracted as text. So, for example, to convert a screen image containing text characters to readable text, copy the desired section to the clipboard and use OCR_Clip2Pdf to convert it to actual text.

The heart of the process is Tesseract-OCR, which is a free, open-source Optical Character Recognition (OCR) program, hereinafter referred to as TOCR. TOCR seems to perform best in creating searchable PDF rather than plain text, so even though utilities are included to create plain text, searchable PDF is recommended. To obtain plain text from PDF, save the resulting PDF file in text form or use copy and paste to obtain the desired machine-readable text.

The OCR_Manager installer may be downloaded from SourceForge:
https://sourceforge.net/projects/windows-power-utilities/files/OCR_Manager/

  • Install_OCRmanager.exe is the installer. Download and run it to install OCR_Manager.

  • OCR_Manager.pdf is a PDF version of this document. Read this document prior to installing OCR_Manager.

  • OCR_ManagerSource.zip is the source code and scripts necessary to build the OCR_Manager installer.

  • UnInstall_OCRmanager.exe is the OCR Manager un-installer -- download and run this program to un-install the OCR Manager.


2 Installation

There are three parts to installing the OCR Manager:

  1. Ghostscript. Ghostscript(1), also free, is used to convert image PDF to Tag Image File Format (TIFF). TIFF is processable by TOCR, while image PDF is not. So the first step in converting image PDF to text is to convert it to TIFF.

  2. Tesseract-OCR (TOCR). TOCR is the program that converts bit mapped text images to text.

  3. OCR Manager. The OCR Manager is a set of scripts that manage the conversion process. It is set up to operate with a prearranged file structure. The details of how it operates are described in Windows Power Tips. The first few chapters of Windows Power Tips are of interest to general audiences; the rest is mostly of interest to developers and power users.
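
The division of labor among the three parts can be sketched in Python. This is only an illustration under stated assumptions, not the actual OCR Manager scripts: the function names are hypothetical, the Ghostscript device (tiffg4) and the 300 dpi resolution are choices made for the example, and the real scripts may use different options. The output name follows the File.pdf to Files.pdf pattern shown in section 3.1.

```python
import subprocess
from pathlib import Path


def build_commands(image_pdf, gs_exe, tesseract_exe, dpi=300):
    """Return the Ghostscript and Tesseract command lines as two lists.

    Hypothetical helper: the names and option choices are assumptions
    made for this sketch, not taken from the OCR Manager scripts.
    """
    src = Path(image_pdf)
    tiff = src.with_suffix(".tif")             # intermediate TIFF file
    out_base = src.with_name(src.stem + "s")   # Tesseract appends ".pdf"

    # Step 1: Ghostscript renders the image PDF to a (multi-page) TIFF.
    gs_cmd = [
        str(gs_exe), "-dNOPAUSE", "-dBATCH", "-dQUIET",
        "-sDEVICE=tiffg4", f"-r{dpi}",
        f"-sOutputFile={tiff}", str(src),
    ]
    # Step 2: Tesseract reads the TIFF and writes a searchable PDF.
    ocr_cmd = [str(tesseract_exe), str(tiff), str(out_base), "pdf"]
    return gs_cmd, ocr_cmd


def convert(image_pdf, gs_exe, tesseract_exe):
    """Run both steps in order; raises CalledProcessError on failure."""
    for cmd in build_commands(image_pdf, gs_exe, tesseract_exe):
        subprocess.run(cmd, check=True)
```

For example, build_commands("Book.pdf", "gswin64c", "tesseract") yields a Ghostscript command that writes Book.tif and a Tesseract command that writes Books.pdf. On an installed system the two program paths would normally come from the OCR_GS and OCR_PGM environment variables described in section 3.4.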


2.1 Recommended Installation Order

The first step is to install Ghostscript and TOCR.

Ghostscript may be downloaded from https://ghostscript.com. Noncommercial users should use the GNU Affero General Public License. Follow the installation instructions from the site. A free version of Ghostscript is available for commercial users on Linux platforms; otherwise, license the Windows version from Artifex Software Inc. (or use a different program to convert image PDF to TIFF format).

TOCR may be downloaded from https://github.com/UB-Mannheim/tesseract/wiki. Follow the installation instructions from the site.

The second step is to install the OCR Manager.

Install_OCRmanager.exe is an installer that does all the setup, creates the necessary folders, and installs the OCR Manager scripts. Double-click on Install_OCRmanager.exe after the previous two programs have been installed. If there are problems, see paragraph 3.3.

Install_OCRmanager saves the scripts in %APPDATA%\NIC\OCR_Manager. It also adds NIC_OCR and its entries to the Start menu, with corresponding shortcuts added to the Send To menu(2).


3 Usage

After all the installations are completed, the OCR Manager is invoked by right-clicking on a file and using the Windows Send To option to send the file to the OCR Manager. See Figure 3-1 for an example.

To OCR the clipboard, right-click on any file and choose OCR_Clip2Pdf. The converted file will have the same name as the chosen file with "c.pdf" appended.
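
The naming rule can be illustrated with a short Python sketch. The helper name is hypothetical, and the sketch assumes that the original extension is dropped before the tag letter is appended. Section 3.1 shows this for image and PDF input (File.jpg becomes Filei.pdf, File.pdf becomes Files.pdf); for clipboard output it is an assumption made by analogy.

```python
from pathlib import PureWindowsPath


def output_name(chosen_file, tag):
    """Build the output PDF name from the chosen file and a tag letter.

    tag is "c" for clipboard output, "i" for image input, "s" for
    searchable output from an image PDF. Hypothetical helper; dropping
    the extension in the clipboard case is an assumption.
    """
    path = PureWindowsPath(chosen_file)
    return str(path.with_name(path.stem + tag + ".pdf"))
```

Under these assumptions, output_name(r"C:\Docs\Notes.txt", "c") gives C:\Docs\Notesc.pdf.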


Figure 3-1 W10:Converting the Clipboard to Searchable PDF


Figure 3-2 W11:Converting the Clipboard to Searchable PDF


Table 3-I Supported Actions

Action                         Script              Comment

Image PDF to Searchable PDF    OCR_iPdf2sPdf
Clipboard to Searchable PDF    OCR_Clip2Pdf
Clipboard to Text              OCR_Clip2Txt
Image to Searchable PDF        OCR_Image2Pdf
Image to Text                  OCR_Image2Txt
Image set to PDF               OCR_ImageSet2Pdf    All files in the folder are processed
Image set to Text              OCR_ImageSet2Txt    All files in the folder are processed


3.1 Examples

  1. Convert JPEG file to PDF: OCR_Image2Pdf File.jpg
    Creates Filei.pdf

  2. Convert PNG file to PDF: OCR_Image2Pdf File.png
    Creates Filei.pdf

  3. Convert several graphic files to PDF: OCR_ImageSet2Pdf File.png
    Creates a set of PDF files, one for each file in the folder. All files in the folder are assumed to be graphic files (different types, such as JPG and PNG, may be intermixed). The PDF files may be combined into a single PDF document using the free PDF Merger & Splitter app, which may be installed from the Windows store.

  4. Convert image PDF files to Searchable PDF: OCR_iPdf2sPdf File.pdf
    Creates Files.pdf. If an error message is generated, split the PDF file into two equal parts and run OCR_iPdf2sPdf on each half. Continue splitting the file until it runs through without error. In one actual case, a 600-page image PDF technical book had to be split into four files of 150 pages each before the files were small enough to be processed without error. The free PDF Merger & Splitter app from the Windows store was used to do the splitting and recombining.

  5. Convert non-text images to image PDF.
    It turns out that images that do not contain any text get converted to image PDF. So if you have an image (or a set of images) that you wish to convert to PDF, run OCR_Image2Pdf (or OCR_ImageSet2Pdf) and each image will have its own corresponding PDF file.
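
The split-in-half strategy described for large image PDF files is done by hand in this manual, but the logic can be sketched in Python. This is a minimal sketch, not part of OCR_Manager: try_ocr is a hypothetical callback that is assumed to process a page range and return True on success, and the page numbers are 1-based.

```python
def ocr_with_splitting(first_page, last_page, try_ocr):
    """Return the page ranges that were processed without error.

    try_ocr(first, last) is a hypothetical callback that attempts the
    conversion of one page range and returns True if it succeeded.
    """
    if try_ocr(first_page, last_page):
        return [(first_page, last_page)]

    page_count = last_page - first_page + 1
    if page_count <= 1:
        # A single page cannot be split any further.
        raise RuntimeError(f"page {first_page} could not be processed")

    # Split into two (nearly) equal halves and process each half.
    middle = first_page + page_count // 2 - 1
    return (ocr_with_splitting(first_page, middle, try_ocr)
            + ocr_with_splitting(middle + 1, last_page, try_ocr))
```

With a callback that succeeds only for ranges of 150 pages or fewer, a 600-page file ends up as the four ranges 1-150, 151-300, 301-450, and 451-600, which matches the case described in example 4.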


3.2 In Case Of Trouble

TOCR works well in converting bit-mapped images to PDF. It does seem to work better if there is white space around the text to be converted, i.e. try to use margins at least as wide as a typical character. It also sometimes helps to enlarge the image, because large letters are recognized more reliably than small letters. Highlighting can also confuse the OCR, causing the text to remain as an image.



3.3 Installation Problems

  1. Install_OCRmanager.exe may be falsely blocked by your anti-malware program. When this happens, restore Install_OCRmanager.exe and declare it as not malware (or disable your anti-malware app for the installation and then re-enable it).

  2. TOCR or Ghostscript may also be blocked by your anti-malware app. When this happens, it is necessary to declare TOCR or Ghostscript as not malware, or you won't be able to convert image text to searchable PDF.

  3. The location at which TOCR is stored may change across versions. This is unusual in that the TOCR configuration seems to be relatively stable. But if it does move, the Windows' Environment will have to be updated to point to the new location.

  4. Ghostscript changes its location with each new version. So it is necessary to adjust the Windows' Environment with each upgrade. Install_OCRmanager is usually successful in making the necessary adjustments, so if you upgrade Tesseract-OCR or Ghostscript and the OCR_Manager starts failing, try reinstalling the OCR_Manager. See paragraph 3.4 for information on how to manually make changes to the Windows' Environment.
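
Because the Ghostscript folder name carries the version number, locating the newest installed version is mostly a matter of comparing folder names. The following Python sketch shows one way to do that. It is not how Install_OCRmanager works internally (that is not documented here); the function names are hypothetical, and the folder layout is assumed from the typical OCR_GS value in table 3.4-II.

```python
import ntpath
import re


def newest_ghostscript(folder_names):
    """Pick the highest-versioned name from names such as 'gs9.53.3'.

    Returns None if no name looks like a Ghostscript version folder.
    """
    best_name, best_version = None, None
    for name in folder_names:
        match = re.fullmatch(r"gs(\d+(?:\.\d+)*)", name)
        if not match:
            continue  # not a version folder, ignore it
        # Compare versions numerically, so that 10.x sorts after 9.x.
        version = tuple(int(part) for part in match.group(1).split("."))
        if best_version is None or version > best_version:
            best_name, best_version = name, version
    return best_name


def ghostscript_program(root, folder_names):
    """Build a candidate OCR_GS value (assumed layout: <root>\\gsX.Y.Z\\bin)."""
    name = newest_ghostscript(folder_names)
    if name is None:
        return None
    return ntpath.join(root, name, "bin", "gswin64c")
```

On Windows the folder names could be obtained with os.listdir(r"C:\Program Files\gs"); the result is a candidate value for OCR_GS that should still be checked before use.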


3.4 Adjusting the Windows' Environment

The Windows' Environment is a keyed, in-memory database that holds information on how Windows is used. There are five Environment variables used by OCR_Manager; see table 3.4-II.

Table 3.4-II Windows' OCR Environment Variables

Variable       Typical Value                                   Comment

OCR_PGM        C:\Program Files\Tesseract-OCR\tesseract.exe    Tesseract program
OCR_DATA       C:\Program Files\Tesseract-OCR\tessdata         Tesseract data folder
OCR_GS         C:\Program Files\gs\gs9.53.3\bin\gswin64c       Ghostscript program
TEMP           %LOCALAPPDATA%\Temp                             Temporary file folder
OCR_Scripts    %APPDATA%\NIC\OCR_Manager                       OCR_Manager scripts

The Windows' Environment may be manually edited by entering Environment in the search window (at the bottom of the screen) and choosing "Edit the system environment variables". There are two sets of environment variables: one for User space and one for System space. OCR_Manager normally uses the User space.
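
Before editing anything by hand, it can help to see which of the five variables are actually set. The following Python sketch is a simple check and is not part of OCR_Manager; the function name is hypothetical. It only reports variables that are missing or empty and does not verify that the paths exist.

```python
import os

# The five variables listed in table 3.4-II.
REQUIRED_VARIABLES = ("OCR_PGM", "OCR_DATA", "OCR_GS", "TEMP", "OCR_Scripts")


def missing_variables(env=None):
    """Return the names of required variables that are unset or empty.

    env defaults to the current process environment. A plain dict may be
    passed instead, in which case lookups are case-sensitive.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARIABLES if not env.get(name)]
```

Calling missing_variables() with no argument checks the environment of the current process; an empty list means all five variables are set.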


3.5 References

Sometimes it is desirable to modify the OCR Manager scripts. When doing so, the following reference manuals may prove helpful. The scripts may be located from the Start menu under NIC_OCR.

  1. Tesseract-OCR User Manual -- https://tesseract-ocr.github.io/tessdoc/

  2. Ghostscript User Manual -- https://ghostscript.com/


4 License

These scripts and the installer program are free software: you can redistribute them and/or modify them under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

These scripts and the installer program are distributed in the hope that they will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

A copy of the GNU Lesser General Public License may be downloaded from
http://www.gnu.org/licenses.

F O O T N O T E S



(1) Ghostscript is free for noncommercial use. Commercial users should either license it or change the script to use ImageMagick. ImageMagick can also convert PDF to TIFF format.

(2) The Windows Environment, Start Menu, and Send To menus are described in Windows Power Tips.

