Please delete the old 0.01 version, everything has changed.
I did not have time to put it up on the web, YET, so you'll have
to play the same game as last time:
1) Open http://www.repairfaq.org/cgi-bin/United.cgi
2) Click on button for list of "all FAQ's available"
3) Pick "Doxyfied Tessearct-1.02 documentation" & click DOWNLOAD
4) enter number
5) ONLY the download from the RFO mirror works
* I will try to get a patch against 1.02. It will be big as just about all
the comments in the source have been doxyfied.
* We have the beginnings of a glossary.
* Almost all the classes are documented. But I'm always updating whatever I
happen to be hacking, so expect these to change for the better.
* I will upload the contents of the testing/ directory.
Hacking Tesseract V0.03
Introduction to hacking Tesseract v1.02
* First, the tesseract 1.02 source tree includes A LOT of code that is not
needed to carry out OCR from an end-user's point of view. For example, there is
code for:
1. Displaying the OCR process to the user and interacting with the user. This is not
working right now. We really need a developer familiar with wxWindows to have
a look and beat the windowing functions to work with wxWindows. This would
add a lot of 'appeal' to tesseract! In addition, there has been confirmation
that the API works (patches have been posted).
2. Adaptive matcher and training code. I am under the impression that Ray Smith
is currently working on the training code. Due to how character templates are
used in the recognition process, Mr. Smith will need to complete his work
before any languages other than English/ASCII can be supported.
3. There also appears to be code from previous 'generations' of tesseract, or
maybe from a future version that never got completed?
* Second, the features and their extraction from blobs (and ONLY that part) are
covered under http://www.freepatentsonline.com/5237627.html This link will
give you all the gory details, or, if you're off-line, it's under docs/
directory as FeatureExtraction_patent_5237627.pdf. The down-side is that the
document is written in legalese/patenteese which makes it a tedious read. I
have tried to reference columns and rows from the patent in the comments
within the sources. Please do the same if you can!
* Third, the sources have been marked up with http://www.doxygen.org/
compatible comments. I did this with several hacked perl scripts which turned
existing C++ comments into something doxygen likes. I also marked up by hand
some of the functions I was trying to understand. Please add your
documentation this way too.
* Fourth, the Glossary needs some work. The page you're reading now comes from
tesseractmain.cpp in ccmain/
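To show what doxygen-compatible markup with a patent reference might look like, here is a minimal sketch. The function count_black_pixels() is hypothetical and is NOT part of the tesseract sources, and the column/line numbers in the note are placeholders, not real references into the patent.

```cpp
#include <cstddef>
#include <vector>

/**
 * @brief Count the black pixels in one row of a binary image.
 *
 * Hypothetical helper used only to illustrate the comment style;
 * it does not exist in the tesseract sources.
 *
 * @note If a function implements something described in the patent,
 *       cite the place here, e.g. "See patent 5237627, col. X, lines Y-Z".
 *       X, Y and Z are placeholders, not real references.
 *
 * @param row  Pixels of one image row; 0 is white, non-zero is black.
 * @return Number of black pixels in @p row.
 */
std::size_t count_black_pixels(const std::vector<unsigned char>& row) {
  std::size_t count = 0;
  for (unsigned char pixel : row) {
    if (pixel != 0) {
      ++count;
    }
  }
  return count;
}
```

The @brief line is what doxygen shows in summary lists, so it pays to keep it to one sentence and put the details below it.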
How Tesseract Works: What's going on
I encourage you to keep the following list in mind when doing your own hacking
and help me add more relevant details. If possible, please reference a
TEXT_VERBOSE letter or provide function(s) doing key work.
* Lines are read in from scanned image, in edge detection, e
* Black pixels are split into blobs, aka edge detection, e
* Blobs are processed to extract outlines, in edge detection, e
* Lines are derived from strings of blobs with outlines, l
* Gradient/rotation of page is calculated, q
* Lines are adjusted for skew, m
* Final touches on assigning blobs, now that lines are KNOWN, underlines: u
* Higher-level procedure to order blobs into words, j
* Blobs in lines are segmented into words, t
* Fine-tuning of vertical seams/splits between some blobs, spacing: s
* Classification of features in letters of all words performed, o
* Words are checked in dictionary and permuter to improve them, p
* Play with xht (height of letter 'x') for words, h
* Words are fitted to lines and assigned to rows that fit them best, r
* Quality of words and letters is checked, v and y
* Words are written out to .txt file, w
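If you capture the letters tesseract prints, the list above can be turned into a small lookup table. The following is only a sketch: stage_names() and decode_trace() are hypothetical helpers (not tesseract functions), and the stage descriptions are my paraphrase of the list above.

```cpp
#include <map>
#include <string>
#include <vector>

// Letter -> stage, paraphrased from the list above (hypothetical helper).
const std::map<char, std::string>& stage_names() {
  static const std::map<char, std::string> names = {
      {'e', "edge detection (read lines, split blobs, extract outlines)"},
      {'l', "derive lines from blobs"},
      {'q', "compute page gradient/rotation"},
      {'m', "adjust lines for skew"},
      {'u', "process underlines"},
      {'j', "order blobs into words"},
      {'t', "segment blobs in lines into words"},
      {'s', "fine-tune seams/splits and spacing"},
      {'o', "classify features"},
      {'p', "dictionary and permuter"},
      {'h', "play with x-height"},
      {'r', "fit words to rows"},
      {'v', "check word quality"},
      {'y', "check letter quality"},
      {'w', "write .txt output"},
  };
  return names;
}

// Translate a string of trace letters into stage descriptions.
// Letters that are not in the table are reported as "unknown".
std::vector<std::string> decode_trace(const std::string& trace) {
  std::vector<std::string> stages;
  for (char c : trace) {
    const auto it = stage_names().find(c);
    stages.push_back(it != stage_names().end() ? it->second : "unknown");
  }
  return stages;
}
```

For example, decode_trace("qm") gives the two stages for page gradient and skew adjustment, in that order.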
* main() of tesseract
* tess_api API of tesseract
* recog_all_words Does the multi-pass recognition of words
* ocrshell.cpp Code for the OCR side of the OCR API
* ocrclass.h Class definitions and constants for the OCR API
* debugwin.cpp Portable debug window class (Unix, Mac, NT)
* werdit.cpp An iterator for passing over all the words in a document
* charcut.cpp Code for character clipping
(just adding them here for now, will organize it later!)
* heuristics_garbage How tess determines if something's garbage
* fixxht.cpp Improve x_ht and look out for case inconsistencies
* fixspace.cpp Explore alternative spacing possibilities, neato
* docqual.cpp Document Quality Metrics
* reject.cpp Rejection functions used in tessedit
* improve_row_threshold Recognizing a "normal line"
* reduced_box_for_blob Problems with reducing BB for blobs
* choose_best_seam How seams are chosen
Tips and Hints
Tess has comments, sometimes in big blocks, scattered within the code. Please add
any others you find!
* eval_word_spacing On how word spacing is derived
* tess_walkingblobs How tess decides which blobs to reject
* imgscale.cpp Dynamic programming for smart scaling of images
* set_unlv_suspects On setting up rejection
* tess_list_functions Info on how lists are used
* tess_endian_note Note on big-endian vs little-endian
* tess_transitions A note on transitions, Recognising transitions between
bands, and Find end of region containing band
* tess_gap_map Block gap map
* tess_fuzzy_spaces Handling fuzzy spaces
* rejctmap.h Notes on operating on reject map
* tessbox.cpp Resaljet
* control.cpp Adaptation and Rejection control
* recog_word Recognize words
Be sure to check out the link to "Related Pages" (in left frame).
Working out how Tesseract works
This section lists the sequence of events that tesseract 1.02 executes to convert
the input image 'scan.tif' into the output ASCII file 'scan.txt'. If you notice
something wrong, please post corrections on sourceforge.net.
By the way, if you define TEXT_PROGRESS you will get a period ('.') when
tesseract finds a seam between words, which gives you a good idea that it DID NOT
If you ALSO define TEXT_VERBOSE, key functions in tesseract will print one
character that shows you what is going on, ie: what is tesseract doing at any
point. See next section for what those letters are and what they mean.
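As a rough idea of how such compile-time switches can be wired up, here is a minimal sketch. It is NOT the code from the tesseract sources: progress_char() and report_progress() are hypothetical helpers, and I am assuming that with only TEXT_PROGRESS defined just the seam step reports (as a period), while with both defines every instrumented step reports its own letter.

```cpp
#include <cstdio>

// Normally these come from the compiler command line
// (-DTEXT_PROGRESS -DTEXT_VERBOSE); they are defined here only so the
// sketch is self-contained.
#define TEXT_PROGRESS
#define TEXT_VERBOSE

// Character to report for a step, or '\0' if nothing should be printed.
// 'letter' is the step's TEXT_VERBOSE letter; 'is_seam' marks the seam step,
// which is assumed to be the only one reporting under TEXT_PROGRESS alone.
inline char progress_char(char letter, bool is_seam) {
#if defined(TEXT_PROGRESS) && defined(TEXT_VERBOSE)
  (void)is_seam;
  return letter;
#elif defined(TEXT_PROGRESS)
  (void)letter;
  return is_seam ? '.' : '\0';
#else
  (void)letter;
  (void)is_seam;
  return '\0';
#endif
}

// Print the progress character, if any, and flush so it shows up at once.
inline void report_progress(char letter, bool is_seam) {
  const char c = progress_char(letter, is_seam);
  if (c != '\0') {
    std::fputc(c, stdout);
    std::fflush(stdout);
  }
}
```

Because the decision is made by the preprocessor, a build without either define pays nothing at run time beyond an empty function call.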
There is also a separate file that has stack traces for some interesting/common
functions RUNNING; see How Tesseract Works: Procedure stack traces. Together
with TEXT_VERBOSE, these will give you a way to play with tesseract without
necessarily being a C++ wizard, per se :-)
What do all those letters for TEXT_VERBOSE mean?
If you define TEXT_VERBOSE in addition to TEXT_PROGRESS, instead of a period, you
will get other letters which are defined as follows:
* a =
* b =
* c =
* d =
* e = Reading & scanning line of image for edges, building outlines, in
* f =
* g = Loading DAWGs ('word-dawg'+'user-dict'), in init_permute()
* h = Playing with xht for one word, in re_estimate_x_ht()
* i =
* j = Arranging blobs into words, make_words()
* k = Initializing speckle params, in InitSpeckleVars()
* l = Assigning blobs to one line, in assign_blobs_to_rows()
* m = Fitting LMS line to a row, in fit_parallel_lms()
* n = Computing linespacing and offset, delete_non_dropout_rows()
* o = Extracting outlines for a class NOT SEEN BEFORE, in
* p = Using DAWG to improve a word, in dawg_permute_and_select()
* q = Computing gradient of whole page, in compute_page_skew()
* r = Assembling recognized blobs into rows, in make_rows()
* . or s = Found good seam between words to split a blob, in
* t = Finding optimal segmentation, in check_pitch_sync2()
* u = Processing underlines, in separate_underlines()
* v = Checking quality of words, in word_blob_quality()
* w = Writing output of recognition, in output_pass()
* x = Expanding rows to touch neighbors, in expand_rows()
* y = Checking quality of characters in words, in word_char_quality()
* z = Evaluating word spacing, in eval_word_spacing()
* "o" will not print for EVERY letter because tesseract only needs to see it
once. Thus, if any letters repeat and are very similar in appearance, ie. are
not messed up in some way by noise, an "o" will only appear for the FIRST
occurrence of that letter. ex: "PERLLREP" palindrome will only print four "o"s.
* "r" also prints a new-line (\n)
* If you want to add a new character, make sure that it doesn't print too
often, preferably once per some logical unit like "per word" or "per
outline". If we're out of lower-case letters, start using upper-case but
avoid ambiguous letters like upper-case 'i' ("I" vs "l"?), etc.
* Because tesseract calls functions based on 'difficulties' encountered in the
image, you may get different set of letters for different images, but the
overall structure should be the same.
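The "print once per class NOT SEEN BEFORE" behaviour of "o" can be sketched with a set of already-seen classes. This is only an illustration, assuming that every distinct clean letter counts as one class; SeenClasses and trace_for() are hypothetical names, not tesseract code.

```cpp
#include <set>
#include <string>

// Remembers which classes (here simply characters) were already seen.
struct SeenClasses {
  std::set<char> seen;

  // Returns true only the first time 'c' is observed.
  bool first_sighting(char c) { return seen.insert(c).second; }
};

// Build the string of 'o' characters that would be printed for 'text',
// one per class that has not been seen before.
std::string trace_for(const std::string& text, SeenClasses& state) {
  std::string trace;
  for (char c : text) {
    if (state.first_sighting(c)) {
      trace += 'o';
    }
  }
  return trace;
}
```

With a fresh SeenClasses, "PERLLREP" contains the four distinct letters P, E, R and L, so trace_for() returns four 'o' characters; feeding "PERL" to the same state afterwards returns an empty string.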
To give you an idea of what you'd get, below you can see what happened when I ran
tesseract on a file from the 'testing/' directory. I generated it with 'pbmtext'
using the included '2helvR18.bdf' font. Other tools used were pgmtopbm and
The input text was the tesseract License (See testing/Run_Tests.sh for more
This package contains the Tesseract Open Source OCR Engine.
Orignally developed at Hewlett Packard Laboratories Bristol and
at Hewlett Packard Co, Greeley Colorado, all the code
in this distribution is now licensed under the Apache License:
** Licensed under the Apache License, Version 2.0 (the "License");
** you may not use this file except in compliance with the License.
** You may obtain a copy of the License at
** Unless required by applicable law or agreed to in writing, software
** distributed under the License is distributed on an "AS IS" BASIS,
** WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
** See the License for the specific language governing permissions and
** limitations under the License.
Again, please note that different output is generated using different fonts
because the letters in the image will 'interfere' differently and the
word-spacing will differ. Also, different fonts have different features so that
phase will also differ!
(I wrapped the output with 'fold -w 76')
GNU gdb Red Hat Linux (6.0post-0.20040223.19rh)
gkTesseract Open Source OCR Engine
Opened and reading 'testing/image_2helvR18.tif'...
Program exited normally.
BTW, I'm running an ancient Fedora 2 release; time to upgrade! :-)
Can someone tell me what, in the context of tesseract, a "Pruner" is?
For example, what does make_config_pruner() do and why?
I will look into this and get back with you tomorrow unless someone else answers it first.
By the way, your documentation is very useful. Fills in some missing pieces.
make_config_pruner() is never called, so it is irrelevant.
The class pruner is a pre-classifier that is used to create a short-list of classification candidates (pruning the possible classes) so that the full distance metric can be calculated on the short-list without taking excessive time, instead of exhaustively matching against each character possibility. The class pruner uses a faster, but approximate method of matching the features, so while it does make mistakes, the mistakes are rare.
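To make the short-list idea concrete, here is a minimal two-stage sketch. It is NOT tesseract's class pruner: the Template struct, the distance functions and classify() are all hypothetical, and the "cheap" metric (comparing only the first feature) is just a stand-in for a fast approximate match. Feature vectors are assumed to be non-empty and of equal length.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <numeric>
#include <vector>

struct Template {
  char label;
  std::vector<double> features;
};

// Fast, approximate distance: looks at the first feature only.
double coarse_distance(const std::vector<double>& a, const std::vector<double>& b) {
  return std::fabs(a.front() - b.front());
}

// Full distance: sum of squared differences over all features.
double full_distance(const std::vector<double>& a, const std::vector<double>& b) {
  double sum = 0.0;
  for (std::size_t i = 0; i < std::min(a.size(), b.size()); ++i) {
    sum += (a[i] - b[i]) * (a[i] - b[i]);
  }
  return sum;
}

// Stage 1 prunes to 'shortlist_size' candidates with the coarse metric;
// stage 2 runs the full metric on the short-list only.
char classify(const std::vector<Template>& templates,
              const std::vector<double>& sample,
              std::size_t shortlist_size) {
  if (templates.empty() || shortlist_size == 0) return '\0';
  std::vector<std::size_t> order(templates.size());
  std::iota(order.begin(), order.end(), std::size_t{0});
  const std::size_t keep = std::min(shortlist_size, order.size());
  std::partial_sort(order.begin(),
                    order.begin() + static_cast<std::ptrdiff_t>(keep),
                    order.end(),
                    [&](std::size_t a, std::size_t b) {
                      return coarse_distance(templates[a].features, sample) <
                             coarse_distance(templates[b].features, sample);
                    });
  char best = '\0';
  double best_distance = std::numeric_limits<double>::infinity();
  for (std::size_t i = 0; i < keep; ++i) {
    const Template& candidate = templates[order[i]];
    const double d = full_distance(candidate.features, sample);
    if (d < best_distance) {
      best_distance = d;
      best = candidate.label;
    }
  }
  return best;
}
```

The trade-off is the one described above: if the coarse metric wrongly drops the true class from the short-list, the full metric never gets a chance to pick it, which is why the pruner's mistakes, though rare, cannot be undone later.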
I didn't see make_config_pruner() called, either, but there are some structures that
refer to it. Thanks for the answer, it resolves some fuzzy areas in the code for me.
Actually, I have a bunch more pointed questions. I won't ask them all at the same time,
don't worry :-)
I have two questions. First, what is a "configuration", a la ConvertConfig(), and where does it fit into the scheme of things?
I am trying to understand which parts of tesseract are for the recognition-end (ie. what an end-user might interact with) and design-end (training, adaptation, etc.). Sometimes it's hard for me to see which category any particular function falls under - without firing up gdb and putting a break on it :-) (but that has its own set of problems).
The second question is easy. Is the Starbase side FUNCTIONAL, or was it still in the process of development? That is, if I were to tie in an sb server clone, would tesseract fire up windows and show me various stages (say, segmentation)? I do suspect some defines need to be tweaked to get it to fire up.
I cannot download this document anymore; all the links are dead. Please re-upload it.