
V0.03 of docs is out

2006-12-01
2013-04-25
  • Filip Gieszczykiewicz

    Please delete the old 0.01 version, everything has changed.

    I did not have time to put it up on the web, YET, so you'll have
    to play the same game as last time:

    1) Open http://www.repairfaq.org/cgi-bin/United.cgi
    2) Click on button for list of "all FAQ's available"
    3) Pick "Doxyfied Tessearct-1.02 documentation" & click DOWNLOAD
    4) enter number
    5) ONLY the download from the RFO mirror works

    * I will try to get a patch against 1.02. It will be big as just about all
    the comments in the source have been doxyfied.
    * We have the beginnings of a glossary.
    * Almost all the classes are documented. But I'm always updating whatever I
    happen to be hacking, so expect these to change for the better.
    * I will upload the contents of the testing/ directory.

    I will try to get it online. Here's the main page, as a teaser :-)

                                   Hacking Tesseract V0.03

                                             1.02

    Introduction to hacking Tesseract v1.02

         * First, the tesseract 1.02 source tree includes A LOT of code that is not
           needed to carry out OCR from the point of view of an end-user. For example,
           code is included for:

        1. Displaying the OCR process for the user and interacting with the user. This
           is not working right now. We really need a developer familiar with wxWindows
           to have a look and beat the windowing functions to work with wxWindows. This
           would add a lot of 'appeal' to tesseract! In addition, there has been
           confirmation that the API works (patches have been posted).
        2. Adaptive matcher and training code. I am under the impression that Ray Smith
           is currently working on the training code. Due to how character templates are
           used in the recognition process, Mr. Smith will need to complete his work
           before any languages other than English/ASCII can be supported.
        3. There also appears to be code from previous 'generations' of tesseract, or
           maybe from a future version that never got completed?

         * Second, the features and their extraction from blobs (and ONLY that part) are
           covered under the patent at http://www.freepatentsonline.com/5237627.html
           This link will give you all the gory details, or, if you're off-line, it's
           under the docs/ directory as FeatureExtraction_patent_5237627.pdf. The
           down-side is that the document is written in legalese/patenteese, which makes
           it a tedious read. I have tried to reference columns and rows from the patent
           in the comments within the sources. Please do the same if you can!

         * Third, the sources have been marked up with http://www.doxygen.org/
           compatible comments. I did this with several hacked perl scripts which turned
           existing C++ comments into something doxygen likes. I also marked up by hand
           some of the functions I was trying to understand. Please add your
           documentation this way too.

         * Fourth, the Glossary needs some work. The page you're reading now comes from
           tesseractmain.cpp in ccmain/

      How Tesseract Works: What's going on

       I encourage you to keep the following list in mind when doing your own hacking
       and help me add more relevant details. If possible, please reference a
       TEXT_VERBOSE letter or provide function(s) doing key work.

       READING INPUT
         * Lines are read in from scanned image, in edge detection, e

       EDGE DETECTION/OUTLINES
         * Black pixels are split into blobs, aka edge detection, e
         * Blobs are processed to extract outlines, in edge detection, e

       LINES/SKEW
         * Lines are derived from strings of blobs with outlines, l
         * Gradient/rotation of page is calculated, q
         * Lines are adjusted for skew, m
         * Final touches on assigning blobs, now that lines KNOWN, underlines: u

       WORDS/SEGMENTER
         * Higher-level procedure to order blobs into words, j
         * Blobs in lines are segmented into words, t
         * Fine-tuning of vertical seams/splits between some blobs, spacing: s

       CLASSIFICATION
         * Classification of features in letters of all words performed, o
         * Words are checked in dictionary and permuter to improve them, p
         * Play with xht (height of letter 'x') for words, h
         * Words are fitted to lines and assigned to rows that fit them best, r

       QUALITY
         * Quality of words and letters is checked, v and y

       WRITING OUTPUT
         * Words are written out to .txt file, w
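
       The letter-to-stage grouping above can be turned into a small lookup for reading
       a TEXT_VERBOSE trace. The following is only a sketch: phase_of() and phases_in()
       are hypothetical helper names, not functions from the tesseract sources, and the
       letter "e" (listed under both READING INPUT and EDGE DETECTION/OUTLINES) is filed
       under the latter for simplicity.

```cpp
#include <string>
#include <vector>

// Map one TEXT_VERBOSE letter to the stage heading it is listed under above.
// Letters that the list does not mention fall through to "OTHER".
const char* phase_of(char letter) {
  switch (letter) {
    case 'e': return "EDGE DETECTION/OUTLINES";  // also covers READING INPUT
    case 'l': case 'q': case 'm': case 'u': return "LINES/SKEW";
    case 'j': case 't': case 's': return "WORDS/SEGMENTER";
    case 'o': case 'p': case 'h': case 'r': return "CLASSIFICATION";
    case 'v': case 'y': return "QUALITY";
    case 'w': return "WRITING OUTPUT";
    default: return "OTHER";
  }
}

// Walk a trace and list the stages it passes through, collapsing
// consecutive letters that belong to the same stage.
std::vector<std::string> phases_in(const std::string& trace) {
  std::vector<std::string> phases;
  for (char c : trace) {
    if (c == '\n' || c == ' ') continue;  // ignore line wrapping
    const std::string phase = phase_of(c);
    if (phases.empty() || phases.back() != phase) phases.push_back(phase);
  }
  return phases;
}
```

       For example, phases_in("eelqmjtsopvyw") yields the six stage names from
       EDGE DETECTION/OUTLINES through WRITING OUTPUT, in that order. Real traces
       interleave stages more than this, and letters the list does not mention (such
       as n, x or z) come back as OTHER.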

    Entry Points

         * main() of tesseract
         * tess_api API of tesseract
         * recog_all_words Does the multi-pass recognition of words
         * ocrshell.cpp Code for the OCR side of the OCR API
         * ocrclass.h Class definitions and constants for the OCR API
         * debugwin.cpp Portable debug window class (Unix, Mac, NT)
         * werdit.cpp An iterator for passing over all the words in a document
         * charcut.cpp Code for character clipping

    Heuristics

       (just adding them here for now, will organize it later!)

         * heuristics_garbage How tess determines if something's garbage
         * fixxht.cpp Improve x_ht and look out for case inconsistencies
         * fixspace.cpp Explore alternative spacing possibilities, neato
         * docqual.cpp Document Quality Metrics
         * reject.cpp Rejection functions used in tessedit
         * improve_row_threshold Recognizing a "normal line"
         * reduced_box_for_blob Problems with reducing BB for blobs
         * choose_best_seam How seams are chosen

      Tips and Hints

       Tess has comments, sometimes in big blocks, scattered within the code. Please add
       any others you find!

         * eval_word_spacing On how word spacing is derived
         * tess_walkingblobs How tess decides which blobs to reject
         * imgscale.cpp Dynamic programming for smart scaling of images
         * set_unlv_suspects On setting up rejection
         * tess_list_functions Info on how lists are used
         * tess_endian_note Note on big-endian vs little-endian
         * tess_transitions A note on transitions, Recognising transitions between
           bands, and Find end of region containing band
         * tess_gap_map Block gap map
         * tess_fuzzy_spaces Handling fuzzy spaces
         * rejctmap.h Notes on operating on reject map

      Segmenters

         * tessbox.cpp Resaljet
         * control.cpp Adaptation and Rejection control
         * recog_word Recognize words

       Be sure to check out the link to "Related Pages" (in left frame).

    Working out how Tesseract works

       This section lists the sequence of events that tesseract 1.02 executes to convert
       the input image 'scan.tif' into the output ASCII file 'scan.txt'. If you notice
       something wrong, please post corrections on sourceforge.net.

       By the way, if you define TEXT_PROGRESS you will get a period ('.') when
       tesseract finds a seam between words, which gives you a good idea that it DID NOT
       hang.

       If you ALSO define TEXT_VERBOSE, key functions in tesseract will print one
       character that shows you what is going on, i.e. what tesseract is doing at any
       point. See the next section for what those letters are and what they mean.
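
       As a rough illustration of how two such defines can interact, here is a minimal
       sketch. It is not copied from the tesseract sources: trace_seam(), trace_step()
       and demo_trace() are hypothetical names, and writing to a std::ostream (rather
       than whatever the real code does) is an assumption made so the result is easy to
       inspect. The two #define lines stand in for compiler options such as
       -DTEXT_PROGRESS -DTEXT_VERBOSE.

```cpp
#include <ostream>
#include <sstream>
#include <string>

// Normally supplied by the build; defined here only to keep the sketch
// self-contained.
#define TEXT_PROGRESS
#define TEXT_VERBOSE

// Hypothetical hook for a seam-finding step: '.' with TEXT_PROGRESS alone,
// 's' once TEXT_VERBOSE is defined as well, nothing otherwise.
void trace_seam(std::ostream& out) {
#if defined(TEXT_PROGRESS) && defined(TEXT_VERBOSE)
  out << 's';
#elif defined(TEXT_PROGRESS)
  out << '.';
#else
  (void)out;
#endif
}

// Hypothetical hook for any other instrumented step: prints its letter only
// when both defines are present.
void trace_step(std::ostream& out, char letter) {
#if defined(TEXT_PROGRESS) && defined(TEXT_VERBOSE)
  out << letter;
#else
  (void)out;
  (void)letter;
#endif
}

// Simulate three steps (permute, seam, permute) and return what was printed.
std::string demo_trace() {
  std::ostringstream out;
  trace_step(out, 'p');
  trace_seam(out);
  trace_step(out, 'p');
  return out.str();
}
```

       With both defines present, demo_trace() returns "psp". Removing the TEXT_VERBOSE
       line leaves only the period from the seam step.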

       There is also a separate file that has stack traces for some interesting/common
       functions RUNNING, see How Tesseract Works: Procedure stack traces. Together
       with TEXT_VERBOSE, these will give you a way to play with tesseract without
       necessarily being a C++ wizard, per se :-)

      What do all those letters for TEXT_VERBOSE mean?

       If you define TEXT_VERBOSE in addition to TEXT_PROGRESS, instead of a period, you
       will get other letters which are defined as follows:

         * a =
         * b =
         * c =
         * d =
         * e = Reading & scanning line of image for edges, building outlines, in
           line_edges()
         * f =
         * g = Loading DAWGs ('word-dawg'+'user-dict'), in init_permute()
         * h = Playing with xht for one word, in re_estimate_x_ht()
         * i =
         * j = Arranging blobs into words, in make_words()
         * k = Initializing speckle params, in InitSpeckleVars()
         * l = Assigning blobs to one line, in assign_blobs_to_rows()
         * m = Fitting LMS line to a row, in fit_parallel_lms()
         * n = Computing linespacing and offset, in delete_non_dropout_rows()
         * o = Extracting outlines for a class NOT SEEN BEFORE, in
           ExtractOutlineFeatures()
         * p = Using DAWG to improve a word, in dawg_permute_and_select()
         * q = Computing gradient of whole page, in compute_page_skew()
         * r = Assembling recognized blobs into rows, in make_rows()
         * . or s = Found good seam between words to split a blob, in
           attempt_blob_chop()
         * t = Finding optimal segmentation, in check_pitch_sync2()
         * u = Processing underlines, in separate_underlines()
         * v = Checking quality of words, in word_blob_quality()
         * w = Writing output of recognition, in output_pass()
         * x = Expanding rows to touch neighbors, in expand_rows()
         * y = Checking quality of characters in words, in word_char_quality()
         * z = Evaluating word spacing, in eval_word_spacing()

       Notes:
         * "o" will not print for EVERY letter because tesseract only needs to see it
           once. Thus, if any letters repeat and are very similar in appearance, i.e.
           are not messed up in some way by noise, an "o" will only appear for the FIRST
           occurrence of that letter. Ex: the "PERLLREP" palindrome will only print four
           "o"s.
         * "r" also prints a new-line (\n)
         * If you want to add a new character, make sure that it doesn't print too
           often, preferably once per some logical unit like "per word" or "per
           outline". If we're out of lower-case letters, start using upper-case but
           avoid ambiguous letters like upper-case 'i' ("I" vs "l"?), etc.
         * Because tesseract calls functions based on 'difficulties' encountered in the
           image, you may get a different set of letters for different images, but the
           overall structure should be the same.
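
       The first note (one "o" per class seen for the first time) can be sketched as
       follows. This is a toy model, not tesseract code: outline_trace() is a
       hypothetical name, and it assumes every character of the word is its own class
       and that every repeat is clean enough to match the first occurrence.

```cpp
#include <set>
#include <string>

// Emit one 'o' for each character class that has not been seen before.
std::string outline_trace(const std::string& word) {
  std::set<char> seen;
  std::string trace;
  for (char glyph : word) {
    // insert().second is true only when the glyph was not already in the set.
    if (seen.insert(glyph).second) trace += 'o';
  }
  return trace;
}
```

       Under those assumptions outline_trace("PERLLREP") returns "oooo": one "o" each
       for P, E, R and L, and nothing for the four repeats.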

       To give you an idea of what you'd get, below you can see what happened when I ran
       tesseract on a file from the 'testing/' directory. I generated it with 'pbmtext'
       using the included '2helvR18.bdf' font. Other tools used were pgmtopbm and
       pnmtotiff.

       The input text was the tesseract License (see testing/Run_Tests.sh for more
       details):

    This package contains the Tesseract Open Source OCR Engine.
    Orignally developed at Hewlett Packard Laboratories Bristol and
    at Hewlett Packard Co, Greeley Colorado, all the code
    in this distribution is now licensed under the Apache License:

    ** Licensed under the Apache License, Version 2.0 (the "License");
    ** you may not use this file except in compliance with the License.
    ** You may obtain a copy of the License at
    ** http://www.apache.org/licenses/LICENSE-2.0
    ** Unless required by applicable law or agreed to in writing, software
    ** distributed under the License is distributed on an "AS IS" BASIS,
    ** WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    ** See the License for the specific language governing permissions and
    ** limitations under the License.

       Again, please note that different output is generated using different fonts
       because the letters in the image will 'interfere' differently and the
       word-spacing will differ. Also, different fonts have different features, so that
       phase will also differ!

       (I wrapped the output with 'fold -w 76')

    GNU gdb Red Hat Linux (6.0post-0.20040223.19rh)
    [blah blah]

    (gdb) r
    gkTesseract Open Source OCR Engine
    Using LIBTIFF
    Opened and reading 'testing/image_2helvR18.tif'...
    Recognizing page
    eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
    eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
    eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
    eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
    eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeer
    lqmmmmmmmmmmmmmnxlmmmmmmmmmmmmmlllmmmmmmmmmmmmmummmmmmmmmmmmmjtttttttttttttt
    tttttttttttttttttttttttttttttttr
    pppoooopppoooooopppooopppppppppspppppppppppppppppppppopppooopppsppppppoor
    pppspppppppppppppppooopppppppppsppppppppppppspppspppppppppppppppoopppspppppp
    ppppppsspppr
    pppppppppsppppppppppppspppspppppppppppppppspppoopppoopppsspppppppppppppppppp
    pppr
    pppppppppppppppopppppppppppppppppppppr
    pppppppppspppppppppppppppppppppppppppopppoopppppppppor
    pppppppppopppppppppppppppppppppopppsppppppopppppppppppppppppppppr
    ppppppsoppppppppppppppppppppppppppppppppppppr
    ppppppspppppppppppppppppppppppppppppppppppppppsssspppppppppppppppppppppppppp
    ppppppppppppppppppppppppppppppppppppppppppppppppppppppppppr
    ppppppopppoppppppppppppppppppppppppppppppsppppppspppr
    pppppppppspppppppppppppppppppppppppppppppppppppppppppppppppppooor
    ppppppoopppooppppppopppopppppppppoppppppspppspppppppppppppppr
    pppppppppppppppppppppspppppppppppppppppppppsspppr
    pppppppppppppppppphhhpppssppphpppspppsssspppppppppppppppppphhhpppsppphhpppss
    ppppppppphhpppsppphpppspppsspppppphpppspppssspppppppppppppppppphhpppsppppppp
    pphpppsssppphpppsppphpppspppsspppppphpppspppssspppppppppppppppppphpppsppphpp
    phpppssspppppphpppsppphpppssppphhhppphhppphhppphhpppssppphpppssssppphhppphhp
    ppssssppppppppphpppssssppphpppssppphhhpppssppphhppphhpppshpppppphpppssppphpp
    phpppssppphpppspppppphhhpppppphpppssppphhppphhpppshhpppsppphhpppppphpppssppp
    hhpppsppphppphpppsppppppppppppppppppppppppppppppppppppspppsppppppppppppsssss
    sssssspppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppp
    pppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppphppphhhh
    hpppsppphhhpppppphhpppsppphhppphhpppssssppppppppphhppphhpppppphpppsppphpppsp
    pphpppppphpppsppphppphhhhhhhpppppphhpppspppshhhppphpppsssssppphpppssppphhppp
    shpppssppphhppphhhpppsssppphppphppphhpppssppphhzzzpppspppspppsspppsspppssppp
    sppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppp
    ppppppppppppppphzzzpppspppspppspppsppppppppppppppppppppppppppppppppppppppppp
    ppppppphzzzpppspppsssppppppppppppppppppppppppppphzzzpppspppspppspppspppppppp
    pppppppppppppppppppppppppppppppppppppppphzzzzzzppppppppphzzzppphpppssppphzzz
    pppspppsspppsspppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppp
    ppppppppppppppppppppppppphzppppppppppppppppppppppppppppppppppppppppppppppppp
    pppppppppppppppppppppppppppppppppppppppppppphzzzpppspppspppppppppppppppppppp
    phzzzpppsppppppppppppppphzzzpppssppphzzzpppsspppppphzzzpppspppsssspppppppppp
    pppppppphzzzpppspppssssppppppppphzpppspppsssppppppppppppppppppppppppppphzzpp
    phpppssppphzzzzpppsppppppppppppppphzzzpppspppppppppppppppppppppppphzzzpppspp
    ppppppppppppppppppppppppppppppphzzzzzzpppspppspppppppppppppppppppppppppppppp
    pppppppppppppppppphzzzpppspppppppppppppppppppppppppppppppppppppppppppppppppp
    ppppppphzzvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvy
    vyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvy
    vyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvyvy
    vyvyvyvyvyvyvyvyvyvyvyvyvyvya

    Program exited normally.
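
       A long trace like the one above is easier to compare between runs once the runs
       of identical letters are collapsed. The helper below is a small sketch for that
       purpose; summarize_trace() is a hypothetical name and is not part of tesseract.
       It skips whitespace, so the line breaks added by 'fold' and by the "r" letter do
       not split a run.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Collapse runs of the same trace letter, e.g. "eeeeer" becomes "e5 r".
std::string summarize_trace(const std::string& trace) {
  std::string out;
  char current = '\0';
  std::size_t run = 0;

  // Append the run collected so far, if any, to the summary.
  auto flush = [&]() {
    if (run == 0) return;
    if (!out.empty()) out += ' ';
    out += current;
    if (run > 1) out += std::to_string(run);
  };

  for (char c : trace) {
    if (std::isspace(static_cast<unsigned char>(c))) continue;
    if (run > 0 && c == current) {
      ++run;
      continue;
    }
    flush();
    current = c;
    run = 1;
  }
  flush();
  return out;
}
```

       For example, summarize_trace("eee\neer\nlqmmm") returns "e5 r l q m3".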

       BTW, I'm running an ancient Fedora 2 release; time to upgrade! :-)

       The End

     
    • Filip Gieszczykiewicz

      Can someone tell me what, in the context of tesseract, a "Pruner" is?
      For example, what does make_config_pruner() do and why?

       
      • JetsoftDev.com

        JetsoftDev.com - 2006-12-01

        I will look into this and get back to you tomorrow unless someone else answers it first.

        By the way, your documentation is very useful. It fills in some missing pieces.

         
      • Ray Smith

        Ray Smith - 2006-12-01

        make_config_pruner() is never called, so it is irrelevant.

        The class pruner is a pre-classifier that is used to create a short-list of classification candidates (pruning the possible classes) so that the full distance metric can be calculated on the short-list without taking excessive time, instead of exhaustively matching against each character possibility. The class pruner uses a faster, but approximate method of matching the features, so while it does make mistakes, the mistakes are rare.
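
        To make the idea concrete, here is a minimal two-stage sketch in C++. It is NOT tesseract's class pruner: the ClassTemplate struct, the cheap_score() shortcut (which looks at only the first feature) and the squared Euclidean full_distance() are all assumptions chosen for illustration. It only shows the shape of the technique: rank every class with a cheap approximate score, keep a short-list, and run the expensive metric on the short-list alone.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical template: one prototype feature vector per character class.
struct ClassTemplate {
  std::string label;
  std::vector<double> features;
};

// Stage 1 score: cheap and approximate, compares only the first feature.
double cheap_score(const std::vector<double>& x, const ClassTemplate& t) {
  if (x.empty() || t.features.empty()) return 0.0;
  const double d = x[0] - t.features[0];
  return d * d;
}

// Stage 2 score: full squared Euclidean distance over the shared features.
double full_distance(const std::vector<double>& x, const ClassTemplate& t) {
  const std::size_t n = std::min(x.size(), t.features.size());
  double sum = 0.0;
  for (std::size_t i = 0; i < n; ++i) {
    const double d = x[i] - t.features[i];
    sum += d * d;
  }
  return sum;
}

// Prune to the `shortlist_size` best classes by cheap score, then pick the
// class with the smallest full distance among those survivors.
std::string classify(const std::vector<double>& x,
                     const std::vector<ClassTemplate>& templates,
                     std::size_t shortlist_size) {
  if (templates.empty()) return "";
  std::vector<std::pair<double, std::size_t>> ranked;
  for (std::size_t i = 0; i < templates.size(); ++i) {
    ranked.emplace_back(cheap_score(x, templates[i]), i);
  }
  const std::size_t keep =
      std::min(std::max<std::size_t>(shortlist_size, 1), ranked.size());
  std::partial_sort(ranked.begin(),
                    ranked.begin() + static_cast<std::ptrdiff_t>(keep),
                    ranked.end());

  std::size_t best = ranked[0].second;
  double best_distance = full_distance(x, templates[best]);
  for (std::size_t k = 1; k < keep; ++k) {
    const double d = full_distance(x, templates[ranked[k].second]);
    if (d < best_distance) {
      best_distance = d;
      best = ranked[k].second;
    }
  }
  return templates[best].label;
}
```

        With templates a = {0, 0}, b = {1, 5}, c = {9, 9} and the sample {1, 1}, a short-list of two picks "a", the same answer as exhaustive matching. A short-list of one picks "b", because the cheap score alone is fooled; this is the kind of mistake a pruner accepts in exchange for speed.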

         
    • Filip Gieszczykiewicz

      I didn't see make_config_pruner() called, either, but there are some structures that
      refer to it. Thanks for the answer, it resolves some fuzzy areas in the code for me.

       
    • Filip Gieszczykiewicz

      Actually, I have a bunch more pointed questions. I won't ask them all at the same time,
      don't worry :-)

      I have two questions. First, what is a "configuration", a la ConvertConfig(), and where does it fit into the scheme of things?

      I am trying to understand which parts of tesseract are for the recognition-end (ie. what an end-user might interact with) and design-end (training, adaptation, etc.). Sometimes it's hard for me to see which category any particular function falls under - without firing up gdb and putting a break on it :-) (but that has its own set of problems).

      The second question is easy. Is the Starbase side FUNCTIONAL, or was it still in the process of development? That is, if I were to tie in an sb server clone, would tesseract fire up windows and show me various stages (say, segmentation)? I do suspect some defines need to be tweaked to get it to fire up.

      Thank you!

       
    • Drtungbo

      Drtungbo - 2007-02-09

      I cannot download this document anymore; all the links are dead. Please re-upload it.
      Thanks!

       
