Tibet - Table Structure Recognition Library
Copyright (C) 2007 Haim Cohen
This library is free software; you can redistribute it and/or
modify it under the terms of the GNU Lesser General Public
License as published by the Free Software Foundation; either
version 2.1 of the License, or (at your option) any later version.
This library is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
Lesser General Public License for more details.
You should have received a copy of the GNU Lesser General Public
License along with this library; if not, write to the Free Software
Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
Contact Information : Haim Cohen - haimico <at> gmail <dot> com
Tibet is a library for table structure extraction. It was built as the final
class project for CS6998 - Search Engine Technology - at SEAS, Columbia
University in the City of New York.
== BUILDING ==
Change directory into the main directory - tibet - and run make.
There are 3 subdirectories:
src - source code for the library
test - regression tests directory
util - utilities that show how to use the library
The regression tests use the simple 'extract' utility, which reads a table from standard
input and prints its structure to standard output.
== Example ==
test/test1.tab contains the following table:
Green | 10 | 2.5
Apple | |
Banana | | 0.5
Orange |1 |1.25
Lemon | 4| 0.8
Using the library with 'extract' is pretty straightforward:
cat test/test1.tab | ./util/extract
The output is the table structure, formatted with newlines and delimiters.
== Algorithm ==
Table extraction combines a little of both image processing and information retrieval.
-- Loading the raw table --
The table is read into a grid of characters, so we can treat each character in a way similar to
a pixel in an image.
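The loading step can be pictured with a small Python sketch. It is only an
illustration, not the library's own code: the name load_grid and the choice to
pad short lines with spaces are assumptions made here so that every row has the
same width.

```python
def load_grid(text):
    """Turn raw table text into a rectangular grid of single characters.

    Assumption (not from the library): short lines are padded with spaces
    so that every row has the same number of columns.
    """
    lines = text.splitlines()
    width = max((len(line) for line in lines), default=0)
    return [list(line.ljust(width)) for line in lines]


# Each grid[row][col] is one character, much like a pixel in an image.
grid = load_grid("Apple | 10\nKiwi |\n")
print(len(grid), len(grid[0]))
```

With a rectangular grid, rows and columns can be scanned in the same way,
which is what the later steps rely on.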
-- Finding table boundaries --
For each row and column in the grid, we create a histogram that gives the distribution
of its characters among "spaces", "delimiters" and "other characters".
The leftmost and rightmost columns and the uppermost and lowermost rows whose
"other characters" percentage is at least 15% are considered to be the table boundaries.
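A minimal Python sketch of this step follows. It is an illustration under stated
assumptions, not the library's implementation: the DELIMITERS set and the
function names classify, histogram and find_bounds are hypothetical, and the
grid is assumed to be rectangular.

```python
# Assumption: the source does not list the delimiter characters; this set is a guess.
DELIMITERS = set("|-+=")


def classify(ch):
    """Put a character into one of the three histogram classes."""
    if ch.isspace():
        return "space"
    if ch in DELIMITERS:
        return "delimiter"
    return "other"


def histogram(cells):
    """Return the share of each character class in one row or column."""
    counts = {"space": 0, "delimiter": 0, "other": 0}
    for ch in cells:
        counts[classify(ch)] += 1
    total = len(cells) or 1
    return {name: count / total for name, count in counts.items()}


def find_bounds(grid, threshold=0.15):
    """Return (top, bottom, left, right) indices of the table, or None.

    A row or column qualifies when its share of "other" characters
    is at least the threshold (15% in the description above).
    """
    rows = [histogram(row)["other"] >= threshold for row in grid]
    cols = [histogram(col)["other"] >= threshold for col in zip(*grid)]
    if not any(rows) or not any(cols):
        return None
    top = rows.index(True)
    bottom = len(rows) - 1 - rows[::-1].index(True)
    left = cols.index(True)
    right = len(cols) - 1 - cols[::-1].index(True)
    return top, bottom, left, right


sample = [list(line) for line in [
    "          ",
    "  ab | cd ",
    "  ef | gh ",
    "          ",
]]
bounds = find_bounds(sample)
print(bounds)
```

Here the blank margin rows and columns fall below the threshold, so the
boundaries close in on the rows and columns that carry actual text.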
-- Finding frames --
Once the table boundaries are determined, the row and column histograms are recalculated
over the table only.
Each row or column with 30% or more delimiters is considered to be a table frame.
Frames may or may not exist on the table boundaries.
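The frame test can be sketched in Python as below. This is a hedged
illustration only: the DELIMITERS set and the names delimiter_share and
find_frames are assumptions, and the boundaries are passed in as inclusive
grid indices.

```python
# Assumption: the source does not list the delimiter characters; this set is a guess.
DELIMITERS = set("|-+=")


def delimiter_share(cells):
    """Fraction of delimiter characters in one row or column."""
    if not cells:
        return 0.0
    return sum(ch in DELIMITERS for ch in cells) / len(cells)


def find_frames(grid, top, bottom, left, right, threshold=0.30):
    """Return (frame_rows, frame_cols) as indices into the full grid.

    The shares are computed over the table area only, i.e. the inclusive
    ranges top..bottom and left..right.
    """
    table = [row[left:right + 1] for row in grid[top:bottom + 1]]
    frame_rows = [top + i for i, row in enumerate(table)
                  if delimiter_share(row) >= threshold]
    frame_cols = [left + j for j, col in enumerate(zip(*table))
                  if delimiter_share(col) >= threshold]
    return frame_rows, frame_cols


sample = [list(line) for line in [
    "ab | cd",
    "---+---",
    "ef | gh",
    "ij | kl",
    "mn | op",
    "qr | st",
]]
frames = find_frames(sample, 0, 5, 0, 6)
print(frames)
```

In this sample the ruled line is a frame row and the column of '|' characters
is a frame column. Note that in a very short table a single ruled line can push
every column over the 30% mark, so the threshold works best on tables with
several rows.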
-- Extract cells contents --
After identifying the table frames, cells are extracted based on the location of the frames.
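One way to picture the extraction step is the Python sketch below. It is not
the library's code: the name extract_cells is hypothetical, and it makes the
simplifying assumption that every non-frame text line is one table row, which
is then cut at the frame columns.

```python
def extract_cells(grid, frame_rows, frame_cols):
    """Cut each non-frame row at the frame columns and return the cell texts.

    Assumption: every text line that is not a frame row is one table row.
    """
    frame_rows = set(frame_rows)
    width = len(grid[0]) if grid else 0
    # Virtual frames just outside the grid close the first and last cells.
    cuts = [-1] + sorted(frame_cols) + [width]
    table = []
    for r, row in enumerate(grid):
        if r in frame_rows:
            continue
        cells = []
        for start, end in zip(cuts, cuts[1:]):
            # Skip empty spans, e.g. a frame sitting on the table boundary.
            if end - start > 1:
                cells.append("".join(row[start + 1:end]).strip())
        table.append(cells)
    return table


sample = [list(line) for line in [
    "ab | cd",
    "---+---",
    "ef |   ",
]]
cells = extract_cells(sample, frame_rows=[1], frame_cols=[3])
print(cells)
```

Empty cells are kept as empty strings so that every row has the same number of
cells.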
== Limitations ==
Currently delimiters must be used in the table. Tables that don't use delimiters
for their columns or rows won't be extracted properly.
The library currently doesn't recognize special cells such as column / row headers, and just
treats them like any other cells.