Table Structure Recognition Library Code
Status: Beta
Brought to you by:
haimcohen
File | Date | Author | Commit |
---|---|---|---|
src | 2007-04-15 | haimcohen | [r17] Adding more test cases |
test | 2007-04-20 | haimcohen | [r21] more test cases |
util | 2007-04-08 | haimcohen | [r15] Adding the LGPL license |
COPYING | 2007-04-08 | haimcohen | [r15] Adding the LGPL license |
Makefile | 2007-04-15 | haimcohen | [r19] Adding 'rel' target for release file |
README | 2007-04-15 | haimcohen | [r18] backup |
================================================================================
Tibet - Table Structure Recognition Library
Copyright (C) 2007 Haim Cohen

This library is free software; you can redistribute it and/or
modify it under the terms of the GNU Lesser General Public
License as published by the Free Software Foundation; either
version 2.1 of the License, or (at your option) any later version.

This library is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
Lesser General Public License for more details.

You should have received a copy of the GNU Lesser General Public
License along with this library; if not, write to the Free Software
Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA

Contact Information : Haim Cohen - haimico <at> gmail <dot> com
================================================================================

Tibet is a library for table structure extraction. It was built as the final
class project for CS6998 - Search Engine Technology - at SEAS, Columbia
University in the City of New York.

https://sourceforge.net/projects/tibet/

== BUILDING ==

Change directory into the main directory - tibet - and run make.

There are 3 subdirectories:

  src  - source code for the library
  test - regression tests directory
  util - utilities that show how to use the library

Regression tests use the simple 'extract' utility, which reads a table from
standard input and prints its structure to standard output.

== Example ==

test/test1.tab contains the following table:

  ------------------
  Green  | 10 | 2.5
  Apple  |    |
  ------------------
  Banana |    | 0.5
  ------------------
  Orange |1   |1.25
  ------------------
  Lemon  |   4| 0.8

Using the library with 'extract' is straightforward:

  cat test/test1.tab | ./util/extract

The output is the table structure, formatted with newlines and delimiters.

  Green Apple|10|2.5
  Banana||0.5
  Orange|1|1.25
  Lemon|4|0.8

== Algorithm ==

Table extraction combines a little of both image processing and information
retrieval.

-- Loading the raw table --

The table is read into a grid of characters, so each character can be treated
much like a pixel in an image.

-- Finding table boundaries --

For each row and column in the grid, we build a histogram of how its characters
are distributed among "spaces", "delimiters" and "other characters". The
leftmost and rightmost columns, and the uppermost and lowermost rows, whose
"other characters" percentage is 15% or higher are taken as the table
boundaries.

-- Finding frames --

Once the table boundaries are determined, the row and column histograms are
recalculated over the table area only. Each row or column with 30% or more
delimiters is considered a table frame. A frame may or may not exist on the
table boundaries.

-- Extracting cell contents --

After the table frames are identified, cells are extracted based on the
location of the frames.

== Limitations ==

Currently, delimiters must be used in the table. Tables that don't use
delimiters for their columns or rows won't be extracted properly.

The library currently doesn't recognize special cells as column or row
headers, and just treats them like any other cell.
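
== Illustrative sketch ==

The steps in the Algorithm section can be sketched in a few lines of Python.
This is not Tibet's source code and does not use its API; every name below
(load_grid, share, bounds, bands, extract) is hypothetical. The README does not
say which characters count as delimiters, so the sketch assumes '-' and '='
mark row frames and '|' marks column frames. Counting them separately is also
an assumption: in the sample table the three rule lines alone would otherwise
give every column 37.5% delimiters, above the 30% threshold.

```python
SPACE = " "
ROW_DELIMS = "-="   # assumption: characters that draw horizontal frames
COL_DELIMS = "|"    # assumption: character that draws vertical frames
OTHER_MIN = 0.15    # boundary threshold from the README
FRAME_MIN = 0.30    # frame threshold from the README


def load_grid(text):
    """Read the raw table into a rectangular grid of characters."""
    lines = text.splitlines()
    width = max((len(line) for line in lines), default=0)
    return [line.ljust(width) for line in lines]


def columns(grid):
    """Transpose the grid so that each column becomes a string."""
    return ["".join(col) for col in zip(*grid)]


def share(chars, wanted):
    """Fraction of the characters for which wanted(ch) is true."""
    return sum(1 for ch in chars if wanted(ch)) / len(chars) if chars else 0.0


def is_other(ch):
    return ch != SPACE and ch not in ROW_DELIMS + COL_DELIMS


def bounds(lines):
    """First and last index whose 'other characters' share reaches 15%."""
    hits = [i for i, line in enumerate(lines)
            if share(line, is_other) >= OTHER_MIN]
    return (hits[0], hits[-1]) if hits else None


def bands(lines, delims):
    """Group consecutive non-frame indices; frame lines separate the groups."""
    groups, current = [], []
    for i, line in enumerate(lines):
        if share(line, lambda ch: ch in delims) >= FRAME_MIN:
            if current:
                groups.append(current)
            current = []
        else:
            current.append(i)
    if current:
        groups.append(current)
    return groups


def extract(text):
    """Return the table as a list of records, each a list of cell strings."""
    grid = load_grid(text)
    rows, cols = bounds(grid), bounds(columns(grid))
    if rows is None or cols is None:
        return []
    # Recompute everything over the table area only.
    table = [line[cols[0]:cols[1] + 1] for line in grid[rows[0]:rows[1] + 1]]
    row_bands = bands(table, ROW_DELIMS)
    col_bands = bands(columns(table), COL_DELIMS)
    result = []
    for rb in row_bands:
        record = []
        for cb in col_bands:
            parts = [table[r][cb[0]:cb[-1] + 1].strip() for r in rb]
            # A cell spanning several text lines is joined with spaces.
            record.append(" ".join(p for p in parts if p))
        result.append(record)
    return result


SAMPLE = "\n".join([
    "------------------",
    "Green  | 10 | 2.5",
    "Apple  |    |",
    "------------------",
    "Banana |    | 0.5",
    "------------------",
    "Orange |1   |1.25",
    "------------------",
    "Lemon  |   4| 0.8",
])

if __name__ == "__main__":
    for record in extract(SAMPLE):
        print("|".join(record))
```

Run on the sample table above, the sketch prints the same four lines shown in
the Example section. The top rule line falls outside the detected boundaries
because it contains no "other characters", which matches the note that a frame
may or may not exist on the table boundaries.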