Add text bounding box detection

Status: Pre-Alpha

Brought to you by: kale4, yet

#1 Add text bounding box detection

Status: open

Owner: yet

Labels: Algorithm improvement (1)

Priority: 5

Updated: 2005-03-31

Created: 2005-03-31

Creator: yet

Private: No

Assuming that the page does not have to be split, add
an algorithm for detection of the text bounding box.
Basically, the idea of the algorithm is the following:

1) The Radon transform yields not just the skew angle,
but also the projection of the image along the text
lines. This projection can be interpreted as a
one-dimensional "text signal", and any part of the
image that is not correlated with this signal should be
treated as noise and filtered out.

2) The filter should also trim low spatial frequencies
in the direction perpendicular to the text lines.

3) Filtering consists of multiplying the image by the
filter function and integrating the result in the
direction perpendicular to the text lines. One expects
the signal coming from the text to be enhanced and that
of the vertical strips on the sides of the page to be
suppressed.

4) Once the filtering is done, we need a reasonable
heuristic to find the horizontal coordinates of the
bounding box.

5) The vertical coordinates of the bounding box are
trickier to find (because horizontal black strips
resemble legitimate features of printed text).
Basically, I would propose just removing the black
connected components adjacent to the horizontal edges
of the image. 8-connectivity is probably more
appropriate than 4-connectivity for this task.
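
Step 5 could be sketched roughly as follows. This is only an
illustration, not code from the project: the function name
remove_edge_components and the image representation (a list of
rows, 1 = black, 0 = white) are assumptions made for the example.
It flood-fills from every black pixel on the top and bottom rows
using 8-connectivity and clears whatever it reaches.

```python
from collections import deque

# Offsets of all eight neighbours of a pixel (8-connectivity).
NEIGHBOURS_8 = [(-1, -1), (-1, 0), (-1, 1),
                (0, -1),           (0, 1),
                (1, -1),  (1, 0),  (1, 1)]


def remove_edge_components(image):
    """Return a copy of `image` with every black 8-connected component
    that touches the top or bottom row cleared to white.

    `image` is a list of equal-length rows; 1 is black, 0 is white.
    (Hypothetical helper, named for this sketch only.)
    """
    height = len(image)
    width = len(image[0]) if height else 0
    result = [list(row) for row in image]

    # Seed the flood fill with black pixels on the two horizontal edges.
    queue = deque()
    seed_rows = {0, height - 1} if height else set()
    for y in seed_rows:
        for x in range(width):
            if result[y][x]:
                result[y][x] = 0
                queue.append((y, x))

    # Breadth-first search: clear everything connected to a seed pixel.
    while queue:
        y, x = queue.popleft()
        for dy, dx in NEIGHBOURS_8:
            ny, nx = y + dy, x + dx
            if 0 <= ny < height and 0 <= nx < width and result[ny][nx]:
                result[ny][nx] = 0
                queue.append((ny, nx))
    return result
```

With 4-connectivity a strip that continues only through a
diagonal pixel would survive, which is why the eight-neighbour
offsets are used here.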

This algorithm requires the following missing utilities:
a) Connected component analysis
b) Basic 2D linear algebra (vector and scalar products)
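
For utility (b), a minimal sketch of what the 2D helpers might
look like, and how the scalar product could be used to
accumulate the projection along the text lines from step 1. All
names here (dot, cross, projection_profile) and the binning by
rounding are assumptions for illustration; this is a crude
single-angle stand-in for a Radon projection, not the project's
implementation.

```python
import math


def dot(a, b):
    """Scalar product of two 2D vectors given as (x, y) tuples."""
    return a[0] * b[0] + a[1] * b[1]


def cross(a, b):
    """Vector product of two 2D vectors, returned as its single
    (z) component."""
    return a[0] * b[1] - a[1] * b[0]


def projection_profile(image, skew_radians):
    """Project black pixels along text lines tilted by `skew_radians`.

    `image` is a list of rows; 1 is black, 0 is white. Each black pixel
    is binned by its (rounded) coordinate along the normal to the text
    direction, so pixels on the same text line land in the same bin.
    Returns a dict mapping bin index to black-pixel count.
    """
    direction = (math.cos(skew_radians), math.sin(skew_radians))
    normal = (-direction[1], direction[0])  # perpendicular to the lines
    profile = {}
    for y, row in enumerate(image):
        for x, pixel in enumerate(row):
            if pixel:
                bin_index = round(dot((x, y), normal))
                profile[bin_index] = profile.get(bin_index, 0) + 1
    return profile
```

For an unskewed page (skew_radians = 0) this reduces to counting
the black pixels in each row.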
