Re: [Ankur-core] Bangla OCR progress

On 5/9/09, Debayan Banerjee <deb...@gm...> wrote:
> 2009/5/9 Deepayan Sarkar <dee...@gm...>:
>
> > Debayan,
>  >
>  > I have been meaning to ask you: is your character segmentation
>  > algorithm in a form that could be easily separated out?
>
> The segmentation algorithm can be found here
>  (http://tesseractindic.googlecode.com/files/clipmatra_pseudocode.pdf)

But this is your original algorithm which segmented গ etc (at least
for some fonts), isn't it? I thought you had an improved algorithm
which works around some of those problems (or maybe I misunderstood
your mail).

> > If it could be
>  > easily done, I would like to try it out in BOCRA. Unfortunately, I
>  > don't think I will have enough time in the near future to figure out
>  > how ocropus/tesseract does things.
>
>
> Kindly read the paragraph in this
>
> (http://hacking-tesseract.blogspot.com/2009/05/bengali-stats.html)
>
> post regarding reducing number of character classes to be trained. I
>  want to know if this is possible using BOCRA.

No, it's not. From the beginning, my design for BOCRA was based on the
idea of on-the-fly training, because that's the only approach I
thought was feasible given the combination of non-standard fonts and
so many potential conjuncts. In most realistic examples, the number of
conjuncts is actually quite limited. After accounting for the most
common ones, the frequency of the rest is probably lower than the
normal OCR error rate anyway.
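
The point about rare conjuncts could be checked on a corpus with
something like the sketch below. This is not BOCRA code: the function
name residual_frequency and the counts are made up for illustration,
and the real numbers would have to come from an actual Bengali text
corpus.

```python
from collections import Counter

def residual_frequency(class_counts, top_k):
    """Fraction of all occurrences that fall outside the top_k most common classes."""
    total = sum(class_counts.values())
    if total == 0:
        return 0.0
    # Occurrences covered by the top_k most frequent classes.
    covered = sum(n for _, n in Counter(class_counts).most_common(top_k))
    return (total - covered) / total

# Hypothetical conjunct counts, NOT measured from any corpus.
toy_counts = {"ক্ষ": 50, "ন্ত": 30, "স্ত": 15, "ঙ্ক": 4, "ষ্ণ": 1}
rare_share = residual_frequency(toy_counts, top_k=3)
```

If rare_share comes out below the recognizer's usual per-character
error rate, training on the remaining conjuncts would gain little.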

-Deepayan