Thread: Re: [Ankur-core] Bangla OCR progress

I take the liberty of top-posting since I copied the mail's contents
from the archives, and bottom-posting would require messing with the text
below too much. In reply to this particular line:
" It takes the old "matra removal" approach, and he's
facing the same problems I did (notice in his first example that গ is
segmented into 2 parts, and শু is not)."

Kindly see http://picasaweb.google.com/debayanin/TesseractIndicOCR#5325782929614608690.

Below is the original conversation.

On 7/2/08, Golam Mortuza Hossain <[EMAIL PROTECTED]> wrote:
> On Wed, Jul 2, 2008 at 9:32 AM, Sayamindu Dasgupta <[EMAIL PROTECTED]> wrote:
>
> > This guy seems to be making some interesting progress on a Bangla OCR
>  > - or more precisely, enabling Bangla in Tesseract.
>  > http://debayanin.googlepages.com/hackingtesseract

Cool. I had some interaction with the tesseract/ocropus folks, and it
sounded like a good base. It's nice that someone's actually doing
something with it. It takes the old "matra removal" approach, and he's
facing the same problems I did (notice in his first example that গ is
segmented into 2 parts, and শু is not). On the other hand, having
something that works even partly is a good start.
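
(For readers unfamiliar with the term: "matra removal" usually means
blanking out the headline that joins the letters of a word, then cutting
the word wherever a blank column appears. Below is a minimal sketch of
that idea on a toy bitmap. It is not the code from the page linked above;
the names WORD, remove_matra and split_columns are made up for this
illustration, and the bitmap is invented. A glyph whose strokes touch only
at the matra would be cut in two by this step, which is consistent with
the গ example.)

```python
# Toy word image: 1 = ink, 0 = background. The top row plays the matra.
WORD = [
    [1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0, 1, 0],
    [1, 1, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 0, 1, 0],
]


def remove_matra(rows):
    """Blank the row with the most ink, assumed here to be the matra.

    Returns (index of the blanked row, copy of the image without it).
    """
    ink_per_row = [sum(row) for row in rows]
    matra = max(range(len(rows)), key=ink_per_row.__getitem__)
    cleaned = [list(row) for row in rows]
    cleaned[matra] = [0] * len(rows[matra])
    return matra, cleaned


def split_columns(rows):
    """Return (start, end) column ranges, end exclusive, that contain ink
    and are separated from each other by at least one blank column."""
    width = len(rows[0])
    ink_per_col = [sum(row[c] for row in rows) for c in range(width)]
    segments, start = [], None
    for c, ink in enumerate(ink_per_col):
        if ink and start is None:
            start = c
        elif not ink and start is not None:
            segments.append((start, c))
            start = None
    if start is not None:
        segments.append((start, width))
    return segments


matra_row, without_matra = remove_matra(WORD)
print(split_columns(WORD))           # one segment while the matra is present
print(split_columns(without_matra))  # separate segments once it is removed
```

(Picking the densest row is only a heuristic; a real page needs skew
correction and a per-line search before anything like this can work.)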

> Yes, it looks definitely interesting.
>
>  > Looks like he needs some more training data - can we provide him with some?
>
> If I remember correctly, there was a sample file for testing completeness
>  of Bengali fonts. Since it has all letters and conjuncts typed-in, the
>  file might be useful for training Tesseract as well.
>
>  Deepayan should be able to give some input here. He has working experience
>  with R and may have some training sample as well.

Well, we have a bunch of Unicode documents. For some of them, I have
print versions too, and can scan them if needed. A simpler approach
would be to render them using different fonts and take screenshots.

Apparently he also needs some box-files, whatever they are, which need
to be produced using tesseract. I haven't installed tesseract yet, and
will try, but let me know if anyone else manages.
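
(For context: a Tesseract box file is a plain-text listing with one line
per character in a training image, giving the character followed by the
left, bottom, right and top pixel coordinates of its bounding box,
measured from the bottom-left corner of the image. Later Tesseract
versions append a page number as a sixth field. The sketch below reads
and writes such lines; the Box type, the two function names and the
sample coordinates are assumptions made for illustration, not part of
Tesseract.)

```python
from typing import NamedTuple


class Box(NamedTuple):
    """One box-file entry: a symbol and its bounding box in pixels."""
    symbol: str
    left: int
    bottom: int
    right: int
    top: int


def parse_box_line(line):
    """Parse 'symbol left bottom right top [page]' into a Box.

    A trailing page number, if present, is ignored.
    """
    parts = line.split()
    left, bottom, right, top = (int(p) for p in parts[1:5])
    return Box(parts[0], left, bottom, right, top)


def format_box_line(box):
    """Write a Box back out in the five-field form."""
    return f"{box.symbol} {box.left} {box.bottom} {box.right} {box.top}"


sample = "গ 12 30 40 62"  # invented coordinates, for illustration only
box = parse_box_line(sample)
print(box.symbol, box.right - box.left, box.top - box.bottom)
```

(In practice the box file is generated by Tesseract from the training
image and then corrected by hand where a character was mis-segmented or
mislabelled.)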

-Deepayan

-- 
Be Intelligent, Use GNU/Linux

http://debayanin.googlepages.com/
http://debayan.wordpress.com
http://lug.nitdgp.ac.in
