|
From: Debayan B. <deb...@gm...> - 2009-04-18 23:16:30
|
I take the liberty of top-posting, since I copied the mail's contents from the archives and bottom-posting would require messing with the text below too much.

In reply to this particular line:

"It takes the old "matra removal" approach, and he's facing the same problems I did (notice in his first example that গ is segmented into 2 parts, and শু is not)."

Kindly see
http://picasaweb.google.com/debayanin/TesseractIndicOCR#5325782929614608690

Below is the original conversation.

On 7/2/08, Golam Mortuza Hossain <[EMAIL PROTECTED]> wrote:
> On Wed, Jul 2, 2008 at 9:32 AM, Sayamindu Dasgupta <[EMAIL PROTECTED]>
> > This guy seems to be doing some interesting progress for a Bangla OCR
> > - or more precisely, enabling Bangla in Tesseract.
> > http://debayanin.googlepages.com/hackingtesseract

Cool. I had some interaction with the tesseract/ocropus folks, and it sounded like a good base. It's nice that someone's actually doing something with it. It takes the old "matra removal" approach, and he's facing the same problems I did (notice in his first example that গ is segmented into 2 parts, and শু is not). On the other hand, having something that works even partly is a good start.

> Yes, it looks definitely interesting.
>
> > Looks like he needs some more training data - can we provide him with some?
>
> If I remember correctly, there was a sample file for testing completeness
> of Bengali fonts. Since it has all letters and conjuncts typed-in, the
> file might be useful for training Tesseract as well.
>
> Deepayan should be able to give some input here. He has working experience
> with R and may have some training sample as well.

Well, we have a bunch of unicode documents. For some of them, I have print versions too, and can scan them if needed. A simpler approach would be to render them using different fonts and take screenshots.

Apparently he also needs some box-files, whatever they are, which need to be produced using tesseract. I haven't installed tesseract yet, and will try, but let me know if anyone else manages.

-Deepayan

--
Be Intelligent, Use GNU/Linux
http://debayanin.googlepages.com/
http://debayan.wordpress.com
http://lug.nitdgp.ac.in
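On the box-files mentioned in the quoted mail: as far as I know, a Tesseract box file is a plain-text file with one line per character, giving the character followed by its left, bottom, right and top pixel coordinates (origin at the bottom-left of the image), and newer versions append a page number. The following is only a minimal sketch of reading such lines; the `Box` type, the `parse_box_lines` helper and the sample coordinates are invented for illustration and are not taken from any file in this thread.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    """One character box: coordinates in pixels, origin at bottom-left."""
    char: str
    left: int
    bottom: int
    right: int
    top: int


def parse_box_lines(text: str) -> List[Box]:
    """Parse box-file text into Box records (hypothetical helper)."""
    boxes = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 5:
            continue  # skip blank or malformed lines
        # A trailing page number, if present, is ignored here.
        left, bottom, right, top = (int(p) for p in parts[1:5])
        boxes.append(Box(parts[0], left, bottom, right, top))
    return boxes


# Made-up sample: two characters with invented coordinates.
sample = "গ 10 20 42 60\nশু 50 18 95 62\n"
boxes = parse_box_lines(sample)
for box in boxes:
    print(box.char, box.right - box.left, box.top - box.bottom)
```

Each record pairs a glyph image region with its Unicode text, which is what the training step needs.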
|
From: Salahuddin P. <sal...@gm...> - 2009-04-19 05:14:05
|
On Apr 19, 2009, at 5:16 AM, Debayan Banerjee wrote:

> In reply to this particular line:
> "It takes the old "matra removal" approach, and he's
> facing the same problems I did (notice in his first example that গ is
> segmented into 2 parts, and শু is not)."
>
> Kindly see
> http://picasaweb.google.com/debayanin/TesseractIndicOCR#5325782929614608690

Dear all,

I was working with OCR for my university. I took most of the idea from bocra.sourceforge.net.

It is written using the graphicsmagick library & C++. Any suggestions from you about matching the alphabet?

Here is my progress:
http://picasaweb.google.com/salahuddin66/OCR#

regards
salahuddin
salahuddin66.blogspot.com
|
From: Debayan B. <deb...@gm...> - 2009-04-19 13:17:23
|
Dear Salahuddin,

> I was working with OCR for my university. I took most of the idea
> from bocra.sourceforge.net
>
> It is written using graphicsmagick library & C++. Any suggestion from
> you about matching alphabet.

You now need a recogniser. You could use a neural network library or an adaptive classifier. Tesseract-OCR, the one I am trying to adapt, previously used a neural net named aspirine/migraine and then switched to a nearest-neighbour based adaptive classifier engine. This switch was made due to licensing issues with aspirine, I believe.

The challenge, of course, is not to build a recogniser, since you can use one of the available ones. The challenge is to gather sufficient training data or, better yet, to create a tool that, given a font name and size, automatically generates training data for this OCR system in a matter of seconds using image rendering. I have been trying to do this, but my initial approach was wrong. However, I believe I now know the correct approach. Kindly go through http://hacking-tesseract.blogspot.com/.

--
Be Intelligent, Use GNU/Linux
http://debayanin.googlepages.com/
http://debayan.wordpress.com
http://lug.nitdgp.ac.in
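To make the "nearest-neighbour based adaptive classifier" idea concrete, here is a toy sketch. It is not Tesseract's classifier: the `NearestNeighbour` class, the Hamming distance on tiny ASCII bitmaps and the 3x3 sample glyphs are all assumptions chosen to keep the example self-contained.

```python
from typing import List, Tuple

Bitmap = Tuple[str, ...]  # rows of '#' (ink) and '.' (background)


def hamming(a: Bitmap, b: Bitmap) -> int:
    """Count the pixels that differ between two same-sized bitmaps."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


class NearestNeighbour:
    """Toy classifier: label a glyph by its closest stored sample."""

    def __init__(self) -> None:
        self.samples: List[Tuple[Bitmap, str]] = []

    def train(self, bitmap: Bitmap, label: str) -> None:
        # "Adaptive" here only means new samples can be added at any time.
        self.samples.append((bitmap, label))

    def classify(self, bitmap: Bitmap) -> Tuple[str, int]:
        """Return (label, distance) of the nearest stored sample."""
        best = min(self.samples, key=lambda s: hamming(s[0], bitmap))
        return best[1], hamming(best[0], bitmap)


clf = NearestNeighbour()
clf.train((".#.", ".#.", ".#."), "I")
clf.train(("#..", "#..", "###"), "L")

noisy_i = (".#.", ".#.", "##.")  # an "I" with one stray pixel
label, distance = clf.classify(noisy_i)
print(label, distance)
```

With a recogniser this simple, accuracy depends almost entirely on how many labelled samples are stored, which is why the training data is the hard part.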
|
From: srhaque <sr...@th...> - 2009-04-19 20:41:35
Attachments:
juktakkhor.txt
|
BTW, if you still need my test file with conjunct samples, here it is...
|
From: Debayan B. <deb...@gm...> - 2009-05-08 21:19:10
|
2009/4/20 srhaque <sr...@th...>:
> BTW, if you still need my test file with conjunct samples, here it is...

Thank you very much. They have proved *very helpful* :)
I prepared this post with the help of your document:
http://hacking-tesseract.blogspot.com/2009/05/bengali-stats.html

--
Regards,
Debayan Banerjee
Support Free Software
http://deeproot.in
|
From: srhaque <sr...@th...> - 2009-05-08 21:39:20
|
On Friday 08 May 2009, Debayan Banerjee wrote:
> 2009/4/20 srhaque <sr...@th...>:
> > BTW, if you still need my test file with conjunct samples, here it is...
>
> Thank you very much. They have proved *very helpful* :)
> I preapred this
> (http://hacking-tesseract.blogspot.com/2009/05/bengali-stats.html)
> post with the help of your document.

Cool. If it is of any use, then note that my Raga font also has glyphs for all the conjuncts (though I've not done anything with the advanced tables to refine the font generally).

I've been thinking about OCR for a little while too, and am doing some little experiments here and there based on trying to apply brute force to simple algorithms for deskewing/text-block extraction/segmentation. However, I'm a bit stuck for inspiration on that front for now, so if there is anything I can do to help *you*, please let me know.

Thanks, Shaheed
|
From: Deepayan S. <dee...@gm...> - 2009-05-09 02:20:44
|
Debayan,

I have been meaning to ask you: is your character segmentation algorithm in a form that could be easily separated out? If it could be easily done, I would like to try it out in BOCRA. Unfortunately, I don't think I will have enough time in the near future to figure out how ocropus/tesseract does things.

-Deepayan
|
From: Debayan B. <deb...@gm...> - 2009-05-09 15:20:06
|
2009/5/9 Deepayan Sarkar <dee...@gm...>:
> Debayan,
>
> I have been meaning to ask you: is your character segmentation
> algorithm in a form that could be easily separated out?

The segmentation algorithm can be found here:
http://tesseractindic.googlecode.com/files/clipmatra_pseudocode.pdf

> If it could be
> easily done, I would like to try it out in BOCRA. Unfortunately, I
> don't think I will have enough time in the near future to figure out
> how ocropus/tesseract does things.

Kindly read the paragraph in this post regarding reducing the number of character classes to be trained:
http://hacking-tesseract.blogspot.com/2009/05/bengali-stats.html
I want to know if this is possible using BOCRA.

> -Deepayan

--
Regards,
Debayan Banerjee
Support Free Software
http://deeproot.in
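For readers unfamiliar with the "matra removal" approach discussed in this thread, the general idea can be sketched as follows. This is not the clipmatra pseudocode linked above (that PDF is not reproduced here); it is a generic projection-profile illustration in which the function names and the toy ASCII bitmap are assumptions.

```python
from typing import List, Tuple


def find_matra_row(rows: List[str]) -> int:
    """Return the index of the row with the most ink ('#') pixels."""
    counts = [row.count("#") for row in rows]
    return max(range(len(rows)), key=counts.__getitem__)


def remove_matra(rows: List[str]) -> List[str]:
    """Blank out the densest row, assumed to be the headline (matra)."""
    matra = find_matra_row(rows)
    return ["." * len(row) if i == matra else row for i, row in enumerate(rows)]


def split_columns(rows: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end) column ranges separated by ink-free columns."""
    width = len(rows[0])
    ink = [any(row[x] == "#" for row in rows) for x in range(width)]
    segments, start = [], None
    for x, has_ink in enumerate(ink):
        if has_ink and start is None:
            start = x
        elif not has_ink and start is not None:
            segments.append((start, x))
            start = None
    if start is not None:
        segments.append((start, width))
    return segments


# Toy "word": two glyphs joined by a headline across the top row.
word = [
    "#########",
    ".#.....#.",
    ".##...##.",
    ".#.....#.",
]
print(split_columns(word))                # one connected block
print(split_columns(remove_matra(word)))  # two glyphs once the matra is gone
```

A purely column-based split like this one breaks a letter apart whenever it has an ink-free column below the headline, which is the kind of failure reported for গ earlier in the thread.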
|
From: Deepayan S. <dee...@gm...> - 2009-05-09 17:26:32
|
On 5/9/09, Debayan Banerjee <deb...@gm...> wrote:
> 2009/5/9 Deepayan Sarkar <dee...@gm...>:
> > Debayan,
> >
> > I have been meaning to ask you: is your character segmentation
> > algorithm in a form that could be easily separated out?
>
> The segmentation algorithm can be found here
> (http://tesseractindic.googlecode.com/files/clipmatra_pseudocode.pdf)

But this is your original algorithm, which segmented গ etc. (at least for some fonts), isn't it? I thought you had an improved algorithm which works around some of those problems (or maybe I misunderstood your mail).

> > If it could be
> > easily done, I would like to try it out in BOCRA. Unfortunately, I
> > don't think I will have enough time in the near future to figure out
> > how ocropus/tesseract does things.
>
> Kindly read the paragraph in this
> (http://hacking-tesseract.blogspot.com/2009/05/bengali-stats.html)
> post regarding reducing number of character classes to be trained. I
> want to know if this is possible using BOCRA.

No, it's not. From the beginning, my design for BOCRA was based on the idea of on-the-fly training, because that's the only approach I thought was feasible given the combination of non-standard fonts and so many potential conjuncts.

In most realistic examples, the number of conjuncts is actually quite limited. After accounting for the most common ones, the frequency of the rest is probably lower than the normal OCR error rate anyway.

-Deepayan
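The claim that a few conjuncts account for most occurrences can be checked on any Unicode Bengali text. Below is a rough sketch, assuming a conjunct can be approximated as consonants joined by hasanta (U+09CD); the regular expression, the `coverage` helper and the short sample string are illustrative assumptions, not data from this thread.

```python
import re
from collections import Counter

# Bengali consonants (U+0995..U+09B9 plus the precomposed nukta letters)
# and the hasanta sign that joins them into conjuncts.
CONSONANT = "[\u0995-\u09B9\u09DC\u09DD\u09DF]"
HASANTA = "\u09CD"
CONJUNCT_RE = re.compile(f"{CONSONANT}(?:{HASANTA}{CONSONANT})+")


def conjunct_counts(text: str) -> Counter:
    """Count each consonant+hasanta+consonant cluster found in the text."""
    return Counter(CONJUNCT_RE.findall(text))


def coverage(counts: Counter, top_n: int) -> float:
    """Fraction of all conjunct occurrences covered by the top_n most common."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(n for _, n in counts.most_common(top_n)) / total


# Tiny made-up sample; a real check would use a large corpus.
sample = "বাংলা ভাষার শব্দ এবং অক্ষর, শব্দ ও ছন্দ"
counts = conjunct_counts(sample)
for conjunct, n in counts.most_common():
    print(conjunct, n)
print(coverage(counts, 1))
```

Running `coverage` for increasing `top_n` on a large corpus would show how quickly the remaining conjuncts become rarer than the recogniser's own error rate.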