From: Deepayan S. <dee...@gm...> - 2009-05-09 17:26:32
On 5/9/09, Debayan Banerjee <deb...@gm...> wrote:
> 2009/5/9 Deepayan Sarkar <dee...@gm...>:
> > Debayan,
> >
> > I have been meaning to ask you: is your character segmentation
> > algorithm in a form that could be easily separated out?
>
> The segmentation algorithm can be found here
> (http://tesseractindic.googlecode.com/files/clipmatra_pseudocode.pdf)

But this is your original algorithm, which segmented গ etc. (at least
for some fonts), isn't it? I thought you had an improved algorithm that
works around some of those problems (or maybe I misunderstood your
mail).

> > If it could be
> > easily done, I would like to try it out in BOCRA. Unfortunately, I
> > don't think I will have enough time in the near future to figure out
> > how ocropus/tesseract does things.
>
> Kindly read the paragraph in this
> (http://hacking-tesseract.blogspot.com/2009/05/bengali-stats.html)
> post regarding reducing number of character classes to be trained. I
> want to know if this is possible using BOCRA.

No, it's not. From the beginning, my design for BOCRA was based on the
idea of on-the-fly training, because that's the only approach I thought
was feasible given the combination of non-standard fonts and so many
potential conjuncts.

In most realistic examples, the number of conjuncts is actually quite
limited. After accounting for the most common ones, the frequency of
the rest is probably lower than the normal OCR error rate anyway.

-Deepayan