Re: [Jocr-devels] cyrillic and other languages support plan
From: Joerg <Joe...@UR...> - 2006-09-07 21:45:56
> is it possible to discuss an implementation of Cyrillic support into
> gocr.
...
> Suppose we have 3 cases and I take Cyrillic as an example, but it
> could be any other language:
>
> 1) The whole document is in Cyrillic
> 2) few words are Cyrillic
> 3) few characters are Cyrillic
>
> My consideration is that recognising a Cyrillic character in case 3
> should lead to the presumption that case 2 and/or 1 may also be valid, but
> because we first recognise Latin letters we may have falsely
> recognised Cyrillic [es] 'c' as Latin [si] c and Cyrillic [a] a as
> Latin [a] a, because there is no difference, but for the correct
> output the correct code page should be set (Unicode range). Thus
> setting language probability on word and document level seems a good
> idea to me (maybe maintain a character frequency list).

I think such questions are not a hot topic at the current state of the
program. Sed would do a sufficient job.

> I wrote a script to create a Cyrillic db (create_db) and the needed
> header files. But right now I'm too occupied to work on it.

The db part of gocr is badly written. I did not think much about it.
The pixel-based algorithm will be replaced by a vector-based algorithm.

> I think a strategy should be set to support at least all of the
> European languages that are not far from each other and are going to be
> used in future pretty widely and mixed.
>
> what do you think?

My strategy is to lift the recognition of Latin chars to an
acceptable level before adding new chars or languages. db support for
other languages is the maximum I can do at the moment.

Joerg.
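
For illustration only, the word-level idea from the quoted message could be done as a post-processing step on gocr's text output, roughly as below. This is a minimal sketch and not part of gocr: the names `LATIN_TO_CYRILLIC`, `is_cyrillic`, `fix_word` and `fix_text` are hypothetical, the lookalike table is an assumed, incomplete list, and the rule (remap a word only if it holds at least one Cyrillic letter and every other letter is a Latin lookalike) is just one possible heuristic.

```python
# Hypothetical post-processing sketch; nothing here is gocr code.
# Assumed (incomplete) table of Latin letters that look like Cyrillic ones.
LATIN_TO_CYRILLIC = str.maketrans({
    "a": "\u0430", "c": "\u0441", "e": "\u0435", "o": "\u043e",
    "p": "\u0440", "x": "\u0445", "y": "\u0443",
    "A": "\u0410", "B": "\u0412", "C": "\u0421", "E": "\u0415",
    "H": "\u041d", "K": "\u041a", "M": "\u041c", "O": "\u041e",
    "P": "\u0420", "T": "\u0422", "X": "\u0425",
})


def is_cyrillic(ch):
    """True if ch lies in the basic Cyrillic Unicode block (U+0400..U+04FF)."""
    return "\u0400" <= ch <= "\u04ff"


def fix_word(word):
    """Remap Latin lookalikes to Cyrillic when the word looks Cyrillic.

    A word is treated as Cyrillic if it has at least one Cyrillic letter
    and all remaining letters are Latin lookalikes from the table.
    """
    letters = [ch for ch in word if ch.isalpha()]
    has_cyrillic = any(is_cyrillic(ch) for ch in letters)
    all_compatible = all(
        is_cyrillic(ch) or ord(ch) in LATIN_TO_CYRILLIC for ch in letters
    )
    if has_cyrillic and all_compatible:
        return word.translate(LATIN_TO_CYRILLIC)
    return word


def fix_text(text):
    """Apply fix_word to every space-separated word of the OCR output."""
    return " ".join(fix_word(w) for w in text.split(" "))
```

A word made only of lookalikes (case 1 with no unambiguous letter in the word) is left untouched by this rule; handling that would need the document-level probability mentioned in the quoted message.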