Re: [Jocr-devels] cyrillic and other languages support plan


>  is it possible to discuss an implementation of cyrillic support into 
>  gocr.
...
>   Suppose we have 3 cases and I take cyrillic as an example, but it 
>   could be any other language:
>   
>   1) The whole document is in cyrillic
>   2) a few words are cyrillic
>   3) a few characters are cyrillic
>   
>   My consideration is that recognising a cyrillic character in case 3 
>   should lead to the presumption that case 2 and/or 1 may also be valid. 
>   But because we recognise latin letters first, we may have falsely 
>   recognised cyrillic [es] 'c' as latin [si] c and cyrillic [a] a as 
>   latin [a] a, because they look the same. For correct output, 
>   however, the correct code page (unicode range) should be set. Thus 
>   setting a language probability on the word and document level seems 
>   a good idea to me (maybe maintain a character frequency list).

I do not think such questions are a priority at the current state of the 
program. For now, post-processing the output with sed would do a 
sufficient job.
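
The kind of substitution meant here can be sketched roughly as follows. 
This is only an illustration in Python, not part of gocr: the names 
LOOKALIKES, fix_word and fix_text are made up, the lookalike table is 
partial, and the rule "remap a word only if it already contains a 
cyrillic letter" is an assumption taken from the word-level idea above.

```python
# Hypothetical post-processing sketch; not part of gocr.
# Partial table of latin letters that look like cyrillic ones.
LOOKALIKES = str.maketrans({
    "a": "\u0430", "c": "\u0441", "e": "\u0435", "o": "\u043e",
    "p": "\u0440", "x": "\u0445", "y": "\u0443",
    "A": "\u0410", "B": "\u0412", "C": "\u0421", "E": "\u0415",
    "H": "\u041d", "K": "\u041a", "M": "\u041c", "O": "\u041e",
    "P": "\u0420", "T": "\u0422", "X": "\u0425",
})


def is_cyrillic(ch):
    """True if ch lies in the basic Cyrillic block U+0400..U+04FF."""
    return "\u0400" <= ch <= "\u04ff"


def fix_word(word):
    """Remap latin lookalikes, but only in words that already
    contain at least one unambiguous cyrillic letter."""
    if any(is_cyrillic(ch) for ch in word):
        return word.translate(LOOKALIKES)
    return word


def fix_text(text):
    """Apply fix_word to every space-separated word."""
    return " ".join(fix_word(w) for w in text.split(" "))
```

A word made only of lookalikes (for example "coca") stays latin under 
this rule; deciding such cases would need the document-level 
probability mentioned above.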

>  I wrote a script to create a cyrillic db (create_db) and the needed 
>  header files. But right now I'm too busy to work on it.

The db part of gocr is badly written. I did not think much about it. The 
pixel-based algorithm will be replaced by a vector-based algorithm.

>  I think a strategy should be set to support at least all of the 
>  European languages, which are not far from each other and are going 
>  to be used pretty widely, and mixed, in the future.
>  
>  what do you think?

My strategy is to lift the recognition of latin chars to an acceptable 
level before adding new chars or languages.
db support for other languages is the maximum I can do at 
the moment.

Joerg.