[Jocr-devels] cyrillic and other languages support plan
From: Emanoil K. <del...@ya...> - 2006-09-05 23:37:03
Hi,

is it possible to discuss an implementation of Cyrillic support in gocr? I have come to the conclusion that it is NOT that easy to implement this piece of code (it needs more than just tests), because I think the following should be considered. Suppose we have 3 cases. I take Cyrillic as an example, but it could be any other language:

1) The whole document is in Cyrillic
2) A few words are Cyrillic
3) A few characters are Cyrillic

My thinking is that recognising a Cyrillic character in case 3 should lead to the presumption that case 2 and/or case 1 may also hold. But because we recognise Latin letters first, we may have falsely recognised Cyrillic 'с' [es] as Latin 'c' [si] and Cyrillic 'а' [a] as Latin 'a' [a], since the glyphs look the same. For correct output, however, the correct code page (Unicode range) must be set. So setting a language probability at word and document level seems a good idea to me (maybe by maintaining a character frequency list).

A language option for setting the language explicitly, besides locales, is also a very good idea. If I knew that I was parsing Cyrillic text, I could tell gocr so, and my tests should take precedence over the Latin ones.

I think solving this problem should also make it easier to add other languages to gocr, and possibly text language identification, which is another topic.

Please have a look at the other postings on this subject at:
https://sourceforge.net/tracker/?func=detail&atid=357147&aid=664374&group_id=7147

I wrote a script to create a Cyrillic db (create_db) and the needed header files, but right now I'm too busy to work on it. I think a strategy should be set to support at least all of the European languages, which are not far from each other and are going to be used pretty widely, and mixed, in the future.

What do you think? And thank you for your patience, but I saw that there was a discussion on the list this evening :-)

+ Deloptes ←:→ + penguin friendly
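To make the word-level and document-level idea concrete, here is a minimal sketch in Python. It is not gocr code: every name in it (`LATIN_TO_CYRILLIC`, `document_prior`, `resolve_word`, `resolve_text`, the `forced_lang` option) is hypothetical, and the lookalike table is a small illustrative subset. It assumes the recogniser has already emitted text in which lookalike glyphs were read as Latin, and it then re-labels them as Cyrillic when the word, the document, or an explicit language option suggests so.

```python
# Hypothetical post-processing sketch; none of these names exist in gocr.

# Latin letters whose glyphs look the same as a Cyrillic letter
# (illustrative subset, not a complete confusables table).
LATIN_TO_CYRILLIC = {
    "a": "\u0430", "c": "\u0441", "e": "\u0435", "o": "\u043e",
    "p": "\u0440", "x": "\u0445", "y": "\u0443",
    "A": "\u0410", "B": "\u0412", "C": "\u0421", "E": "\u0415",
    "H": "\u041d", "K": "\u041a", "M": "\u041c", "O": "\u041e",
    "P": "\u0420", "T": "\u0422", "X": "\u0425",
}


def is_cyrillic(ch):
    """True if ch lies in the basic Cyrillic block U+0400..U+04FF."""
    return "\u0400" <= ch <= "\u04ff"


def is_plain_latin(ch):
    """True for an ASCII letter that has no Cyrillic lookalike in the table."""
    return ch.isascii() and ch.isalpha() and ch not in LATIN_TO_CYRILLIC


def document_prior(words):
    """Fraction of unambiguous letters in the document that are Cyrillic."""
    cyr = lat = 0
    for word in words:
        for ch in word:
            if is_cyrillic(ch):
                cyr += 1
            elif is_plain_latin(ch):
                lat += 1
    total = cyr + lat
    return cyr / total if total else 0.0


def resolve_word(word, prior, forced_lang=None):
    """Re-label lookalike Latin letters as Cyrillic when the evidence says so."""
    if any(is_plain_latin(ch) for ch in word):
        return word  # an unambiguous Latin letter is present: leave untouched
    word_is_cyrillic = any(is_cyrillic(ch) for ch in word)    # word-level evidence
    doc_is_cyrillic = forced_lang is None and prior > 0.5     # document-level evidence
    if word_is_cyrillic or forced_lang == "cyrillic" or doc_is_cyrillic:
        return "".join(LATIN_TO_CYRILLIC.get(ch, ch) for ch in word)
    return word


def resolve_text(text, forced_lang=None):
    """Apply the word-level rule to whitespace-separated words of a document."""
    words = text.split()
    prior = document_prior(words)
    return " ".join(resolve_word(w, prior, forced_lang) for w in words)
```

A word that mixes real Cyrillic letters with lookalikes is converted as a whole, a word made only of lookalikes follows the document prior or the explicit option, and a word containing any unambiguous Latin letter is left alone. The 0.5 threshold is an arbitrary placeholder; a character frequency list, as suggested above, could replace it.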