Thread: [Jocr-devels] Is the C++ transition still in the pipeline?
From: Rupert S. <rup...@li...> - 2006-09-04 21:06:55
Hello,

I'm new to the GOCR project, but I've been using the end product quite a bit (and getting annoyed because no open-source OCR can handle the sort of fonts found in academic journals, but that's beside the point). I downloaded the CVS code the other night and, forgetting to look in the mailing list first, read through the code and came to conclusions similar to Wade's, although admittedly not in quite so much detail.

I'm now really excited, seeing how enthusiastic you all are about porting across to C++, and would love to help, partly to get away from the C# M$ horribleness I'm doing in my day job at the moment! I find the project really exciting, and my belief is that if we can re-engineer the existing code base to be easier to work on, many more people will come and contribute to what is, let's face it, a pretty cool area of application design. Beats writing web applications any day!

Well, I'd love to see a status report - could the C++ semi-port go into a separate branch in CVS, maybe?

Looking forward to hearing more,
Rupert
From: Joerg S. <Joe...@UR...> - 2006-09-05 08:26:11
Hi,

Conversion to C++ is not in my pipeline. I am not against C++ in general; it's more that my time is limited and I don't want to spend a lot of it rearranging code and fixing new bugs (am I perhaps too old?).

I also see it as a problem that object-oriented code needs a careful design. Gocr isn't designed that way, because I still don't know what a good design would be; I am still learning (by doing) how OCR could be made better. If we change to C++ we have to commit to one design, and I think it is more difficult to throw it away if it turns out to have been chosen badly. Also, I don't believe that C++ code would attract more people contributing better algorithms than C code does. I think bad OCR is a problem of missing good concepts and clever algorithms more than of bad OO design. I also have less experience hunting bugs in OO code. Those are my main reasons against C++. (B.t.w.: why isn't the Linux kernel written in C++?)

So I think it is simply too early (especially for me) to make such a one-way decision for C++. But it's open source code; if you want to have a C++ version you can do it, but probably without me. I really have no time for it as long as my child is young and needs 90% of my attention after work. It's a bit similar to the libgocr part, where I did not find the time to get involved. The student days when I could spend 12 hours per day programming are unfortunately gone.

I have no problem making the code C++ compatible (removing C++ warnings) or creating C-compatible C++ interfaces to help a bit. I also want to add more explanations and comments to the code for easier understanding. That's all I am able to do.

Regards,
Joerg.
From: Rupert S. <rup...@li...> - 2006-09-05 08:55:07
Joerg Schulenburg wrote:
> Conversion to C++ is not in my pipeline. I am not against C++ in
> general; it's more that my time is limited and I don't want to spend a
> lot of it rearranging code and fixing new bugs (am I perhaps too old?).
> I also see it as a problem that object-oriented code needs a careful
> design. Gocr isn't designed that way, because I still don't know what a
> good design would be; I am still learning (by doing) how OCR could be
> made better. If we change to C++ we have to commit to one design, and I
> think it is more difficult to throw it away if it turns out to have
> been chosen badly. Also, I don't believe that C++ code would attract
> more people contributing better algorithms than C code does. I think
> bad OCR is a problem of missing good concepts and clever algorithms
> more than of bad OO design. I also have less experience hunting bugs
> in OO code. Those are my main reasons against C++.

OK. That makes sense, I suppose.

> (B.t.w.: why isn't the Linux kernel written in C++?)

Linus being reactive? Also there's a huge body of code there, which is _very_ well organised. I also think that C++ can hurt performance if one makes a small mistake in the coding, and that sort of sensitivity could be disastrous for the kernel, maybe?

> So I think it is simply too early (especially for me) to make such a
> one-way decision for C++. But it's open source code; if you want to
> have a C++ version you can do it, but probably without me. I really
> have no time for it as long as my child is young and needs 90% of my
> attention after work. It's a bit similar to the libgocr part, where I
> did not find the time to get involved. The student days when I could
> spend 12 hours per day programming are unfortunately gone.

I also have no wish to go it alone - I'd like to contribute to the project, not detract from it! However, would you be interested in, for example, me refactoring the pnm code, which seems to be a little buggy in places?

If the problem is purely not wanting to make such a huge shift, it might still be worth switching across the more modular parts at least - we can always keep the interface in the header files identical by using extern "C" functions, and it might make things simpler? The only question is whether you are willing to have a compile-time dependency on a C++ compiler.

> I have no problem making the code C++ compatible (removing C++
> warnings) or creating C-compatible C++ interfaces to help a bit. I
> also want to add more explanations and comments to the code for easier
> understanding. That's all I am able to do.

Well, I don't want to start coding if you won't use the result, but I could always try porting across a self-contained bit and submit the (fully documented?) .cc alternative?

Rupert
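(For illustration only: a minimal sketch of the kind of extern "C" boundary being proposed here, so the rest of gocr could keep calling plain C functions while the implementation behind them is C++. The file, type, and function names below are invented for this example; they are not the actual gocr or libgocr API.)

    /* pnm_c_api.h -- hypothetical C-visible header for a C++ pnm reader */
    #ifndef PNM_C_API_H
    #define PNM_C_API_H

    #ifdef __cplusplus
    extern "C" {
    #endif

    /* Opaque handle: C callers never see the C++ internals behind it. */
    typedef struct pnm_image pnm_image;

    pnm_image *pnm_load(const char *filename);   /* returns NULL on failure */
    int        pnm_width(const pnm_image *img);
    int        pnm_height(const pnm_image *img);
    void       pnm_free(pnm_image *img);

    #ifdef __cplusplus
    }
    #endif

    #endif /* PNM_C_API_H */

    // pnm_c_api.cc -- C++ implementation behind the C header above
    #include "pnm_c_api.h"
    #include <vector>

    struct pnm_image {                 // matches the opaque typedef in the header
        int width = 0, height = 0;
        std::vector<unsigned char> pixels;
    };

    extern "C" pnm_image *pnm_load(const char *filename) {
        (void)filename;                // real parsing of the PNM file would go here
        return new pnm_image();        // caller releases it with pnm_free()
    }

    extern "C" int  pnm_width(const pnm_image *img)  { return img ? img->width  : 0; }
    extern "C" int  pnm_height(const pnm_image *img) { return img ? img->height : 0; }
    extern "C" void pnm_free(pnm_image *img)         { delete img; }

Existing C callers would only include pnm_c_api.h and link against the C++-compiled object file; nothing changes for them except the build-time dependency on a C++ compiler.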
From: Bruno B. G. <br...@gm...> - 2006-09-05 14:15:39
Rupert Swarbrick wrote:
>> But it's open source code; if you want to have a C++ version you can
>> do it, but probably without me. I really have no time for it as long
>> as my child is young and needs 90% of my attention after work. It's a
>> bit similar to the libgocr part, where I did not find the time to get
>> involved. The student days when I could spend 12 hours per day
>> programming are unfortunately gone.
>
> I also have no wish to go it alone - I'd like to contribute to the
> project, not detract from it! However, would you be interested in, for
> example, me refactoring the pnm code, which seems to be a little buggy
> in places?
>
> If the problem is purely not wanting to make such a huge shift, it
> might still be worth switching across the more modular parts at least -
> we can always keep the interface in the header files identical by using
> extern "C" functions, and it might make things simpler? The only
> question is whether you are willing to have a compile-time dependency
> on a C++ compiler.

Well, the project took off (thanks to Wade), but it has stalled by now because he doesn't have time. I'm interested, and I'll probably find some time in the next few months (this year) to help.

The new project, called Conjecture, is a framework for OCR and integrates with gocr as gocr is right now. So we are not reinventing the wheel or losing any of the remarkable work Joerg has done so far, but only building on it and making it easier to write new engines with different algorithms.

The code and the documentation (very extensive; Wade was very thorough with it) are at:

http://www.holst.ca/Conjecture/ (Wade's site, seems to be offline)
http://www.corollarium.com/Conjecture/ (a mirror, online)

There are plans to move it to sf.net.

>> I have no problem making the code C++ compatible (removing C++
>> warnings) or creating C-compatible C++ interfaces to help a bit. I
>> also want to add more explanations and comments to the code for easier
>> understanding. That's all I am able to do.
>
> Well, I don't want to start coding if you won't use the result, but I
> could always try porting across a self-contained bit and submit the
> (fully documented?) .cc alternative?

Take a look at the site. The SVN repository is offline because it was hosted at holst.ca, but you can get the latest version and I can send you a copy of my tree (although I think they are very similar, if not the same).

--
Bruno Barberi Gnecco <brunobg_at_users.sourceforge.net>
Life is like an onion: you peel off layer after layer and then you find there is nothing in it. -- James Huneker
From: Rupert S. <rup...@li...> - 2006-09-05 14:23:17
Bruno Barberi Gnecco wrote:
> Take a look at the site. The SVN repository is offline because it was
> hosted at holst.ca, but you can get the latest version and I can send
> you a copy of my tree (although I think they are very similar, if not
> the same).

Thank you very much for taking the time to reply - I'll have a look this evening.

Rupert
From: Emanoil K. <del...@ya...> - 2006-09-05 23:37:03
Hi,

Is it possible to discuss an implementation of Cyrillic support in gocr? I came to the conclusion that it is NOT that easy to implement (not only tests are needed), because I think the following should be considered. Suppose we have three cases, and I take Cyrillic as an example, but it could be any other language:

1) The whole document is in Cyrillic
2) A few words are Cyrillic
3) A few characters are Cyrillic

My consideration is that recognising a Cyrillic character in case 3 should lead to the presumption that case 2 and/or 1 may also apply. But because we first recognise Latin letters, we may have falsely recognised Cyrillic [es] 'c' as Latin [si] 'c', and Cyrillic [a] 'a' as Latin [a] 'a', because there is no visual difference - yet for correct output the correct code page (Unicode range) has to be chosen. Thus, tracking a language probability at the word and document level seems a good idea to me (maybe by maintaining a character frequency list). A language option for setting the language explicitly, besides the locale, is also a very good idea: if I knew that I am parsing Cyrillic text I could tell gocr, and my tests should take precedence over the Latin ones.

I think solving this problem should also increase the ability to add other languages to gocr, and possibly text language identification, which is another topic. Please have a look at the other postings on this subject at:

https://sourceforge.net/tracker/?func=detail&atid=357147&aid=664374&group_id=7147

I wrote a script to create a Cyrillic db (create_db) and the needed header files, but right now I'm too occupied to work on it. I think a strategy should be set to support at least all of the European languages that are not far from each other and are going to be used pretty widely and mixed in the future.

What do you think? And thank you for your patience, but I saw that there is a discussion this evening on the list :-)

+ Deloptes
+ penguin friendly
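(Again purely for illustration: a rough C++ sketch of the word-level code-page decision described above. None of this is existing gocr code, the glyph classification is deliberately simplified, and all names are invented for the example.)

    // Sketch: decide per word whether ambiguous glyphs should be emitted as
    // Latin or Cyrillic, based on the unambiguous glyphs around them.
    #include <string>
    #include <vector>

    enum class Script { Latin, Cyrillic, Ambiguous };

    struct Glyph {
        char32_t latin_guess;    // best match when read as Latin
        char32_t cyrillic_guess; // best match when read as Cyrillic
        Script   script;         // Ambiguous for shapes like 'a', 'c', 'o', 'e'
    };

    // Majority vote over the unambiguous glyphs of one word; ties fall back
    // to a document-level prior.
    Script decide_word_script(const std::vector<Glyph> &word, Script doc_prior) {
        int latin = 0, cyrillic = 0;
        for (const Glyph &g : word) {
            if (g.script == Script::Latin)         ++latin;
            else if (g.script == Script::Cyrillic) ++cyrillic;
        }
        if (latin > cyrillic)    return Script::Latin;
        if (cyrillic > latin)    return Script::Cyrillic;
        return doc_prior;        // whole word ambiguous: trust the document level
    }

    // Emit each glyph in the code page chosen for its word.
    std::u32string emit_word(const std::vector<Glyph> &word, Script doc_prior) {
        Script s = decide_word_script(word, doc_prior);
        std::u32string out;
        for (const Glyph &g : word)
            out += (s == Script::Cyrillic) ? g.cyrillic_guess : g.latin_guess;
        return out;
    }

The document-level prior could be derived the same way over all words (or from a character frequency list), or set directly by the explicit language option suggested above.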
From: Joerg <Joe...@UR...> - 2006-09-07 21:45:56
> Is it possible to discuss an implementation of Cyrillic support in
> gocr? ...
> Suppose we have three cases, and I take Cyrillic as an example, but it
> could be any other language:
>
> 1) The whole document is in Cyrillic
> 2) A few words are Cyrillic
> 3) A few characters are Cyrillic
>
> My consideration is that recognising a Cyrillic character in case 3
> should lead to the presumption that case 2 and/or 1 may also apply.
> But because we first recognise Latin letters, we may have falsely
> recognised Cyrillic [es] 'c' as Latin [si] 'c', and Cyrillic [a] 'a' as
> Latin [a] 'a', because there is no visual difference - yet for correct
> output the correct code page (Unicode range) has to be chosen. Thus,
> tracking a language probability at the word and document level seems a
> good idea to me (maybe by maintaining a character frequency list).

I think such questions are not a hot topic at the current state of the program; sed would do a sufficient job.

> I wrote a script to create a Cyrillic db (create_db) and the needed
> header files, but right now I'm too occupied to work on it.

The db part of gocr is badly written; I did not think much about it. The pixel-based algorithm will be replaced by a vector-based one.

> I think a strategy should be set to support at least all of the
> European languages that are not far from each other and are going to be
> used pretty widely and mixed in the future.
>
> What do you think?

My strategy is to lift the recognition of Latin characters to an acceptable level before adding new characters or languages. Db support for other languages is the maximum I can do at the moment.

Joerg.