Thread: [Jocr-devels] Is the C++ transition still in the pipeline?
From: Rupert S. <rup...@li...> - 2006-09-04 21:06:55
Hello,

I'm new to the GOCR project, but I've been using the end product quite a bit (and getting annoyed because no open-source OCR can handle the sort of fonts found in academic journals, but that's beside the point). I downloaded the CVS code the other night and, forgetting to look in the mailing list first, read through the code and came to conclusions similar to Wade's, although admittedly not in quite so much detail.

I'm now really excited, seeing how enthusiastic you all are about porting across to C++, and would love to help, partly to get away from the C# M$ horribleness I'm doing in my day job at the moment! I find the project really exciting, and my belief is that if we can re-engineer the existing code base to be easier to work on, many more people will come and contribute to what is, let's face it, a pretty cool area of application design. Beats writing web applications any day!

Well, I'd love to see a status report - could the C++ semi-port go into a separate branch in CVS, maybe?

Looking forward to hearing more,
Rupert
From: Joerg S. <Joe...@UR...> - 2006-09-05 08:26:11
Hi,

Conversion to C++ is not in my pipeline. I am not against C++ in general; it's more that my time is limited and I don't want to spend a lot of it rearranging code and fixing new bugs (am I perhaps too old?).

I also see it as a problem that object-oriented code needs a careful design. Gocr isn't designed that way, because I still don't know what a good design would be; I am still learning (by doing) how OCR could be made better. If we change to C++ we have to commit to one design, and I think it is more difficult to throw it away if it turns out to have been chosen badly. Also, I don't believe that C++ code would attract more people contributing better algorithms than C code does. I think bad OCR is a problem of missing good concepts and clever algorithms more than of bad OO design. I also have less experience hunting bugs in OO code. Those are my main reasons against C++. (B.t.w.: why isn't the Linux kernel written in C++?)

So I think it is simply too early (especially for me) to make such a one-way decision for C++. But it's open source code; if you want to have a C++ version you can do it, but probably without me. I really have no time for it as long as my child is young and needs 90% of my attention after work. It's a bit similar to the libgocr part, where I did not find the time to get involved. The student days when I could spend 12 hours per day programming are unfortunately gone.

I have no problem making the code C++ compatible (removing C++ warnings) or creating C-compatible C++ interfaces to help a bit. I also want to add more explanations and comments to the code for easier understanding. That's all I am able to do.

Regards,
Joerg.
From: Rupert S. <rup...@li...> - 2006-09-05 08:55:07
Joerg Schulenburg wrote:
> Conversion to C++ is not in my pipeline. I am not against C++ in
> general; it's more that my time is limited and I don't want to spend a
> lot of it rearranging code and fixing new bugs (am I perhaps too old?).
> I also see it as a problem that object-oriented code needs a careful
> design. Gocr isn't designed that way, because I still don't know what a
> good design would be; I am still learning (by doing) how OCR could be
> made better. If we change to C++ we have to commit to one design, and I
> think it is more difficult to throw it away if it turns out to have
> been chosen badly. Also, I don't believe that C++ code would attract
> more people contributing better algorithms than C code does. I think
> bad OCR is a problem of missing good concepts and clever algorithms
> more than of bad OO design. I also have less experience hunting bugs
> in OO code. Those are my main reasons against C++.

OK. That makes sense, I suppose.

> (B.t.w.: why isn't the Linux kernel written in C++?)

Linus being reactive? Also there's a huge body of code there, which is _very_ well organised. I also think that C++ can hurt performance if one makes a small mistake in the coding, and that sort of sensitivity could be disastrous for the kernel, maybe?

> So I think it is simply too early (especially for me) to make such a
> one-way decision for C++. But it's open source code; if you want to
> have a C++ version you can do it, but probably without me. I really
> have no time for it as long as my child is young and needs 90% of my
> attention after work. It's a bit similar to the libgocr part, where I
> did not find the time to get involved. The student days when I could
> spend 12 hours per day programming are unfortunately gone.

I also have no wish to go it alone - I'd like to contribute to the project, not detract from it! However, would you be interested in, for example, me refactoring the pnm code, which seems to be a little buggy in places?

If the problem is purely not wanting to make such a huge shift, it might still be worth switching across the more modular parts at least - we can always keep the interface in the header files identical by using extern "C" functions, and it might make things simpler? The only question is whether you are willing to have a compile-time dependency on a C++ compiler.

> I have no problem making the code C++ compatible (removing C++
> warnings) or creating C-compatible C++ interfaces to help a bit. I
> also want to add more explanations and comments to the code for easier
> understanding. That's all I am able to do.

Well, I don't want to start coding if you won't use the result, but I could always try porting across a self-contained bit and submit the (fully documented?) .cc alternative?

Rupert
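(For illustration only: a minimal sketch of the kind of extern "C" boundary being proposed here, so the rest of gocr could keep calling plain C functions while the implementation behind them is C++. The file, type, and function names below are invented for this example; they are not the actual gocr or libgocr API.)

    /* pnm_c_api.h -- hypothetical C-visible header for a C++ pnm reader */
    #ifndef PNM_C_API_H
    #define PNM_C_API_H

    #ifdef __cplusplus
    extern "C" {
    #endif

    /* Opaque handle: C callers never see the C++ internals behind it. */
    typedef struct pnm_image pnm_image;

    pnm_image *pnm_load(const char *filename);   /* returns NULL on failure */
    int        pnm_width(const pnm_image *img);
    int        pnm_height(const pnm_image *img);
    void       pnm_free(pnm_image *img);

    #ifdef __cplusplus
    }
    #endif

    #endif /* PNM_C_API_H */

    // pnm_c_api.cc -- C++ implementation behind the C header above
    #include "pnm_c_api.h"
    #include <vector>

    struct pnm_image {                 // matches the opaque typedef in the header
        int width = 0, height = 0;
        std::vector<unsigned char> pixels;
    };

    extern "C" pnm_image *pnm_load(const char *filename) {
        (void)filename;                // real parsing of the PNM file would go here
        return new pnm_image();        // caller releases it with pnm_free()
    }

    extern "C" int  pnm_width(const pnm_image *img)  { return img ? img->width  : 0; }
    extern "C" int  pnm_height(const pnm_image *img) { return img ? img->height : 0; }
    extern "C" void pnm_free(pnm_image *img)         { delete img; }

Existing C callers would only include pnm_c_api.h and link against the C++-compiled object file; nothing changes for them except the build-time dependency on a C++ compiler.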
From: Bruno B. G. <br...@gm...> - 2006-09-05 14:15:39
Rupert Swarbrick wrote:
>> But it's open source code; if you want to have a C++ version you can
>> do it, but probably without me. I really have no time for it as long
>> as my child is young and needs 90% of my attention after work. It's a
>> bit similar to the libgocr part, where I did not find the time to get
>> involved. The student days when I could spend 12 hours per day
>> programming are unfortunately gone.
>
> I also have no wish to go it alone - I'd like to contribute to the
> project, not detract from it! However, would you be interested in, for
> example, me refactoring the pnm code, which seems to be a little buggy
> in places?
>
> If the problem is purely not wanting to make such a huge shift, it
> might still be worth switching across the more modular parts at least -
> we can always keep the interface in the header files identical by using
> extern "C" functions, and it might make things simpler? The only
> question is whether you are willing to have a compile-time dependency
> on a C++ compiler.

Well, the project took off (thanks to Wade), but it has stalled by now because he doesn't have time. I'm interested, and I'll probably find some time in the next few months (this year) to help.

The new project, called Conjecture, is a framework for OCR and integrates with gocr as gocr is right now. So we are not reinventing the wheel or losing any of the remarkable work Joerg has done so far, but only building on it and making it easier to write new engines with different algorithms.

The code and the documentation (very extensive; Wade was very thorough with it) are at:

http://www.holst.ca/Conjecture/ (Wade's site, seems to be offline)
http://www.corollarium.com/Conjecture/ (a mirror, online)

There are plans to move it to sf.net.

>> I have no problem making the code C++ compatible (removing C++
>> warnings) or creating C-compatible C++ interfaces to help a bit. I
>> also want to add more explanations and comments to the code for easier
>> understanding. That's all I am able to do.
>
> Well, I don't want to start coding if you won't use the result, but I
> could always try porting across a self-contained bit and submit the
> (fully documented?) .cc alternative?

Take a look at the site. The SVN repository is offline because it was hosted at holst.ca, but you can get the latest version and I can send you a copy of my tree (although I think they are very similar, if not the same).

--
Bruno Barberi Gnecco <brunobg_at_users.sourceforge.net>
Life is like an onion: you peel off layer after layer and then you find there is nothing in it. -- James Huneker
From: Rupert S. <rup...@li...> - 2006-09-05 14:23:17
Bruno Barberi Gnecco wrote:
> Take a look at the site. The SVN repository is offline because it was
> hosted at holst.ca, but you can get the latest version and I can send
> you a copy of my tree (although I think they are very similar, if not
> the same).

Thank you very much for taking the time to reply - I'll have a look this evening.

Rupert
From: Emanoil K. <del...@ya...> - 2006-09-05 23:37:03
Hi,

Is it possible to discuss an implementation of Cyrillic support in gocr? I came to the conclusion that it is NOT that easy to implement (not only tests are needed), because I think the following should be considered. Suppose we have three cases, and I take Cyrillic as an example, but it could be any other language:

1) The whole document is in Cyrillic
2) A few words are Cyrillic
3) A few characters are Cyrillic

My consideration is that recognising a Cyrillic character in case 3 should lead to the presumption that case 2 and/or 1 may also apply. But because we first recognise Latin letters, we may have falsely recognised Cyrillic [es] 'c' as Latin [si] 'c', and Cyrillic [a] 'a' as Latin [a] 'a', because there is no visual difference - yet for correct output the correct code page (Unicode range) has to be chosen. Thus, tracking a language probability at the word and document level seems a good idea to me (maybe by maintaining a character frequency list). A language option for setting the language explicitly, besides the locale, is also a very good idea: if I knew that I am parsing Cyrillic text I could tell gocr, and my tests should take precedence over the Latin ones.

I think solving this problem should also increase the ability to add other languages to gocr, and possibly text language identification, which is another topic. Please have a look at the other postings on this subject at:

https://sourceforge.net/tracker/?func=detail&atid=357147&aid=664374&group_id=7147

I wrote a script to create a Cyrillic db (create_db) and the needed header files, but right now I'm too occupied to work on it. I think a strategy should be set to support at least all of the European languages that are not far from each other and are going to be used pretty widely and mixed in the future.

What do you think? And thank you for your patience, but I saw that there is a discussion this evening on the list :-)

+ Deloptes
+ penguin friendly
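(Again purely for illustration: a rough C++ sketch of the word-level code-page decision described above. None of this is existing gocr code, the glyph classification is deliberately simplified, and all names are invented for the example.)

    // Sketch: decide per word whether ambiguous glyphs should be emitted as
    // Latin or Cyrillic, based on the unambiguous glyphs around them.
    #include <string>
    #include <vector>

    enum class Script { Latin, Cyrillic, Ambiguous };

    struct Glyph {
        char32_t latin_guess;    // best match when read as Latin
        char32_t cyrillic_guess; // best match when read as Cyrillic
        Script   script;         // Ambiguous for shapes like 'a', 'c', 'o', 'e'
    };

    // Majority vote over the unambiguous glyphs of one word; ties fall back
    // to a document-level prior.
    Script decide_word_script(const std::vector<Glyph> &word, Script doc_prior) {
        int latin = 0, cyrillic = 0;
        for (const Glyph &g : word) {
            if (g.script == Script::Latin)         ++latin;
            else if (g.script == Script::Cyrillic) ++cyrillic;
        }
        if (latin > cyrillic)    return Script::Latin;
        if (cyrillic > latin)    return Script::Cyrillic;
        return doc_prior;        // whole word ambiguous: trust the document level
    }

    // Emit each glyph in the code page chosen for its word.
    std::u32string emit_word(const std::vector<Glyph> &word, Script doc_prior) {
        Script s = decide_word_script(word, doc_prior);
        std::u32string out;
        for (const Glyph &g : word)
            out += (s == Script::Cyrillic) ? g.cyrillic_guess : g.latin_guess;
        return out;
    }

The document-level prior could be derived the same way over all words (or from a character frequency list), or set directly by the explicit language option suggested above.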
From: Joerg <Joe...@UR...> - 2006-09-07 21:45:56
> Is it possible to discuss an implementation of Cyrillic support in
> gocr? ...
> Suppose we have three cases, and I take Cyrillic as an example, but it
> could be any other language:
>
> 1) The whole document is in Cyrillic
> 2) A few words are Cyrillic
> 3) A few characters are Cyrillic
>
> My consideration is that recognising a Cyrillic character in case 3
> should lead to the presumption that case 2 and/or 1 may also apply.
> But because we first recognise Latin letters, we may have falsely
> recognised Cyrillic [es] 'c' as Latin [si] 'c', and Cyrillic [a] 'a' as
> Latin [a] 'a', because there is no visual difference - yet for correct
> output the correct code page (Unicode range) has to be chosen. Thus,
> tracking a language probability at the word and document level seems a
> good idea to me (maybe by maintaining a character frequency list).

I think such questions are not a hot topic at the current state of the program; sed would do a sufficient job.

> I wrote a script to create a Cyrillic db (create_db) and the needed
> header files, but right now I'm too occupied to work on it.

The db part of gocr is badly written; I did not think much about it. The pixel-based algorithm will be replaced by a vector-based one.

> I think a strategy should be set to support at least all of the
> European languages that are not far from each other and are going to be
> used pretty widely and mixed in the future.
>
> What do you think?

My strategy is to lift the recognition of Latin characters to an acceptable level before adding new characters or languages. Db support for other languages is the maximum I can do at the moment.

Joerg.