I would like to use JavaOCR as a library in my Java code to support some OCR-related functionality. The only information I could find on how to do this came from http://www.roncemer.com/software-development/java-ocr; however, the "Using the Code in Your Program" section there seems to be outdated. From looking at the code of the JavaOCR GUI, I understand that I need to use an OCRScanner, which I will need to train before invoking its scan method, but I'm not at all certain what the various arguments of the scan method do. Is there somewhere I can find more detailed documentation on this? Is it possible to perform OCR on images without having to train the OCRScanner first?
Anonymous
Hi Christina,
Currently there is no documentation other than the original article on my website at http://www.roncemer.com , the JavaDoc API documentation (which should be built automatically whenever you build the project), and whatever documentation may have been contributed by the other two developers who have been working on the project since I released it. But we'd welcome any contributions you'd like to make in that area. If you're interested, we could add you as a contributor to the project. Just let me know.
Thanks!
Ron
Hi Ron.
Thanks for your reply. It seems I'll have to make do with the API documentation for the time being. However, I would be more than grateful if you or one of the other developers could answer my question whether it is possible to perform OCR on images without having to train the OCRScanner every time I create a new OCRScanner instance.
If during my experiments with JavaOCR I get the time to write something that could be used as documentation I will let you know.
Thanks again,
Christina
Hi Christina,
You're not bothering me at all! I'm happy to see people put the code to use.
The algorithm is a very simple image-matching algorithm using a least-mean-square-error formula to score each training image's resemblance to the character being decoded. So without having the training images in memory, it won't be able to recognize any characters.
It's not really so much a "training" process as just the process of loading up the reference (training) images into memory so it has something to compare against.
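To make the matching idea above concrete, here is a minimal, self-contained sketch of mean-square-error matching against reference images held in memory. It is not the JavaOCR implementation: the class and method names (MseMatchSketch, meanSquaredError, bestMatch) are hypothetical, and it assumes every glyph has already been scaled to the same size and flattened into a grayscale array.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only; names and data layout are assumptions, not JavaOCR's API.
class MseMatchSketch {

    /** Mean squared error between two equally sized grayscale arrays (values 0..255). */
    static double meanSquaredError(int[] a, int[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("size mismatch");
        }
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum / a.length;
    }

    /** Returns the character whose reference image has the lowest error against the glyph. */
    static char bestMatch(int[] glyph, Map<Character, int[]> references) {
        if (references.isEmpty()) {
            // Without reference images in memory there is nothing to compare against.
            throw new IllegalStateException("no reference images loaded");
        }
        char best = 0;
        double bestErr = Double.MAX_VALUE;
        for (Map.Entry<Character, int[]> e : references.entrySet()) {
            double err = meanSquaredError(glyph, e.getValue());
            if (err < bestErr) {
                bestErr = err;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // "Training" here is just loading 3x3 reference images into a map.
        Map<Character, int[]> refs = new LinkedHashMap<>();
        refs.put('I', new int[] {255, 0, 255, 255, 0, 255, 255, 0, 255});
        refs.put('L', new int[] {0, 255, 255, 0, 255, 255, 0, 0, 0});

        int[] noisyI = {255, 10, 255, 240, 0, 255, 255, 0, 250};
        System.out.println(bestMatch(noisyI, refs)); // prints I
    }
}
```

With an empty reference map the sketch throws, which mirrors the point that recognition cannot work until the reference images are loaded.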
This OCR engine is font-specific, BTW. So for each font you want it to recognize, you need to have training images for all of the characters you want it to recognize in that font.
Hope that helps!
Ron
Hi Ron.
Thanks for your input. If I have any more questions or comments while working with JavaOCR I'll let you know.
Christina
Hello Ron.
I've started using JavaOCR with images I am creating myself, and it seems to be having a hard time recognising how many characters are in each training image. I keep getting error messages like this when loading the training images:
Expected to decode 26 characters but actually decoded 29 characters in training
The method I am using to create the training images is nothing really fancy. I just type the characters I want in MS Word and take a screenshot of the area containing them, which I then save as an image.
Is there something in particular I should be doing when creating my images (e.g. using a specific font size, a specific number of spaces between the characters, or a specific image format)? How do you think I could get over this problem?
Thank you,
Christina
And one more question: Are there any plans to make JavaOCR available through Maven?
Hi Christina,
I recommend looking at the minCharBreakWidthAsFractionOfRowHeight attribute of the DocumentScanner class. There are accessor methods to get and set this value. It defaults to 0.05. If you increase it a little, it may help with your font. The algorithm which determines where one character ends and another begins is very simplistic. It sounds as if the scanner is finding more character breaks than are actually there.
Also, be careful not to include any "dust" or non-white pixels in your image that aren't part of the characters you're trying to get it to recognize.
The DocumentScanner class is used both to load training images and to scan documents, so the same algorithms are used for breaking apart the characters in both the training images and the documents themselves. That way, it is consistent in how it handles any specific font.
Another thing that may help is to fiddle with whiteThreshold, which ranges from 0 to 255 inclusive and defaults to 127. There are also accessor methods for this attribute.
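As a rough illustration of how these two settings can change the number of characters found, here is a self-contained sketch of a simple column-based splitter. It is not the DocumentScanner code: the class and method names are hypothetical, and it assumes that a pixel counts as ink when its gray value is below the white threshold and that a character break needs a run of blank columns at least (fraction x row height) wide.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; the real DocumentScanner may segment differently.
class CharBreakSketch {

    /**
     * Splits one text row into character column ranges [start, endExclusive).
     * A run of blank columns ends a character only if it is at least
     * minBreakFraction * rowHeight columns wide (never less than 1 column).
     */
    static List<int[]> segment(int[][] row, int whiteThreshold, double minBreakFraction) {
        int height = row.length;
        int width = row[0].length;
        int minBreak = Math.max(1, (int) Math.ceil(minBreakFraction * height));
        List<int[]> chars = new ArrayList<>();
        int start = -1;   // first column of the current character, -1 if none is open
        int lastInk = -1; // last column of the current character that contained ink
        for (int x = 0; x < width; x++) {
            boolean ink = false;
            for (int y = 0; y < height; y++) {
                if (row[y][x] < whiteThreshold) {
                    ink = true;
                    break;
                }
            }
            if (ink) {
                if (start < 0) {
                    start = x;
                }
                lastInk = x;
            } else if (start >= 0 && x - lastInk >= minBreak) {
                chars.add(new int[] {start, lastInk + 1});
                start = -1;
            }
        }
        if (start >= 0) {
            chars.add(new int[] {start, lastInk + 1});
        }
        return chars;
    }

    /** Builds a row image where '#' columns are black (0) and '.' columns are white (255). */
    static int[][] rowFrom(String pattern, int height) {
        int[][] row = new int[height][pattern.length()];
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < pattern.length(); x++) {
                row[y][x] = pattern.charAt(x) == '#' ? 0 : 255;
            }
        }
        return row;
    }

    public static void main(String[] args) {
        int[][] row = rowFrom("##.##...##", 8);
        // Small fraction: the 1-column gap counts as a break, so 3 characters are found.
        System.out.println(segment(row, 127, 0.05).size()); // prints 3
        // Larger fraction: the 1-column gap is ignored, so only 2 characters are found.
        System.out.println(segment(row, 127, 0.25).size()); // prints 2
    }
}
```

In this sketch, raising the fraction merges pieces that were separated by narrow gaps, which is the kind of effect to look for when a training image decodes to more characters than expected.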
SourceForge user ko5tik is working on adding Maven build capability. Sorry, I think I forgot to answer this before.
I'm not sure whether he's planning on uploading it to any Maven repository, but it does seem like keeping the current version up on that repository would be an ongoing pain for someone, unless SourceForge or the target repo can do that automatically somehow.
I know nothing about Maven, except that it's an Apache project that provides build capabilities similar to Ant, so please forgive me if I'm a little uninformed on this subject. It would probably be a good idea to send a private message to ko5tik and see if he has any plans for Maven support beyond just providing Maven build capability.
Hi all,
The Maven build is in the repository; core is separated from app and builds already.
App needs some work though. It's not clear where to upload the Maven artifacts - I can provide my private repository though.
Hi Christina,
I already have some artifacts on Maven Central - and I'm aware that it takes a long time to deploy something there. Unfortunately, javaocr is not in a state deployable to the central repos (there is still work necessary) - so the best option is to build it yourself into your local repo (core should build fine) or wait until this evening, when I can deploy it to my private repo under:
http://www.pribluda.de/m2
BTW, please check your SF mail alias, as I cannot reply to it via email.
I deployed core and parent snapshot in my private maven repository:
http://www.pribluda.de/m2
Coordinates:
<groupId>net.sourceforge.javaocr</groupId>
<artifactId>javaocr-parent</artifactId>
<packaging>pom</packaging>
<name>Java OCR Parent project</name>
<version>1.102-SNAPSHOT</version>
and:
<groupId>net.sourceforge.javaocr</groupId>
<artifactId>javaocr-core</artifactId>
<packaging>pom</packaging>
<name>Java OCR Parent project</name>
<version>1.102-SNAPSHOT</version>
You need to include core in your pom as a dependency.
Hi Konstantin.
I included the JavaOCR core in my pom in order to get it from your repository. However, you might be interested to know that I get some warnings about failed checksums when downloading JavaOCR from the repository.
Thank you,
Christina
Is this project still being developed? Are there any plans to publish newer revisions to public Maven repos (e.g. 1.1+)?
I have played with training and OCRScannerDemo. Can someone please help with any of the following?
There seems to be some deprecated code (e.g. DocumentScanner). Is there a newer OCR demo which uses non-deprecated code?
I have failed to train the letter 'H' in the Broadway font, even after tweaking some of the values mentioned above. It manages to find 2 chars, 'H' and 'I'. I think the problem is that the horizontal bar is a single pixel thick; if I paint it 2 pixels thick, it works. I'm curious which setting or value would allow a single-pixel bar so I can use the font without custom changes. A blown-up visual of the pixels is as follows:
A question about CharacterRange and the training API: rather than the API requiring a min and max char in a strict range, I'd prefer to have training images with any arbitrary set of characters, not strictly in a range, and have the API accept an array (or ordered list etc.) of the characters in the image. Has anyone considered or implemented this in javaocr? Does it make sense to do this?
eg
Hi Craig, at the moment there is no active development. The project has reached a state suitable for all developers, so we do not have active plans. As for character ranges - there is no strictly defined character range. You may use whatever set you like.
While training, you just say "this glyph is for a certain character", and the data in the matcher is updated.
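The idea of labelling glyphs one by one, with no contiguous range required, could be sketched as follows. This is not the javaocr training API: the class and method names are hypothetical, glyphs are represented as plain int arrays, and the count check is an assumption modelled on the "Expected to decode ... characters" error quoted earlier in the thread.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; javaocr's own matcher and training classes differ.
class GlyphLabelSketch {

    /**
     * Pairs segmented glyphs with an explicit, ordered list of characters.
     * The characters may be any set, in any order, and may repeat.
     */
    static Map<Character, List<int[]>> label(List<int[]> glyphs, String expectedChars) {
        if (glyphs.size() != expectedChars.length()) {
            throw new IllegalArgumentException("Expected " + expectedChars.length()
                    + " characters but segmented " + glyphs.size());
        }
        Map<Character, List<int[]>> training = new LinkedHashMap<>();
        for (int i = 0; i < glyphs.size(); i++) {
            // "This glyph is for this character": append it to that character's samples.
            training.computeIfAbsent(expectedChars.charAt(i), c -> new ArrayList<>())
                    .add(glyphs.get(i));
        }
        return training;
    }

    public static void main(String[] args) {
        List<int[]> glyphs = new ArrayList<>();
        glyphs.add(new int[] {0, 255});
        glyphs.add(new int[] {255, 0});
        glyphs.add(new int[] {0, 0});
        // A non-contiguous character set with one repeated character.
        Map<Character, List<int[]>> training = label(glyphs, "H7H");
        System.out.println(training.keySet());         // prints [H, 7]
        System.out.println(training.get('H').size());  // prints 2
    }
}
```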
The latest version of javaocr is published on Maven Central:
http://mvnrepository.com/artifact/net.sourceforge.javaocr/javaocr-core
(The "latest" package on the SF download page is misleading.)