The more general Tesseract class is not final, so you can certainly extend it to expose more functionality provided by the lower level TessAPI interface.
Extending Tesseract does not help much, as the whole method Tesseract.doOCR(int xsize, int ysize, ByteBuffer buf, Rectangle rect, int bpp) would still have to be copy-pasted. It would be nice if, for example, the initialization block were extracted into a separate function:
protected TessAPI.TessBaseAPI prepareTessAPI(int xsize, int ysize, ByteBuffer buf, Rectangle rect, int bpp) {
    TessAPI api = TessAPI.INSTANCE;
    TessAPI.TessBaseAPI handle = api.TessBaseAPICreate();
    ...
    api.TessBaseAPISetRectangle(handle, rect.x, rect.y, rect.width, rect.height);
    return handle;
}
doOCR is a simple method that encapsulates Tesseract engine initialization, processing of a single image, and shutdown. It is not efficient if you process multiple images. You can certainly override it with a more efficient algorithm in which the engine is initialized once, processes or manipulates all the images, and finally shuts down to release the resources it used.
Due to my personal work, it could be some time before I can get back on this. You're welcome to submit a patch. Thanks.
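The "initialize once, process all images, shut down" flow described above can be sketched as a template method. This is only an illustration: BatchOcr, CountingOcr, doOCRBatch and recognize are hypothetical names, not Tess4J classes or methods, and images are represented by plain strings so the example runs without the native library.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical stand-in for an OCR engine wrapper; none of these names
 * come from Tess4J. Subclasses supply the three lifecycle steps.
 */
abstract class BatchOcr {
    protected abstract void init();                        // start the engine
    protected abstract String recognize(String imageName); // OCR one image
    protected abstract void dispose();                     // release resources

    /** Initializes once, processes every image, then always disposes. */
    public final List<String> doOCRBatch(List<String> imageNames) {
        init();
        try {
            List<String> results = new ArrayList<>();
            for (String name : imageNames) {
                results.add(recognize(name));
            }
            return results;
        } finally {
            dispose(); // runs even if recognize() throws
        }
    }
}

/** Fake engine that only counts lifecycle calls, for demonstration. */
class CountingOcr extends BatchOcr {
    int initCalls;
    int disposeCalls;

    @Override protected void init() { initCalls++; }
    @Override protected String recognize(String imageName) { return "text of " + imageName; }
    @Override protected void dispose() { disposeCalls++; }
}
```

With this shape, a subclass changes only the per-image step, and the try/finally block keeps the shutdown from being skipped when one image fails.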
I am attaching my first attempt for your review. From my perspective it is a step forward because:
Batch image processing is now faster, as init() / dispose() are called only once.
A class that extends Tesseract1 can easily implement another OCR strategy, as all the needed functions are now separate blocks.
Notes:
I think that all occurrences of IIOImage can be replaced by RenderedImage with no impact, as IIOImage is used only as a wrapper for RenderedImage. ImageIO is forced to read thumbnails, which are never used.
Using System.err in the library is bad form: if the library runs inside an application server, you don't know where the output is logged (if it is logged at all). So a logger is the way out. Or propagate the exception further.
Having two approaches, Tesseract1 and Tesseract, makes no sense to me. Keeping only one would simplify development and reduce code duplication. Extension (Tesseract1) or aggregation (Tesseract): you need to choose one. I personally think that extension (Tesseract1) is more natural with respect to handle.
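As an illustration of the logging note above, here is a minimal sketch that reports a failure through java.util.logging instead of System.err. The class OcrWorker and the method recognizeQuietly are hypothetical names chosen for this example; the library's actual logging setup may differ.

```java
import java.util.concurrent.Callable;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical helper; not part of Tess4J. */
class OcrWorker {
    private static final Logger LOGGER = Logger.getLogger(OcrWorker.class.getName());

    /**
     * Runs an OCR call and returns its text. On failure the exception is
     * sent to the logger (so the host application decides where it goes)
     * and an empty string is returned.
     */
    static String recognizeQuietly(Callable<String> ocrCall) {
        try {
            return ocrCall.call();
        } catch (Exception e) {
            LOGGER.log(Level.SEVERE, "OCR failed", e);
            return "";
        }
    }
}
```

The alternative mentioned in the note, propagating the exception, would simply drop the catch block and declare the exception on the method.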
It turns out that the Tesseract class cannot be extended due to its private constructor. Tesseract1 is the extensible one here, as the necessary elements are exposed to inheriting classes.
I incorporated many of your suggestions, including logging, into the code baseline. Tesseract is maintained because the alternative direct mapping method that Tesseract1 is based on was until recently still an experimental feature for JNA.
Please help test the changes. Version 1.2 will be released soon. Thanks.
Quan
Anonymous - 2013-09-09
Could you upload .jar and .source.jar somewhere (e.g. Maven snapshots)? I will test against binaries that you will create when you make a release.
Anonymous - 2021-05-03
I know this issue is years old, but I'm wondering what the current 'best' way to get the confidences is. Like others, I am also confused by the difference between Tesseract vs Tesseract1 and TessAPI vs TessAPI1.
I see what you said about doOCR() being intended for a single image because it shuts down after processing. What is the best way to process multiple images? Is there any documentation on the best way to do this (as well as on getting the confidences)?
Thank you.
I see TessBaseAPIAllWordConfidences, which says that it returns the same number of values as that returned by GetUTF8. But TessBaseAPIGetUTF8Text returns a single string, not an array. Can you provide an example? I've read the Javadoc, but it's not always clear without an example.
Is there an efficient way to process multiple images, but one at a time, without sending them all in as an array?
TessBaseAPIAllWordConfidences() doesn't seem to work with doOCR(), because doOCR() closes everything down instead of leaving it open for the TessBaseAPIAllWordConfidences() call.
Last edit: Peter Kronenberg 2021-05-04
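To make the "same number of values" remark concrete, the sketch below pairs each word of a recognized text with one confidence value. It assumes the confidences have already been copied from the native call into a Java int[] that ends with -1 (the Tesseract C API documents the array as -1 terminated), and it assumes that splitting the text on whitespace yields the same words Tesseract counted, which may not hold for every layout. WordConfidences and pair are hypothetical names, not Tess4J API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; not part of Tess4J. */
class WordConfidences {
    /**
     * Pairs whitespace-separated words with their confidences (0 to 100).
     * Stops at the -1 terminator or when either side runs out.
     */
    static List<String> pair(String utf8Text, int[] confidences) {
        String trimmed = utf8Text.trim();
        String[] words = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        List<String> out = new ArrayList<>();
        for (int i = 0; i < words.length && i < confidences.length; i++) {
            if (confidences[i] == -1) {
                break; // terminator reached
            }
            out.add(words[i] + "=" + confidences[i]);
        }
        return out;
    }
}
```

The single string is therefore the source of the words, and the array supplies one value per word in reading order. The confidences have to be read before the engine handle is released, which is why a call made after doOCR() returns has nothing to read.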
They are supported.
http://tess4j.sourceforge.net/docs/docs-1.1/
Indeed they are available in TessAPI, but handle is deleted in doOCR(). What should be the flow then? Expected something like:
The more general Tesseract class is not final, so you can certainly extend it to expose more functionality provided by the lower-level TessAPI interface.
Extending Tesseract does not help much, as the whole method Tesseract.doOCR(int xsize, int ysize, ByteBuffer buf, Rectangle rect, int bpp) would still have to be copy-pasted. It would be nice if, for example, the initialization block were extracted into a separate function:
plus maybe another helper:
Basically, the above-mentioned doOCR() is now decomposed:
Now extending Tesseract makes sense. If you have another scenario in mind, please share a complete example.
Another note: I think that the expression
should be better turned into:
and then one doesn't need the static EMPTY_RECTANGLE.
doOCR is a simple method that encapsulates Tesseract engine initialization, processing of a single image, and shutdown. It is not efficient if you process multiple images. You can certainly override it with a more efficient algorithm in which the engine is initialized once, processes or manipulates all the images, and finally shuts down to release the resources it used.
Due to my personal work, it could be some time before I can get back on this. You're welcome to submit a patch. Thanks.
1.2-Beta attached.
Fixed with release of v1.2. Special thanks to Dmitry Katsubo for the software patch, testing, and valuable suggestions.
I just entered that last post, but I wasn't logged in.
Documentation: http://tess4j.sourceforge.net/docs/docs-4.4/
You can pass a List<IIOImage> to the doOCR method. There are other methods in the Tesseract class that return confidence values.
JNA Direct Mapping: https://github.com/java-native-access/jna/blob/master/www/DirectMapping.md
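For the List<IIOImage> suggestion, a minimal sketch of building such a list from in-memory images with the standard javax.imageio classes is shown below. ImageBatch and wrap are hypothetical names; the list would then be handed to the doOCR overload that accepts a List<IIOImage>, whose exact remaining parameters should be checked in the Javadoc linked above.

```java
import java.awt.image.BufferedImage;
import java.util.ArrayList;
import java.util.List;
import javax.imageio.IIOImage;

/** Hypothetical helper; not part of Tess4J. */
class ImageBatch {
    /** Wraps each page in an IIOImage with no thumbnails and no metadata. */
    static List<IIOImage> wrap(List<BufferedImage> pages) {
        List<IIOImage> out = new ArrayList<>();
        for (BufferedImage page : pages) {
            out.add(new IIOImage(page, null, null));
        }
        return out;
    }
}
```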
Please continue the discussion either in the Discussion section or over on the GitHub site rather than on this old, closed ticket.
Thanks.