This appears to be an issue with Tess4J 3.0.0. I am running under Windows 7 (64-bit) and Java 1.8.0_71-b15. I noticed that the files were still open because subsequent processing of the same files led to errors. I added TessDeleteResultRenderer as shown below, and things appear to work fine. I would like to see this fix applied.
private void createDocuments(String filename, TessResultRenderer renderer) throws TesseractException {
    api.TessBaseAPISetInputName(handle, filename); // for reading a UNLV zone file
    int result = api.TessBaseAPIProcessPages(handle, filename, null, 0, renderer);
    // Added: delete the renderer once processing is done so the output file is released.
    api.TessDeleteResultRenderer(renderer);
    if (result == ITessAPI.FALSE) {
        throw new TesseractException("Error during processing page.");
    }
}
Hi,
I am using Tess4J 2.0 Beta to create a searchable PDF as output. The PDF is created, but it stays locked and the file remains open until the JVM terminates. How do I force the PDF to be closed as soon as OCR completes?
Sample Code:
Last edit: Quan Nguyen 2015-01-29
The PDF creation is handled by Tesseract, so it seems that the file handle is still not released by it. The PDF code was improved in Tesseract 3.04, so you may want to try that version. If the behavior remains the same, you may want to consider submitting an issue with them.
https://code.google.com/p/tesseract-ocr/issues/list
Thanks for the response.
Where would I find DLLs for Tesseract 3.04?
They're available in the repository.
Thanks again. The behaviour is the same in Tesseract 3.04: the file handle is not released. It looks like I may have to raise an issue with Tesseract.
Raised an issue with Tesseract
https://code.google.com/p/tesseract-ocr/issues/detail?id=1410
Thank you, Panneer.
How do you determine that a file is locked or open?
I looked through Tesseract's pdfrenderer.cpp file; it has a few fclose calls, but I'm not sure the PDF would be closed by that.
Quan
My code tries to open the converted PDF after OCR; that's where I found that the PDF is locked.
Also, the PDF trailer chars (%%%%EOF\n) are appended to the output PDF only after the JVM terminates.
I also invoked the TessResultRendererEndDocument() method in TessAPI, since this method (in renderer.cpp) seemed to call EndDocumentHandler() in pdfrenderer.cpp.
However, it looks like TessPDFRenderer::EndDocumentHandler() is never invoked.
Last edit: Panneer 2015-02-01
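One way to check the missing-trailer symptom from Java is to read the tail of the output file and look for the %%EOF marker. The following is a minimal sketch using only the JDK; the class name PdfTrailerCheck, the method hasEofMarker and the 1024-byte tail size are choices made for this illustration and are not part of Tess4J or Tesseract.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;

// Illustrative helper (hypothetical name, not part of Tess4J).
final class PdfTrailerCheck {

    // How many bytes at the end of the file to inspect; an arbitrary choice for this sketch.
    private static final int TAIL_SIZE = 1024;

    /** Returns true if the last bytes of the file contain the %%EOF marker. */
    static boolean hasEofMarker(Path pdf) throws IOException {
        // Opening the file may itself throw if another handle holds it exclusively.
        try (RandomAccessFile file = new RandomAccessFile(pdf.toFile(), "r")) {
            long length = file.length();
            int tailSize = (int) Math.min(length, TAIL_SIZE);
            byte[] tail = new byte[tailSize];
            file.seek(length - tailSize);
            file.readFully(tail);
            // ISO-8859-1 maps every byte to one char, so binary content cannot break decoding.
            String text = new String(tail, StandardCharsets.ISO_8859_1);
            return text.contains("%%EOF");
        }
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: PdfTrailerCheck <file.pdf>");
            return;
        }
        Path pdf = Paths.get(args[0]);
        try {
            System.out.println(hasEofMarker(pdf) ? "trailer present" : "trailer missing");
        } catch (IOException e) {
            System.out.println("could not read file: " + e.getMessage());
        }
    }
}
```

Running this right after OCR and again after the JVM that ran the OCR has exited should show whether the trailer is only flushed at exit. It does not prove that a handle is still open; it only reports what is currently on disk.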
Hi Quan,
The Tesseract developers have commented on the issue, saying it's an issue with Tess4J, not the C API.
https://code.google.com/p/tesseract-ocr/issues/detail?id=1410&can=1&sort=-id
OK, I sandwiched ProcessPages with ResultRendererBeginDocument and ResultRendererEndDocument. See if that made a difference.
I also have long noticed that the generated PDF by TessAPI1 on Windows would sometimes display a 'GlyphLessFont' message on opening, indicating some trailer bytes did not get written to the file.
Hi Quan,
I have tested your fix in the tess4j-2 branch and can confirm that the OCR'ed PDF is now closed properly, as expected.
Thanks for your help.
Hi Quan,
I am testing 3.0-beta and I found that the file handle is not properly closed.
Are you planning to port this fix to version 3.0?
Thank you
Hi Vagelis,
I believe this is a regression bug in Tesseract 3.04 beta, which is being bundled with Tess4J 3.0 beta. The bug probably has been fixed in later commits in Tesseract repo. I'm waiting for the availability of the final 3.04 DLL in order to test.
Quan
Hi Quan,
Indeed there was a bug in Tesseract 3.04 which is fixed in 3.04.01dev.
However in method
public void createDocuments(String[] filenames, String[] outputbases, List<RenderedFormat> formats)
you need to add a call to
api.TessDeleteResultRenderer(renderer);
I did this in the finally block and it works fine.
Also, I believe that the calls to TessResultRendererBeginDocument() and TessResultRendererEndDocument() are not needed, because Tesseract does both inside ProcessPages.
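The finally-block approach described in this post could look roughly like the sketch below. To keep it runnable without the native library, OcrApi and RecordingApi are hypothetical stand-ins invented for this illustration; in Tess4J the corresponding calls would be the renderer creation call, api.TessBaseAPIProcessPages(...) and api.TessDeleteResultRenderer(renderer). It shows only the ordering guarantee, not the actual Tess4J implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of "delete the renderer in a finally block"; all types here are hypothetical.
final class RendererCleanupSketch {

    /** Stand-in for the native calls discussed in this thread (not the Tess4J interface). */
    interface OcrApi {
        Object createPdfRenderer(String outputBase);             // stands in for creating a PDF renderer
        boolean processPages(String filename, Object renderer);  // stands in for TessBaseAPIProcessPages
        void deleteRenderer(Object renderer);                    // stands in for TessDeleteResultRenderer
    }

    /** Creates the renderer, runs OCR, and always deletes the renderer in the same method. */
    static void createDocument(OcrApi api, String filename, String outputBase) {
        Object renderer = api.createPdfRenderer(outputBase);
        try {
            if (!api.processPages(filename, renderer)) {
                throw new IllegalStateException("Error during processing page.");
            }
        } finally {
            // Runs on success and on failure, so the output file is released either way.
            api.deleteRenderer(renderer);
        }
    }

    /** Fake API that records the order of calls, used only to demonstrate the pattern. */
    static final class RecordingApi implements OcrApi {
        final List<String> calls = new ArrayList<>();
        private final boolean succeed;

        RecordingApi(boolean succeed) {
            this.succeed = succeed;
        }

        @Override
        public Object createPdfRenderer(String outputBase) {
            calls.add("create");
            return new Object();
        }

        @Override
        public boolean processPages(String filename, Object renderer) {
            calls.add("process");
            return succeed;
        }

        @Override
        public void deleteRenderer(Object renderer) {
            calls.add("delete");
        }
    }

    public static void main(String[] args) {
        RecordingApi api = new RecordingApi(false);
        try {
            createDocument(api, "input.tif", "output");
        } catch (IllegalStateException e) {
            System.out.println("processing failed: " + e.getMessage());
        }
        System.out.println("calls: " + api.calls);
    }
}
```

Keeping creation and deletion in the same method means a failed ProcessPages call cannot leave the renderer, and the file it holds, open until the JVM exits.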
Hi Vagelis,
I based the code on the example provided by a Tesseract developer:
Tesseract-OCR example of C-API: produce pdf output
It probably has changed substantially since they refactored the ResultRenderer API.
I found calling TessDeleteResultRenderer would immediately crash the JVM, so it has been commented out for now as shown in the repo. Please share your code change.
Thank you.
Quan
Last edit: Quan Nguyen 2015-07-30
This appears to be an issue with Tess4J 3.0.0. I am running under Windows 7 (64-bit) and Java 1.8.0_71-b15. I noticed that the files were still open because subsequent processing of the same files led to errors. I added a call to TessDeleteResultRenderer in createDocuments, right after the call to TessBaseAPIProcessPages, and things appear to work fine. I would like to see this fix applied.
Thanks for reporting. Will investigate the issue.
On reviewing the code, a better fix would be to delete the renderers in the method that allocated them.
Patch attached below.
diff --git a/src/main/java/net/sourceforge/tess4j/Tesseract.java b/src/main/java/net/sourceforge/tess4j/Tesseract.java
index 05227f6..c1c4803 100644
--- a/src/main/java/net/sourceforge/tess4j/Tesseract.java
+++ b/src/main/java/net/sourceforge/tess4j/Tesseract.java
@@ -537,6 +537,9 @@
+
Fix committed. Thank you.
Could you please share the complete code you use to produce output in PDF format? I am facing the same issue and have not been able to solve it.
There are code examples in the unit test.
To set the output format to PDF, which config file has to be changed so that the output does not come in the default (text) format?
The config files are only for the command-line executable. To output PDF, you'll need to use PDFRenderer.