Tess4J / Discussion / Help: Searchable PDF output using Tess4j 2.0 Beta

Panneer - 2015-01-28

Hi,

I am using Tess4J 2.0 Beta to create a searchable PDF as output. The PDF is created, but the file stays locked (open) until the JVM terminates. How do I force the PDF to close as soon as OCR completes?

Sample Code:

Tesseract tessaractInstance = Tesseract.getInstance();
List<RenderedFormat> list = new ArrayList<RenderedFormat>();
list.add(RenderedFormat.PDF);
tessaractInstance.setLanguage("eng");
tessaractInstance.setDatapath("D:/temp/Tess4j2.0Example");
tessaractInstance.createDocuments(pdfFile.getAbsolutePath(), "D:/temp/ocr", list);

Last edit: Quan Nguyen 2015-01-29

Quan Nguyen - 2015-01-29

The PDF creation is handled by Tesseract, so it seems that the file handle is still not released by it. The PDF code was improved in Tesseract 3.04, so you may want to try that version. If the behavior remains the same, you may want to consider submitting an issue with them.

https://code.google.com/p/tesseract-ocr/issues/list

Panneer - 2015-01-29

Thanks for the response.

Where would I find DLLs for Tesseract 3.04?

Quan Nguyen - 2015-01-29

They're available in the repository.

Panneer - 2015-01-29

Thanks again. The behaviour is the same in Tesseract 3.04: the file handle is not released. It looks like I may have to raise an issue with Tesseract.

Panneer - 2015-01-29

Raised an issue with Tesseract

https://code.google.com/p/tesseract-ocr/issues/detail?id=1410

Quan Nguyen - 2015-01-30

Thank you, Panneer.

How do you determine that a file is locked or open?

I looked through Tesseract's pdfrenderer.cpp file. It has a few fclose calls, but I'm not sure the PDF would be closed by them.

Quan

Panneer - 2015-02-01

My code tries to open the converted PDF after OCR; that's where I found that the PDF is locked.
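
One way to check this from Java is to try to open the file for writing and request an exclusive lock. The sketch below is only illustrative: `FileLockProbe` and `isLocked` are hypothetical names, and it assumes Windows-style semantics, where a handle left open by native code typically makes the open or lock attempt fail. On Linux and macOS, locks are advisory, so this check may report "not locked" while another handle is still open.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

final class FileLockProbe {

    /**
     * Returns true if the file cannot currently be opened for writing
     * and exclusively locked by this code.
     */
    static boolean isLocked(Path file) {
        if (!Files.isRegularFile(file)) {
            throw new IllegalArgumentException("Not a regular file: " + file);
        }
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.WRITE);
             FileLock lock = channel.tryLock()) {
            // tryLock() returns null when another program holds an overlapping lock.
            return lock == null;
        } catch (OverlappingFileLockException | IOException e) {
            // Either this JVM already holds a lock on the file, or the open failed
            // (for example a sharing violation on Windows, or no write permission).
            return true;
        }
    }

    public static void main(String[] args) {
        for (String name : args) {
            System.out.println(name + " locked: " + isLocked(Paths.get(name)));
        }
    }
}
```

Calling `FileLockProbe.isLocked(Paths.get("D:/temp/ocr.pdf"))` right after OCR would then give a quick indication of whether the output is still held open; the path here is only an example.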

Also, the PDF trailer characters (%%%%EOF\n) are appended to the output PDF only after the JVM terminates.

I also invoked the TessResultRendererEndDocument() method in TessAPI, since this method (in renderer.cpp) seemed to call EndDocumentHandler() in pdfrenderer.cpp.

However, it looks like TessPDFRenderer::EndDocumentHandler() is never invoked.
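
A missing trailer can also be detected without a PDF viewer, because a complete PDF file ends with an `%%EOF` marker. The following is a minimal sketch, assuming it is enough to inspect the last kilobyte of the file; `PdfTrailerCheck` and `endsWithEof` are hypothetical names, not part of Tess4J or Tesseract.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;

final class PdfTrailerCheck {

    private static final int TAIL_BYTES = 1024;

    /** Returns true if the last non-whitespace bytes of the file are "%%EOF". */
    static boolean endsWithEof(Path pdf) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(pdf.toFile(), "r")) {
            long length = raf.length();
            int tailLength = (int) Math.min(TAIL_BYTES, length);
            byte[] tail = new byte[tailLength];
            // Read only the end of the file; the trailer is always at the tail.
            raf.seek(length - tailLength);
            raf.readFully(tail);
            // ISO-8859-1 maps every byte to one char, so binary content cannot break decoding.
            String text = new String(tail, StandardCharsets.ISO_8859_1).trim();
            return text.endsWith("%%EOF");
        }
    }

    public static void main(String[] args) throws IOException {
        for (String name : args) {
            System.out.println(name + " has trailer: " + endsWithEof(Paths.get(name)));
        }
    }
}
```

If this returns false for a freshly generated file while the JVM is still running, the renderer has most likely not finished writing or has not been closed yet.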

Last edit: Panneer 2015-02-01

Panneer - 2015-02-10

Hi Quan,

The Tesseract developers have commented on the issue, saying it's an issue with Tess4J, not the C API.

https://code.google.com/p/tesseract-ocr/issues/detail?id=1410&can=1&sort=-id

Quan Nguyen - 2015-02-11

OK, I sandwiched ProcessPages between ResultRendererBeginDocument and ResultRendererEndDocument. See if that makes a difference.

I have also long noticed that the PDF generated by TessAPI1 on Windows would sometimes display a 'GlyphLessFont' message on opening, indicating that some trailer bytes did not get written to the file.

Panneer - 2015-02-15

Hi Quan,

I have tested your fix in the tess4j-2 branch and can confirm that the OCR'ed PDF is now closed properly, as expected.

Thanks for your help.

Vagelis Giannadakis - 2015-07-22

Hi Quan,

I am testing 3.0-beta and I found that the file handle is not properly closed.

Are you planning to port this fix to version 3.0?

Thank you

Quan Nguyen - 2015-07-25

Hi Vagelis,

I believe this is a regression bug in Tesseract 3.04 beta, which is bundled with Tess4J 3.0 beta. The bug has probably been fixed in later commits in the Tesseract repo. I'm waiting for the final 3.04 DLL to become available in order to test.

Quan

Vagelis Giannadakis - 2015-07-29

Hi Quan,

Indeed there was a bug in Tesseract 3.04 which is fixed in 3.04.01dev.

However in method
public void createDocuments(String[] filenames, String[] outputbases, List<RenderedFormat> formats)
you need to add a call to
api.TessDeleteResultRenderer(renderer);
I did this in the finally block and it works fine.
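
The pattern described here, releasing the renderer in a finally block so that the output file is closed even when processing fails, can be sketched without the native library. In the sketch below, `StubRenderer` is a hypothetical stand-in that simply holds an open output stream. It is not Tess4J's `TessResultRenderer`, and its `release()` method only plays the role that `TessDeleteResultRenderer` plays in this thread.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class RendererCleanupSketch {

    /** Hypothetical stand-in for a native renderer that keeps its output file open. */
    static final class StubRenderer {
        private final OutputStream out;
        private boolean released;

        StubRenderer(Path output) throws IOException {
            this.out = Files.newOutputStream(output);
        }

        void write(String text) throws IOException {
            out.write(text.getBytes(StandardCharsets.US_ASCII));
        }

        /** Closes the underlying stream, which releases the file handle. */
        void release() throws IOException {
            if (!released) {
                released = true;
                out.close();
            }
        }

        boolean isReleased() {
            return released;
        }
    }

    /** Writes one document; the renderer is released whether or not processing succeeds. */
    static void processAndRelease(StubRenderer renderer, boolean failDuringProcessing)
            throws IOException {
        try {
            renderer.write("page content\n");
            if (failDuringProcessing) {
                throw new IllegalStateException("Error during processing page.");
            }
            renderer.write("%%EOF\n");
        } finally {
            // Runs on success and on failure, so the file never stays open.
            renderer.release();
        }
    }

    public static void main(String[] args) throws IOException {
        Path output = Files.createTempFile("ocr-output", ".pdf");
        StubRenderer renderer = new StubRenderer(output);
        processAndRelease(renderer, false);
        System.out.println("released: " + renderer.isReleased());
    }
}
```

The point of the finally block is that the release step does not depend on the processing result, so a failed page cannot leave the output file locked.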

Also, I believe that the calls to TessResultRendererBeginDocument() and TessResultRendererEndDocument() are not needed, because Tesseract does both inside ProcessPages.

Quan Nguyen - 2015-07-30

Hi Vagelis,

I based the code on the example provided by a Tesseract developer:

Tesseract-OCR example of C-API: produce pdf output

It probably has changed substantially since they refactored the ResultRenderer API.

I found calling TessDeleteResultRenderer would immediately crash the JVM, so it has been commented out for now as shown in the repo. Please share your code change.

Thank you.

Quan

Last edit: Quan Nguyen 2015-07-30

Viraf - 2016-02-04

This appears to be an issue with Tess4J 3.0.0. I am running under Windows 7 64-bit and Java 1.8.0_71-b15. I noticed that the files were still open because subsequent processing of them led to errors. I added TessDeleteResultRenderer as shown below, and things appear to work fine. I would like to see this fix applied.

private void createDocuments(String filename, TessResultRenderer renderer) throws TesseractException {
    api.TessBaseAPISetInputName(handle, filename); //for reading a UNLV zone file
    int result = api.TessBaseAPIProcessPages(handle, filename, null, 0, renderer);
    api.TessDeleteResultRenderer(renderer);

    if (result == ITessAPI.FALSE) {
        throw new TesseractException("Error during processing page.");
    }
}


Quan Nguyen - 2016-02-05

Thanks for reporting. Will investigate the issue.

Viraf - 2016-02-05

On reviewing the code, a better fix would be to delete the renderers in the method that allocated them.
Patch attached below.

diff --git a/src/main/java/net/sourceforge/tess4j/Tesseract.java b/src/main/java/net/sourceforge/tess4j/Tesseract.java
index 05227f6..c1c4803 100644
--- a/src/main/java/net/sourceforge/tess4j/Tesseract.java
+++ b/src/main/java/net/sourceforge/tess4j/Tesseract.java
@@ -537,6 +537,9 @@
 
                 TessResultRenderer renderer = createRenderers(outputbases[i], formats);
                 createDocuments(filename, renderer);
+
+                api.TessDeleteResultRenderer(renderer);
+
             } catch (Exception e) {
                 // skip the problematic image file
                 logger.error(e.getMessage(), e);


Quan Nguyen - 2016-02-06

Fix committed. Thank you.

Rahul Dubey - 2018-01-23

Could you please share the complete code you use to produce PDF output? I am facing the same issue and have not been able to solve it.

Quan Nguyen - 2018-01-25

There are code examples in the unit test.

Rahul Dubey - 2018-01-25

To set the output to PDF format, which config file has to be changed so that the output does not come out in the default (text) format?

Quan Nguyen - 2018-01-25

The config files are only for the command-line executable. To output PDF, you'll need to use PDFRenderer.
