
Searchable PDF output using Tess4j 2.0 Beta

Forum: Help | Creator: Panneer | Created: 2015-01-28 | Updated: 2018-01-25
  • Panneer

    Panneer - 2015-01-28

    Hi,

    I am using Tess4J 2.0 Beta to create a searchable PDF as output. The PDF is created, but it stays locked and the file remains open until the JVM terminates. How do I force the PDF to close as soon as OCR completes?

    Sample Code:

    // Obtain the Tess4J 2.0 singleton instance
    Tesseract tesseractInstance = Tesseract.getInstance();
    // Request PDF as the only rendered output format
    List<RenderedFormat> list = new ArrayList<RenderedFormat>();
    list.add(RenderedFormat.PDF);
    tesseractInstance.setLanguage("eng");
    tesseractInstance.setDatapath("D:/temp/Tess4j2.0Example");
    // Arguments: input file path, output base name, requested formats
    tesseractInstance.createDocuments(
        pdfFile.getAbsolutePath(), "D:/temp/ocr", list);
    
     

    Last edit: Quan Nguyen 2015-01-29
  • Quan Nguyen

    Quan Nguyen - 2015-01-29

    The PDF creation is handled by Tesseract, so it seems that the file handle is still not released by it. The PDF code was improved in Tesseract 3.04, so you may want to try that version. If the behavior remains the same, you may want to consider submitting an issue with them.

    https://code.google.com/p/tesseract-ocr/issues/list

     
  • Panneer

    Panneer - 2015-01-29

    Thanks for the response.

    Where would I find DLLs for Tesseract 3.04?

     
  • Quan Nguyen

    Quan Nguyen - 2015-01-29

    They're available in the repository.

     
  • Panneer

    Panneer - 2015-01-29

    Thanks again. The behaviour is the same in Tesseract 3.04: the file handle is not released. It looks like I may have to raise an issue with Tesseract.

     
  • Quan Nguyen

    Quan Nguyen - 2015-01-30

    Thank you, Panneer.

    How do you determine that a file is locked or open?

    I looked through Tesseract's pdfrenderer.cpp file; it has a few fclose calls, but I'm not sure the PDF would be closed by them.

    Quan

     
  • Panneer

    Panneer - 2015-02-01

    My code tries to open the converted PDF after OCR; that's where I found that the PDF is locked.

    Also, the PDF trailer (%%EOF) is appended to the output PDF only after the JVM terminates.

    I also invoked the TessResultRendererEndDocument() method in TessAPI, since this method (in renderer.cpp) seemed to call EndDocumentHandler() in pdfrenderer.cpp.

    However, it looks like TessPDFRenderer::EndDocumentHandler() is never invoked.
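
A minimal sketch of how the missing trailer could be detected from Java, using only the standard library. It checks whether a file ends with the `%%EOF` end-of-file marker that a complete PDF carries. The class `PdfTrailerCheck` and its methods are hypothetical helpers and are not part of Tess4J or Tesseract; the sample bytes in `main` are a stub, not a valid PDF.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper (not part of Tess4J): checks whether a PDF looks fully written. */
final class PdfTrailerCheck {

    private static final byte[] EOF_MARKER = "%%EOF".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the data ends with %%EOF, ignoring trailing CR, LF, and spaces. */
    static boolean hasEofMarker(byte[] data) {
        int end = data.length;
        while (end > 0 && (data[end - 1] == '\n' || data[end - 1] == '\r' || data[end - 1] == ' ')) {
            end--;
        }
        if (end < EOF_MARKER.length) {
            return false;
        }
        for (int i = 0; i < EOF_MARKER.length; i++) {
            if (data[end - EOF_MARKER.length + i] != EOF_MARKER[i]) {
                return false;
            }
        }
        return true;
    }

    /** Reads only the last few bytes of the file and checks them for the marker. */
    static boolean hasEofMarker(Path pdf) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(pdf.toFile(), "r")) {
            int tailLen = (int) Math.min(raf.length(), 32);
            byte[] tail = new byte[tailLen];
            raf.seek(raf.length() - tailLen);
            raf.readFully(tail);
            return hasEofMarker(tail);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stub content standing in for a real OCR output file.
        Path tmp = Files.createTempFile("sample", ".pdf");
        try {
            Files.write(tmp, "%PDF-1.5\n%%EOF\n".getBytes(StandardCharsets.US_ASCII));
            System.out.println("complete: " + hasEofMarker(tmp));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Running a check like this right after createDocuments() returns, and again after the JVM exits, would show whether the trailer is written late. It only inspects file contents; it does not prove that the native file handle was released.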

     

    Last edit: Panneer 2015-02-01
  • Quan Nguyen

    Quan Nguyen - 2015-02-11

    OK, I sandwiched ProcessPages between ResultRendererBeginDocument and ResultRendererEndDocument. See if that makes a difference.

    I have also long noticed that the PDF generated by TessAPI1 on Windows would sometimes display a 'GlyphLessFont' message on opening, indicating that some trailer bytes did not get written to the file.

     
  • Panneer

    Panneer - 2015-02-15

    Hi Quan,

    I have tested your fix in the tess4j-2 branch and can confirm that the OCR'ed PDF is now closed properly, as expected.

    Thanks for your help.

     
  • Vagelis Giannadakis

    Hi Quan,

    I am testing 3.0-beta and I found that the file handle is not properly closed.

    Are you planning to port this fix to version 3.0?

    Thank you

     
  • Quan Nguyen

    Quan Nguyen - 2015-07-25

    Hi Vagelis,

    I believe this is a regression bug in Tesseract 3.04 beta, which is bundled with Tess4J 3.0 beta. The bug has probably been fixed in later commits in the Tesseract repo. I'm waiting for the final 3.04 DLL to become available in order to test.

    Quan

     
  • Vagelis Giannadakis

    Hi Quan,

    Indeed, there was a bug in Tesseract 3.04, which is fixed in 3.04.01dev.

    However, in the method
    public void createDocuments(String[] filenames, String[] outputbases, List<RenderedFormat> formats)
    you need to add a call to
    api.TessDeleteResultRenderer(renderer);
    I did this in the finally block and it works fine.

    Also, I believe that the calls to TessResultRendererBeginDocument() and TessResultRendererEndDocument() are not needed, because Tesseract does both inside ProcessPages.
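
The "delete the renderer in a finally block" idea can be sketched without the native library. In the illustration below, `NativeRenderer` is a hypothetical stand-in for a native renderer handle, and its `delete()` method only plays the role that `TessDeleteResultRenderer` has in the thread; none of these names come from Tess4J. The point is only that the release step runs whether processing succeeds or throws.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: shows release-in-finally with a stand-in for a native handle. */
final class RendererCleanupSketch {

    /** Hypothetical stand-in for a native renderer; records its lifecycle events. */
    static final class NativeRenderer {
        final List<String> events = new ArrayList<>();
        boolean deleted;

        /** Simulates page processing; fails on an empty file name. */
        void processPages(String filename) {
            if (filename.isEmpty()) {
                throw new IllegalArgumentException("no input file");
            }
            events.add("process " + filename);
        }

        /** Simulates releasing the native resource (and with it the open file). */
        void delete() {
            deleted = true;
            events.add("delete");
        }
    }

    /** Processes one file and always releases the renderer. Returns true on success. */
    static boolean createDocument(NativeRenderer renderer, String filename) {
        try {
            renderer.processPages(filename);
            return true;
        } catch (RuntimeException e) {
            // Skip the problematic file, as the surrounding loop in the thread does.
            return false;
        } finally {
            // Runs on both the success path and the failure path.
            renderer.delete();
        }
    }

    public static void main(String[] args) {
        NativeRenderer renderer = new NativeRenderer();
        boolean ok = createDocument(renderer, "scan.tif");
        System.out.println(ok + " " + renderer.events);
    }
}
```

Whether the real native call is safe at that point depends on the Tesseract build in use, so this shows only the control flow, not a verified fix.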

     
  • Quan Nguyen

    Quan Nguyen - 2015-07-30

    Hi Vagelis,

    I based the code on the example provided by a Tesseract developer:

    Tesseract-OCR example of C-API: produce pdf output

    It probably has changed substantially since they refactored the ResultRenderer API.

    I found calling TessDeleteResultRenderer would immediately crash the JVM, so it has been commented out for now as shown in the repo. Please share your code change.

    Thank you.

    Quan

     

    Last edit: Quan Nguyen 2015-07-30
  • Viraf

    Viraf - 2016-02-04

    This appears to be an issue with Tess4J 3.0.0. I am running under Windows 7 64-bit with Java 1.8.0_71-b15. I noticed that the files were left open, because subsequent processing of the same files led to errors. I added TessDeleteResultRenderer as shown below, and things appear to work fine. I would like to see this fix applied.

    private void createDocuments(String filename, TessResultRenderer renderer) throws TesseractException {
        api.TessBaseAPISetInputName(handle, filename); //for reading a UNLV zone file
        int result = api.TessBaseAPIProcessPages(handle, filename, null, 0, renderer);
        api.TessDeleteResultRenderer(renderer);
    
        if (result == ITessAPI.FALSE) {
            throw new TesseractException("Error during processing page.");
        }
    }
    
     
  • Quan Nguyen

    Quan Nguyen - 2016-02-05

    Thanks for reporting. Will investigate the issue.

     
  • Viraf

    Viraf - 2016-02-05

    On reviewing the code, a better fix would be to delete the renderers in the method that allocated them.
    Patch attached below.

    diff --git a/src/main/java/net/sourceforge/tess4j/Tesseract.java b/src/main/java/net/sourceforge/tess4j/Tesseract.java
    index 05227f6..c1c4803 100644
    --- a/src/main/java/net/sourceforge/tess4j/Tesseract.java
    +++ b/src/main/java/net/sourceforge/tess4j/Tesseract.java
    @@ -537,6 +537,9 @@
     
                     TessResultRenderer renderer = createRenderers(outputbases[i], formats);
                     createDocuments(filename, renderer);
    +
    +                api.TessDeleteResultRenderer(renderer);
    +
                 } catch (Exception e) {
                     // skip the problematic image file
                     logger.error(e.getMessage(), e);
     
  • Quan Nguyen

    Quan Nguyen - 2016-02-06

    Fix committed. Thank you.

     
  • Rahul Dubey

    Rahul Dubey - 2018-01-23

    Could you please share the complete code that produces output in PDF format? I am facing the same issue but have not been able to solve it.

     
  • Quan Nguyen

    Quan Nguyen - 2018-01-25

    There are code examples in the unit test.

     
  • Rahul Dubey

    Rahul Dubey - 2018-01-25

    To get output in PDF format, which config file has to be changed so that the output is not produced in the default (text) format?

     
  • Quan Nguyen

    Quan Nguyen - 2018-01-25

    The config files are only for the command-line executable. To output PDF, you'll need to use PDFRenderer.

     
