
#7 Different HOCR result than from tesseract.exe

None
closed
hocr (1)
5
2014-10-13
2013-11-28
4F2E4A2E
No

Hi!
I am trying to get my head around this and I hope you can help me.
I get different hOCR results from:

  • tess4j-api vs. tess4j-package
  • tess4j-api vs. tesseract.exe cmd (>tesseract.exe img.tif out hocr)

tess4j-api vs. tess4j-package

It seems, IMHO, that right now I cannot reliably create hOCR from images and then convert it into a searchable PDF, since the hOCR result returned by libtesseract302.dll is a different one ...

Maybe "liblept168.dll" or "libtesseract302.dll" is an older version? What else could be the reason for the different results, since the method is just doing what it should:

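  /**
   * Gets recognized text.
   */
  protected String getOCRText() {
      // hOCR output takes a zero-based page index; plain text output does not.
      Pointer utf8Text = hocr ? TessBaseAPIGetHOCRText(handle, pageNum - 1) : TessBaseAPIGetUTF8Text(handle);
      String str = utf8Text.getString(0);
      // Free the native buffer that Tesseract allocated for the text.
      TessDeleteText(utf8Text);
      return str;
  }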
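To pin down where two hOCR outputs actually diverge (API vs. command line), a small comparison helper can be useful. The following is only a sketch: the class name `HocrDiff`, its methods, and the sample fragments are hypothetical and not part of Tess4J or Tesseract. It ignores indentation and blank lines so that only content differences are reported.

```java
import java.util.ArrayList;
import java.util.List;

public class HocrDiff {

    // Splits the text into lines, trims each one, and drops empty lines,
    // so that pure whitespace differences are not reported.
    static List<String> normalize(String hocr) {
        List<String> out = new ArrayList<>();
        for (String line : hocr.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                out.add(trimmed);
            }
        }
        return out;
    }

    // Returns the zero-based index of the first differing normalized line,
    // or -1 if both outputs are identical after normalization.
    static int firstDifference(String a, String b) {
        List<String> la = normalize(a);
        List<String> lb = normalize(b);
        int common = Math.min(la.size(), lb.size());
        for (int i = 0; i < common; i++) {
            if (!la.get(i).equals(lb.get(i))) {
                return i;
            }
        }
        // One output is a prefix of the other, or they are equal.
        return la.size() == lb.size() ? -1 : common;
    }

    public static void main(String[] args) {
        // Illustrative fragments only, not real Tesseract output.
        String fromApi = "<div class='ocr_page' id='page_1'>\n</div>";
        String fromCli = "<html>\n<body>\n<div class='ocr_page' id='page_1'>\n</div>\n</body>\n</html>";
        System.out.println("First differing line index: " + firstDifference(fromApi, fromCli));
    }
}
```

In practice the two strings would come from `getOCRText()` and from the `.html` file that the command-line run writes.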

Thank you in advance!
cheers

1 Attachments

Discussion

  • Quan Nguyen

    Quan Nguyen - 2013-11-28

    The DLL bundled with Tess4J v1.2 is based on Tesseract r866, and the one bundled with v1.1 is based on r828. Some hOCR fixes were incorporated between those two revisions, which explains the differences you see in the output.

    http://code.google.com/p/tesseract-ocr/source/list?num=25&start=866

     

    Last edit: Quan Nguyen 2013-11-28
  • 4F2E4A2E

    4F2E4A2E - 2013-11-28

    So what stands in the way of updating it to the newest version? Which dependency stands in the way? Or just lack of tests and time?

     
  • Quan Nguyen

    Quan Nguyen - 2013-11-28

    Both revisions can be considered as belonging to the same Tesseract version, 3.02.x. After r866, there have been significant commits to the code repository for the upcoming version 3.03. There could be some breaking changes to the TessBaseAPI.

     
  • 4F2E4A2E

    4F2E4A2E - 2013-11-28

    So when do you plan to update? Is now the time to implement lots of JUnit test cases so that an update to a newer version can be done more easily?

     
  • Quan Nguyen

    Quan Nguyen - 2013-11-28

    Not until the Tesseract folks officially release 3.03, because the JNA bindings that Tess4J uses depend on Tesseract's C API to work. We want to ensure their compatibility.

     
  • 4F2E4A2E

    4F2E4A2E - 2013-11-28

    Great! Thank you!

    I just found out that the base API of tesseract-ocr does not produce the complete hOCR HTML, so IMHO it is impossible for Tess4J to deliver the same hOCR HTML.

    Tess4J could, however, make this optional and offer a way to set the HTML tags outside the body.

    FYI: http://code.google.com/p/tesseract-ocr/issues/detail?id=1028

     

    Last edit: 4F2E4A2E 2013-11-28
  • Quan Nguyen

    Quan Nguyen - 2013-12-02

    The htmlBeginTag and htmlEndTag constant fields contain values extracted from the Tesseract codebase. Tess4J should produce the same hOCR string as Tesseract for the same version.
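The wrapping described above can be sketched as follows. This is a hedged illustration, not the Tess4J implementation: the class `HocrWrapper`, the method `wrap`, and the literal header and footer strings are assumptions chosen for the example; the real htmlBeginTag and htmlEndTag values are taken from the Tesseract source and may differ between versions.

```java
import java.util.Arrays;
import java.util.List;

public class HocrWrapper {

    // Hypothetical placeholder values, NOT the actual constants from Tess4J or Tesseract.
    static final String HTML_BEGIN_TAG =
            "<!DOCTYPE html>\n<html>\n<head>\n"
            + "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\"/>\n"
            + "</head>\n<body>\n";
    static final String HTML_END_TAG = "</body>\n</html>\n";

    // Surrounds the per-page hOCR fragments (as returned by the base API)
    // with a document header and footer to form a complete HTML document.
    static String wrap(List<String> pageFragments) {
        StringBuilder sb = new StringBuilder(HTML_BEGIN_TAG);
        for (String fragment : pageFragments) {
            sb.append(fragment);
            if (!fragment.endsWith("\n")) {
                sb.append('\n');
            }
        }
        return sb.append(HTML_END_TAG).toString();
    }

    public static void main(String[] args) {
        // Illustrative fragment only, not real Tesseract output.
        String page = "<div class='ocr_page' id='page_1'></div>";
        System.out.print(wrap(Arrays.asList(page)));
    }
}
```

Keeping the header and footer as separate constants means they can be swapped to match whichever Tesseract revision the bundled DLL was built from.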

     
  • 4F2E4A2E

    4F2E4A2E - 2014-09-05

    Problem solved:

    Fixed: you can get the complete hOCR HTML (and PDF ;-) ) with the current code (r1040).
    An example of how to use it with the C API can be found on the tesseract-dev forum[1].

    For more information see: https://code.google.com/p/tesseract-ocr/issues/detail?id=1028#c7

    Issue can be closed, thank you!

     
  • Quan Nguyen

    Quan Nguyen - 2014-09-05
    • status: open --> closed
     
