Different HOCR result than from tesseract.exe

Brought to you by: nguyenq

#7 Different HOCR result than from tesseract.exe

Milestone: None

Status: closed

Owner: Quan Nguyen

Labels: hocr (1)

Priority: 5

Updated: 2014-10-13

Created: 2013-11-28

Creator: 4F2E4A2E

Private: No

Hi!
I am trying to get my head around this and I hope you can help me.
The hOCR results differ between:

tess4j-api vs. tess4j-package
tess4j-api vs. tesseract.exe cmd (>tesseract.exe img.tif out hocr)

It seems imho right now to create hOCR out of images and then convert them into a searchable PDF, since the hOCR result being returned by libtesseract302.dll is another ...

Maybe the "liblept168.dll" or "libtesseract302.dll" is an older version, or what else could be the reason for the different results, since the method is just doing what it should:

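  /**
   * Gets recognized text.
   */
  protected String getOCRText() {
      // hOCR output is requested for a zero-based page index; plain text takes no page argument.
      Pointer utf8Text = hocr ? TessBaseAPIGetHOCRText(handle, pageNum - 1) : TessBaseAPIGetUTF8Text(handle);
      String str = utf8Text.getString(0);
      // Release the native text buffer once it has been copied into a Java String.
      TessDeleteText(utf8Text);
      return str;
  }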

Thank you in advance!
cheers

1 Attachments

2013-11-27_16_54_40-Beyond Compare.png

Discussion

4F2E4A2E - 2013-11-28

Adding the same image without whitespace, since I am not able to edit the ticket after posting it.

Just updated the images to better explain what's going on:

Last edit: 4F2E4A2E 2013-11-28

2013-11-28_11_24_57-eurotext_tess4j.html_--_eurotext_tesseract-exe.html-TextCompare-BeyondCo.png

2013-11-28_11_27_45-eurotext_tess4j.html_--_eurotext_tesseract-exe.html-TextCompare-BeyondCo.png

Quan Nguyen - 2013-11-28

The DLL bundled with Tess4J v1.2 is based on Tesseract r866, and the one bundled with v1.1 is based on r828. Some hOCR fixes were incorporated between those two revisions, which explains the differences you see in the outputs.

http://code.google.com/p/tesseract-ocr/source/list?num=25&start=866
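
When comparing hOCR output from two builds, it can help to rule out pure whitespace differences first. The following is only a minimal sketch of such a check; the class and method names are made up for illustration and are not part of Tess4J or Tesseract.

```java
/** Hypothetical helper for comparing two hOCR strings; not part of Tess4J. */
final class HocrDiffSketch {

    /** Collapses every run of whitespace into a single space and trims the ends. */
    static String normalize(String hocr) {
        return hocr.replaceAll("\\s+", " ").trim();
    }

    /**
     * Returns the index (in the normalized text) of the first differing
     * character, or -1 if both inputs are equal after normalization.
     */
    static int firstDifference(String a, String b) {
        String x = normalize(a);
        String y = normalize(b);
        int common = Math.min(x.length(), y.length());
        for (int i = 0; i < common; i++) {
            if (x.charAt(i) != y.charAt(i)) {
                return i;
            }
        }
        // One text is a prefix of the other, or they are identical.
        return x.length() == y.length() ? -1 : common;
    }
}
```

A result of -1 means the two outputs differ only in the amount of whitespace. Any other value points at the first real difference, for example a changed attribute or bounding box. Whitespace that is present in one output and entirely absent in the other still counts as a difference.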

Last edit: Quan Nguyen 2013-11-28

4F2E4A2E - 2013-11-28

So what stands in the way of updating it to the newest version? Which dependency stands in the way? Or just lack of tests and time?

Quan Nguyen - 2013-11-28

Both revisions can be considered as belonging to the same Tesseract version, 3.02.x. Since r866, there have been significant commits to the code repository for the upcoming version 3.03. There could be some breaking changes to the TessBaseAPI.

4F2E4A2E - 2013-11-28

So when do you plan to update? Is now the time to implement lots of JUnit test cases, so that an update to a newer version can be done more easily?

Quan Nguyen - 2013-11-28

We will wait until the Tesseract folks officially release 3.03, because the JNA bindings that Tess4J uses depend on Tesseract's C API to work. We want to ensure their compatibility.

4F2E4A2E - 2013-11-28

Great! Thank you!

I just found out that the base API of tesseract-ocr does not support the complete hOCR HTML, so it's imho impossible for tess4j to deliver the same hOCR HTML.

tess4j could, however, make this optional and offer the possibility to set the HTML tags outside the body.

FYI: http://code.google.com/p/tesseract-ocr/issues/detail?id=1028

Last edit: 4F2E4A2E 2013-11-28

Quan Nguyen - 2013-12-02

The htmlBeginTag and htmlEndTag constant fields contain values extracted from the Tesseract codebase. Tess4J should produce the same hOCR string as Tesseract for the same given version.
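
As a rough illustration of that wrapping, the sketch below puts a body fragment, such as the one returned by the base API, between a begin and an end string. The class name, method name, and markup are placeholders assumed for this example; the real htmlBeginTag and htmlEndTag values are taken from the Tesseract sources and may differ.

```java
/** Hypothetical sketch of wrapping an hOCR body fragment; not the Tess4J implementation. */
final class HocrDocumentSketch {

    // Placeholder markup, assumed for illustration only.
    static final String HTML_BEGIN_TAG =
            "<!DOCTYPE html>\n<html>\n<head>\n<title></title>\n"
            + "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\" />\n"
            + "<meta name=\"ocr-system\" content=\"tesseract\" />\n"
            + "</head>\n<body>\n";

    static final String HTML_END_TAG = "</body>\n</html>\n";

    /** Surrounds the fragment with the begin and end markup to form a full page. */
    static String wrap(String hocrFragment) {
        if (hocrFragment == null) {
            throw new IllegalArgumentException("hocrFragment must not be null");
        }
        return HTML_BEGIN_TAG + hocrFragment + HTML_END_TAG;
    }
}
```

Calling HocrDocumentSketch.wrap(fragment) yields a complete HTML document, whereas the fragment alone contains only the page-level markup.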

4F2E4A2E - 2014-09-05

Problem solved:

Fixed: you can get the complete hOCR HTML (and PDF ;-) ) with the current code (r1040).
An example of how to use it with the C API can be found on the tesseract-dev forum[1].

For more information see: https://code.google.com/p/tesseract-ocr/issues/detail?id=1028#c7

Issue can be closed, thank you!

Quan Nguyen - 2014-09-05

status: open --> closed