Different HOCR result than from tesseract.exe
Hi!
I am trying to get my head around this and I hope you can help me.
!= hocr result from:

It seems imho impossible right now to create hOCR out of images and then convert them into a searchable PDF, since the hOCR result being returned by libtesseract302.dll is another ...
Maybe "liblept168.dll" or "libtesseract302.dll" is an older version? Or what else could be the reason for the different results, since the method is just doing what it should:
/**
 * Gets recognized text.
 */
protected String getOCRText() {
    Pointer utf8Text = hocr ? TessBaseAPIGetHOCRText(handle, pageNum - 1) : TessBaseAPIGetUTF8Text(handle);
    String str = utf8Text.getString(0);
    TessDeleteText(utf8Text);
    return str;
}
Thank you in advance!
cheers
Anonymous
Adding the same image without whitespace, since I am not able to edit the ticket after posting it.
Just updated the images to better explain what's going on:
Last edit: 4F2E4A2E 2013-11-28
The DLL bundled with Tess4J v1.2 is based on Tesseract r866, and the one bundled with v1.1 is based on r828. Some hOCR fixes were incorporated between those revisions, which explains the differences you see in the outputs.
http://code.google.com/p/tesseract-ocr/source/list?num=25&start=866
Last edit: Quan Nguyen 2013-11-28
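To pin down exactly where two hOCR outputs diverge (for example one from tesseract.exe and one from the bundled DLL), a line-by-line comparison can help. The following is only a minimal sketch: the class HocrDiff and the method firstDifference are hypothetical names, not part of Tess4J, and the sample strings are made up rather than real output from either build.

```java
class HocrDiff {

    /**
     * Returns the 1-based number of the first line that differs between
     * the two inputs, or -1 if they match. Leading and trailing whitespace
     * on each line is ignored, and any line terminator is accepted.
     */
    static int firstDifference(String a, String b) {
        String[] x = a.split("\\R", -1);
        String[] y = b.split("\\R", -1);
        int common = Math.min(x.length, y.length);
        for (int i = 0; i < common; i++) {
            if (!x[i].trim().equals(y[i].trim())) {
                return i + 1;
            }
        }
        // All shared lines match; a difference remains only if one input is longer.
        return x.length == y.length ? -1 : common + 1;
    }

    public static void main(String[] args) {
        // Made-up hOCR-like fragments, for illustration only.
        String fromExe = "<div class='ocr_page' id='page_1'>\n<span class='ocr_line' title='bbox 10 10 90 30'>\n";
        String fromDll = "<div class='ocr_page' id='page_1'>\n<span class='ocr_line' title='bbox 10 12 90 30'>\n";
        System.out.println("First differing line: " + firstDifference(fromExe, fromDll));
    }
}
```

Feeding it the two real hOCR strings shows quickly whether the difference is in the markup itself or only in the surrounding HTML.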
So what stands in the way of updating it to the newest version? Which dependency stands in the way? Or just lack of tests and time?
Both revisions can be considered part of the same Tesseract version, 3.02.x. After r866, there have been significant commits to the code repository for the upcoming version 3.03. There could be some breaking changes to the TessBaseAPI.
So when do you plan to update? Is now the time to implement lots of JUnit test cases so that an update to a newer version can be done more easily?
Not until the Tesseract folks officially release 3.03, because the JNA bindings that Tess4J uses depend on Tesseract's C API to work. We want to ensure their compatibility.
Great! Thank you!
I just found out that the base API from tesseract-ocr does not return the complete hOCR HTML, so it's imho impossible for Tess4J to deliver the same hOCR HTML.
Tess4J could, however, make this optional and offer a way to add the HTML tags outside the body.
FYI: http://code.google.com/p/tesseract-ocr/issues/detail?id=1028
Last edit: 4F2E4A2E 2013-11-28
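The idea of adding the HTML tags outside the body can be sketched roughly as follows. This is a minimal illustration under assumptions: the class HocrWrapper, the method wrap, and the HTML_BEGIN and HTML_END strings are hypothetical and are not the values that Tesseract or Tess4J actually ship.

```java
class HocrWrapper {

    // Hypothetical header and footer, for illustration only.
    static final String HTML_BEGIN =
            "<!DOCTYPE html>\n<html>\n<head>\n<title></title>\n"
            + "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\" />\n"
            + "<meta name=\"ocr-system\" content=\"tesseract\" />\n"
            + "</head>\n<body>\n";
    static final String HTML_END = "</body>\n</html>\n";

    /**
     * Returns the fragment unchanged, or surrounded by the header and
     * footer when a complete HTML document is requested.
     */
    static String wrap(String fragment, boolean fullDocument) {
        if (!fullDocument) {
            return fragment;
        }
        return HTML_BEGIN + fragment + HTML_END;
    }

    public static void main(String[] args) {
        // Made-up body fragment standing in for what the API returns.
        String fragment = "<div class='ocr_page' id='page_1'></div>\n";
        System.out.print(wrap(fragment, true));
    }
}
```

Keeping the wrapping behind a flag leaves the raw fragment available to callers who want to assemble a multi-page document themselves.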
The htmlBeginTag and htmlEndTag constant fields contain values extracted from the Tesseract codebase. Tess4J should produce the same hOCR string as Tesseract for the same given version.
Problem solved:
For more information see: https://code.google.com/p/tesseract-ocr/issues/detail?id=1028#c7
Issue can be closed, thank you!