#229 Tesseract's recognized text is not recovered

Milestone: v1.0_(example)

Status: closed-fixed

Owner: nobody

Labels: None

Priority: 5

Updated: 2017-04-24

Created: 2016-08-28

Creator: papoteur

Private: No

Version : 1.2.5

Open a PDF file
Run OCR with tesseract.
The page is processed, but the "OCR tab" remains empty.
In the log, I get the corresponding lines (I replaced the source file name with <myfile.pdf>):

INFO - 1 pages
INFO - pdfimages -f 1 -l 1 "<myfile.pdf>" x
INFO - New page filename x-000.ppm, format Portable pixmap format (color)
INFO - New page filename /tmp/gscan2pdf-Lq1D/g9C4aGQW6w.png, format Portable Network Graphics
INFO - Added /tmp/gscan2pdf-Lq1D/pErqBf8Gs0.png at page 1 with resolution 199.950168350168
DEBUG - Started setting page_number_start from 1 to 2
DEBUG - Finished setting page_number_start from 1 to 2
INFO - Found tesseract version 3.02.02.
INFO - echo tessedit_create_hocr 1 > hocr.config;tesseract /tmp/gscan2pdf-Lq1D/pErqBf8Gs0.png /tmp/2hXGRZlhDI -l fra +hocr.config;rm hocr.config
DEBUG - Warnings from Tesseract: Tesseract Open Source OCR Engine v3.02.02 with Leptonica

INFO - Replaced /tmp/gscan2pdf-Lq1D/pErqBf8Gs0.png at page 1 with /tmp/gscan2pdf-Lq1D/wFB4NSdVrc.png, resolution 199.950168350168

When I run
echo tessedit_create_hocr 1 > hocr.config;tesseract /tmp/gscan2pdf-Lq1D/wFB4NSdVrc.png /tmp/2hXGRZlhDI -l fra +hocr.config
I get a /tmp/2hXGRZlhDI.html file with the correct content.
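
The manual check above can be sketched as a small shell helper that reports which hOCR file tesseract actually wrote for a given output base. This is only an illustration: `find_hocr_output` is a hypothetical name, not part of gscan2pdf or tesseract, and it assumes the output may carry either a `.html` extension (as seen here with 3.02.02) or a `.hocr` extension (assumed for other tesseract versions).

```shell
# Hypothetical helper: print the hOCR file found for an output base,
# trying the assumed .hocr extension first, then the .html seen with 3.02.02.
find_hocr_output() {
    base=$1
    for ext in hocr html; do
        if [ -f "$base.$ext" ]; then
            printf '%s\n' "$base.$ext"
            return 0
        fi
    done
    # Nothing found under either extension.
    return 1
}

# Usage, with the paths from the log above:
#   echo tessedit_create_hocr 1 > hocr.config
#   tesseract /tmp/gscan2pdf-Lq1D/wFB4NSdVrc.png /tmp/2hXGRZlhDI -l fra +hocr.config
#   find_hocr_output /tmp/2hXGRZlhDI
```

If the helper prints a path while the OCR tab stays empty, that would suggest the text is produced but not picked up from that file name.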

The output location seems wrong, perhaps related to:
https://sourceforge.net/p/gscan2pdf/bugs/202/
Is the .html at the end expected?

Papoteur

Discussion

papoteur - 2016-08-28

Still valid in 1.5.2

Jeffrey Ratcliffe - 2017-04-19

Apologies for the late response. Tesseract works with gscan2pdf for me. Can you post an example PDF that reproduces the problem?

papoteur - 2017-04-21

Hello,
This is no longer reproducible in Mageia 6/cauldron with the 1.7.2 release.
So we can close this, although Mageia 5 is still a maintained release.

Jeffrey Ratcliffe - 2017-04-24

status: open --> closed-fixed