
#386 gscan2pdf needs help parsing output from opencl-enabled tesseract

v1.0_(example)
closed-fixed
nobody
None
5
2021-05-06
2021-05-04
No

I am working with gscan2pdf version 2.12.1. I recently recompiled tesseract version 4.1.1 with support for OpenCL. When OpenCL is enabled, gscan2pdf does not correctly detect the languages that tesseract supports.

When I run:

tesseract --list-langs

from the command line, tesseract returns:

[DS] Profile read from file (tesseract_opencl_profile_devices.dat).
[DS] Device[1] 0:(null) score is 0.399570
[DS] Selected Device[1]: "(null)" (Native)
List of available languages (3):
eng
enm
osd

(NB: the lines beginning with [DS] are present only when OpenCL support is enabled.)

When I start gscan2pdf from the command line with --log=/path/to/logfile, the resulting log contains:

INFO - tesseract -v
INFO - Found tesseract version v4.1.1.
INFO - tesseract --list-langs
INFO - **Found tesseract language** [DS] Profile read from file (tesseract_opencl_profile_devices.dat). ([DS] Profile read from file (tesseract_opencl_profile_devices.dat).)
INFO - **Found tesseract language** [DS] Device[1] 0:(null) score is 0.399570 ([DS] Device[1] 0:(null) score is 0.399570)
INFO - **Found tesseract language** [DS] Selected Device[1]: "(null)" (Native) ([DS] Selected Device[1]: "(null)" (Native))
WARN - You are using locale 'en_US.UTF-8'. Please install tesseract package 'tesseract-ocr-eng' and restart gscan2pdf for OCR for English with tesseract.

If I choose 'OK' on the warning message dialog:

Warning: missing packages
You are using locale 'en_US.UTF-8'. Please install tesseract package 'tesseract-ocr-eng' and restart gscan2pdf for OCR for English with tesseract.

and initiate an OCR process using gscan2pdf, the OCR setup dialog offers an option to detect one of three languages:

[DS] Profile read from file (tesseract_opencl_profile_devices.dat).
[DS] Device[1] 0:(null) score is 0.399570
[DS] Selected Device[1]: "(null)" (Native)

I made a quick workaround (attached), after which gscan2pdf recognized the languages supported by tesseract and successfully processed a 580-page document very quickly using the OpenCL-enabled tesseract.

I hope you will consider adapting gscan2pdf to support OpenCL-enabled tesseract.
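
The attached workaround is not reproduced in this ticket text. Separately, as an illustration of the parsing problem described above, here is a minimal Python sketch (not gscan2pdf's Perl code) that extracts language codes from the `tesseract --list-langs` output shown earlier by skipping the `[DS]` diagnostic lines and the header line. The function name `parse_langs` is hypothetical, and the only assumption about the diagnostics is the `[DS]` prefix seen in this report.

```python
import re

# Lines to ignore: the header that precedes the language list, and the
# OpenCL device-selection diagnostics, which start with "[DS]" in this report.
HEADER = re.compile(r"^List of available languages")
DIAGNOSTIC = re.compile(r"^\[DS\]")


def parse_langs(output):
    """Return the language codes found in `tesseract --list-langs` output."""
    codes = []
    for line in output.splitlines():
        line = line.strip()
        if not line or HEADER.match(line) or DIAGNOSTIC.match(line):
            continue
        codes.append(line)
    return codes
```

Filtering by prefix works even when standard output and standard error have been merged into one stream, but it depends on the diagnostic lines keeping the same prefix.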

1 Attachment

Discussion

  • Jeffrey Ratcliffe

    Committed. Thanks for the patch.

     
  • Jeffrey Ratcliffe

    • status: open --> closed-fixed
     
  • Uwe Brinkhoff

    Uwe Brinkhoff - 2021-05-06

    Sorry, I don't see the fix.

    I have the same problem, and a proposed solution.

    The command "tesseract --list-langs" writes the language list to standard output; the other output (the OpenCL messages) goes to standard error.

    In the package "Gscan2pdf::Tesseract", in the subroutine "language", the split command tests whether $err has content and, if so, uses it. With OpenCL enabled, $err holds the OpenCL messages. If the command uses $out instead, it gets the language list.

    So I swapped the variables in my local copy. The line now reads
    @codes = split /\n/xsm, $out ? $out : $err;
    and tesseract now recognises the scan for me. An error box still pops up showing the contents of standard error, including the OpenCL output, but tesseract prints the same output on the command line.
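
    The stream separation described in this comment can be sketched in Python as follows. This is an illustration only, not the Perl code in Gscan2pdf::Tesseract: the function name `list_langs` and its `cmd` parameter are hypothetical, and the fallback to standard error roughly mirrors the `$out ? $out : $err` expression above.

```python
import subprocess


def list_langs(cmd=("tesseract", "--list-langs")):
    """Run `cmd` and return language codes, preferring stdout over stderr."""
    proc = subprocess.run(list(cmd), capture_output=True, text=True, check=False)
    # Use standard output when it has content; otherwise fall back to
    # standard error, as the quoted Perl line does.
    text = proc.stdout if proc.stdout.strip() else proc.stderr
    codes = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages (N):" header.
        if not line or line.startswith("List of available languages"):
            continue
        codes.append(line)
    return codes
```

    Because the two streams are captured separately, the OpenCL messages on standard error never reach the language list as long as standard output is non-empty.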

     

    Last edit: Uwe Brinkhoff 2021-05-06
  • Jeffrey Ratcliffe

    The fix is in the attachment above.

    But it was in essence the same as your suggestion. Thanks!

     
