#1 Lithuanian symbols are not recognized

Milestone: v1.0 (example)
Status: closed
Owner: nobody
Labels: None
Priority: 5
Updated: 2016-08-08
Created: 2013-03-02
Creator: Donatas G.
Private: No

I tried to use the "-lang lit" option, but the special Lithuanian characters ąžūęėų are not recognized by the engine. The Tesseract Lithuanian package works, so the problem must be elsewhere.

Here is the command I used:

pdfsandwich -lang lit RL2012_opt_pix_213+sketch.pdf

Discussion

  • Donatas G.

    Donatas G. - 2013-03-02

    This is a file processed with pdfsandwich

     
  • Donatas G.

    Donatas G. - 2013-03-23

    Another interesting thing: running the above command on the above pdf file does produce normal temporary Lithuanian html (hocr) files that contain the Lithuanian characters.

     
  • Tobias Elze

    Tobias Elze - 2015-07-09

    Does this still happen with the latest version of pdfsandwich?

     
  • Donatas G.

    Donatas G. - 2015-07-13

    Hi, yes, I have just tested the newest deb you provide, and it still does not recognize the characters mentioned in the bug report. Spaces are there instead of those characters.
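
One way to make the symptom above reproducible is to compare a known line of text with whatever a text extractor returns for the OCRed pdf. The following is only a minimal sketch: the function name and the sample strings are hypothetical, and it assumes the extracted text has already been read into a Python string by some external tool.

```python
# Lithuanian letters with diacritics (lowercase); uppercase forms are derived below.
LITHUANIAN_LOWER = "ąčęėįšųūž"


def missing_lithuanian_chars(expected: str, extracted: str) -> set:
    """Return the Lithuanian diacritic letters that occur in `expected`
    but do not occur anywhere in `extracted`."""
    special = set(LITHUANIAN_LOWER + LITHUANIAN_LOWER.upper())
    return {ch for ch in expected if ch in special and ch not in extracted}


if __name__ == "__main__":
    # Hypothetical sample: the diacritics were replaced by spaces in the output.
    print(missing_lithuanian_chars("ąžuolas žalias", "  uolas  alias"))
```

An empty result means every Lithuanian letter of the reference text survived somewhere in the extracted text; it does not prove that each one is in the right position.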

     
  • glupender

    glupender - 2015-10-02

    I have the exact same problem with Cyrillic characters. I tried various versions of ghostscript (8/9), no luck.

     
  • Tobias Elze

    Tobias Elze - 2015-10-13

    Hi Donatas,

    Initially, I assumed a ghostscript bug, but actually tesseract is to blame. Here is a detailed description:

    http://bugs.ghostscript.com/show_bug.cgi?id=695869

    In short, tesseract messes things up with unconventional encodings. If a pdf file was generated by tesseract, in most cases it can be viewed by pdf viewers and its text can be extracted. However, if it is processed by other software, such as ghostscript, the broken fonts can easily corrupt the resulting output files, which is what happens in your case.

    That's quite a big problem, and obviously beyond what I can fix. I hope that tesseract can solve it. Let me see if I can find a temporary solution; right now I don't know of one.

     
  • Donatas G.

    Donatas G. - 2015-10-14

    There is another program that, with certain parameters, does what this program does, and it embeds the font in a way that the Lithuanian characters are not lost. Maybe some clues could be taken from it? The program is k2pdfopt, and the command to OCR a file and embed the characters is:
    k2pdfopt -mode copy -o %s_ocr -ocrlang lit -ocr t -as -odpi 300 file-to-ocr.pdf
    (you might have to specify the variable TESSDATA_PREFIX using
    export TESSDATA_PREFIX=/path/to/tessdata/parent/folder)

     
  • Tobias Elze

    Tobias Elze - 2015-10-14

    Thanks, I didn't know this one yet. Seems these guys completely re-compose the pdfs, which might be kind of overkill to solve a font problem. But I'll have a look at it.

    The reason why I currently need ghostscript is to work around another tesseract bug: tesseract often messes up the page size, so I read out the original page size before running tesseract and, after tesseract, re-create the pdf with the original page size using ghostscript. If we skipped the latter step, we could easily replace ghostscript. But then we would have to live with typically oversized pdfs.

    Anyway, it's actually not a ghostscript bug. It's tesseract which messes up the pdf, even in two ways: page size and font. They are working on it. I fixed the page size problem using ghostscript, as described above. Let me see if I can find a temporary workaround for the font issue as well.
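
The page-size workaround described above comes down to remembering the original page dimensions and rescaling the OCR output back to them. The sketch below only illustrates that arithmetic; it is not pdfsandwich's actual code, the function name is hypothetical, and page sizes are assumed to be given as (width, height) pairs in PDF points.

```python
def scale_to_original(original_size, ocr_size):
    """Return the (sx, sy) factors that map the OCR output page size
    back to the original page size. Both sizes are (width, height)."""
    orig_w, orig_h = original_size
    ocr_w, ocr_h = ocr_size
    if ocr_w <= 0 or ocr_h <= 0:
        raise ValueError("OCR page size must be positive")
    return orig_w / ocr_w, orig_h / ocr_h


if __name__ == "__main__":
    # Hypothetical example: a 595 x 842 pt page that came back twice as large.
    print(scale_to_original((595, 842), (1190, 1684)))
```

The resulting factors would then be handed to whichever tool rewrites the pdf; in the thread that role is played by ghostscript.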

     
  • Tobias Elze

    Tobias Elze - 2016-08-05

    Ghostscript is replaced in version 0.1.5 now. Does that solve the problem?

     
    • Donatas G.

      Donatas G. - 2016-08-08

      Yes, the Lithuanian chars are now recognized. Thank you.

       
  • Tobias Elze

    Tobias Elze - 2016-08-08
    • status: open --> closed
    • Group: --> v1.0 (example)
     
