#1 Lithuanian symbols are not recognized

Milestone: v1.0 (example)

Status: closed

Owner: nobody

Labels: None

Priority: 5

Updated: 2016-08-08

Created: 2013-03-02

Creator: Donatas G.

Private: No

I tried to use the "-lang lit" option, but the special Lithuanian characters ąžūęėų are not recognized by the engine. The Tesseract Lithuanian package works, so the problem must be elsewhere.

Here is the command I used:

pdfsandwich -lang lit RL2012_opt_pix_213+sketch.pdf

Discussion

Donatas G. - 2013-03-02

This is a file processed with pdfsandwich:

RL2012_opt_pix_213+sketch_ocr.pdf

Donatas G. - 2013-03-23

Another interesting thing: the temporary files produced when running the above command on the above PDF file include normal Lithuanian HTML (hOCR) files, and these do contain the Lithuanian characters.
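
One way to confirm that observation is to check an hOCR file for the Lithuanian letters directly. The following is a minimal sketch using only the Python standard library. The names TextCollector and lithuanian_letters_found are hypothetical, and the sample markup is made up for illustration; it is not taken from the attached file.

```python
from html.parser import HTMLParser

# The nine Lithuanian letters that carry diacritics (lowercase).
LITHUANIAN_LETTERS = set("ąčęėįšųūž")


class TextCollector(HTMLParser):
    """Collects the text content of an HTML/hOCR document, ignoring tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def lithuanian_letters_found(hocr_markup):
    """Return the set of Lithuanian diacritic letters present in the markup's text."""
    collector = TextCollector()
    collector.feed(hocr_markup)
    collector.close()  # flush any buffered text
    text = "".join(collector.chunks).lower()
    return LITHUANIAN_LETTERS & set(text)


# Made-up hOCR fragment, for illustration only.
sample = '<span class="ocrx_word">ąžuolas</span> <span class="ocrx_word">ėjo</span>'
print(sorted(lithuanian_letters_found(sample)))
```

If the set is non-empty for the temporary hOCR files but the text extracted from the final PDF contains none of these letters, the characters are being lost after the OCR step.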

Tobias Elze - 2015-07-09

Does this still happen with the latest version of pdfsandwich?

Donatas G. - 2015-07-13

Hi, yes, I have just tested the newest deb you provide, and it still does not recognize the characters mentioned in the bug report. Spaces are there instead of those characters.

glupender - 2015-10-02

I have the exact same problem with Cyrillic characters. I tried various versions of Ghostscript (8 and 9) with no luck.

Tobias Elze - 2015-10-13

Hi Donatas,

Initially, I assumed a ghostscript bug, but actually tesseract is to blame. Here is a detailed description:

http://bugs.ghostscript.com/show_bug.cgi?id=695869

In short, tesseract messes things up by using unconventional encodings. If a PDF file was generated by tesseract, in most cases it can be viewed by PDF viewers and text can be extracted. However, if it is processed by other software, such as ghostscript, the broken fonts may easily mess up the resulting output files, which is what happens in your case.

That's quite a big problem, which is obviously beyond what I can fix. I hope that tesseract can solve it. Let me see whether I can find a temporary solution; right now I don't know of one.

Donatas G. - 2015-10-14

There is another program that, with certain parameters, does what this program does, and it embeds the font in a way that does not lose the Lithuanian characters. Maybe some clues could be taken from it? The program is k2pdfopt, and the command to OCR a file and embed the characters is:
k2pdfopt -mode copy -o %s_ocr -ocrlang lit -ocr t -as -odpi 300 file-to-ocr.pdf
(You might have to set the variable TESSDATA_PREFIX using
export TESSDATA_PREFIX=/path/to/tessdata/parent/folder
before running it.)
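
For anyone who wants to script the command quoted above, here is a minimal sketch in Python. It only assembles the argument list and the environment from the values given in this comment; the helper names build_k2pdfopt_command and build_environment are hypothetical, and the tessdata path is the same placeholder as above.

```python
import os


def build_k2pdfopt_command(input_pdf, lang="lit"):
    """Assemble the k2pdfopt invocation quoted in this thread as an argument list."""
    return [
        "k2pdfopt",
        "-mode", "copy",
        "-o", "%s_ocr",
        "-ocrlang", lang,
        "-ocr", "t",
        "-as",
        "-odpi", "300",
        input_pdf,
    ]


def build_environment(tessdata_parent=None):
    """Copy the current environment, optionally setting TESSDATA_PREFIX."""
    env = dict(os.environ)
    if tessdata_parent is not None:
        env["TESSDATA_PREFIX"] = tessdata_parent
    return env


command = build_k2pdfopt_command("file-to-ocr.pdf")
env = build_environment("/path/to/tessdata/parent/folder")  # placeholder path
print(" ".join(command))

# To actually run it (assumes k2pdfopt is installed and on PATH):
# import subprocess
# subprocess.run(command, env=env, check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names such as RL2012_opt_pix_213+sketch.pdf.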

Tobias Elze - 2015-10-14

Thanks, I didn't know this one yet. Seems these guys completely re-compose the pdfs, which might be kind of overkill to solve a font problem. But I'll have a look at it.

The reason why I currently need ghostscript is to work around another tesseract bug: tesseract often messes up the page size, so I read out the original page size before running tesseract and, after tesseract, re-create the PDF with the original page size using ghostscript. If we skipped the latter step, we could easily replace ghostscript, but then we would have to live with PDFs that are typically oversized.

Anyway, it's actually not a ghostscript bug. It's tesseract which messes up the PDF, in two ways even: page size and font. They are working on it. I fixed the page size problem using ghostscript, as described above. Let me see if I can find a temporary workaround for the font issue as well.

Tobias Elze - 2016-08-05

Ghostscript has been replaced as of version 0.1.5. Does that solve the problem?

Donatas G. - 2016-08-08

Yes, the Lithuanian characters are now recognized. Thank you.

Tobias Elze - 2016-08-08

status: open --> closed

Group: --> v1.0 (example)