gscan2pdf add CR after each word using Tesseract

Brought to you by: ra28145

#70 gscan2pdf add CR after each word using Tesseract

Milestone: v1.0_(example)

Status: closed

Owner: nobody

Labels: tesseract (2)

Priority: 5

Updated: 2025-11-07

Created: 2025-09-17

Creator: Pascal

Private: No

Hello everyone,
I use gscan2pdf in conjunction with Tesseract. I find the character recognition to be pretty good overall (though it won't beat ABBYY FineReader...). The problem is that, most of the time, a carriage return is added after each word. Here's an example:

after scanning in gscan2pdf
https://2plz.fr/lutim/tQNQcOze/1PeStm1c.png
after recognition by Tesseract ("Text Layer" tab)
https://2plz.fr/lutim/093FswI6/7cq6KpWJ.png
after exporting to PDF, opening in Okular, then selecting the text:
https://2plz.fr/lutim/1Jbpk4qG/qwvBIrKB.png
after copying the text in Okular, pasting it into a text editor:

Conformément
à l’article
12 du
Règlement du Fonds, le Fonds a procédé à sa deuxième
distribution.
Ce deuxième
remboursement
de capital
s'élève
à
un
montant
de
5.74
€
par part,
soit
5,74 % du nominal
investi.

However, if I run OCR with Tesseract directly from the command line:
tesseract test.tif test.txt -l fra
the result is what I expected, and it is excellent:

Conformément à l’article 12 du Règlement du Fonds, le Fonds a procédé à sa deuxième distribution.
Ce deuxième remboursement de capital s'élève à un montant de 5.74 € par part, soit 5,74 % du nominal

investi.
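
To compare the two kinds of output on the same image, Tesseract can be asked for plain text or for hOCR, which records a bounding box for every recognised word. The following is a minimal sketch, assuming `tesseract` is on the PATH; the helper names `tesseract_command` and `run_ocr` are hypothetical, and whether gscan2pdf invokes Tesseract in exactly this way is an assumption. Note that the second argument is an output base name, to which Tesseract appends the extension itself.

```python
import shutil
import subprocess


def tesseract_command(image, output_base, lang="fra", hocr=False):
    """Build the argument list for a Tesseract run (hypothetical helper).

    With hocr=False Tesseract writes <output_base>.txt (plain text);
    with hocr=True it writes <output_base>.hocr (per-word bounding boxes).
    """
    cmd = ["tesseract", image, output_base, "-l", lang]
    if hocr:
        cmd.append("hocr")  # standard Tesseract config name for hOCR output
    return cmd


def run_ocr(image, output_base, lang="fra", hocr=False):
    """Run Tesseract and return the name of the file it should have written."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract not found on PATH")
    subprocess.run(tesseract_command(image, output_base, lang, hocr), check=True)
    return output_base + (".hocr" if hocr else ".txt")


if __name__ == "__main__":
    # Print the two commands without running them.
    print(" ".join(tesseract_command("test.tif", "test")))
    print(" ".join(tesseract_command("test.tif", "test", hocr=True)))
```

Comparing the `.txt` file with the `.hocr` file for the same scan shows the same words in both, but only the hOCR file carries the word positions that end up in a PDF text layer.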

Here is some information about my configuration:

pascal@pascal-Latitude-5580:~$ lsb_release -a 
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 24.04.3 LTS 
Release: 24.04 
Codename: noble

pascal@pascal-Latitude-5580:~$ tesseract --version
tesseract 5.3.4
 leptonica-1.82.0
  libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 2.1.5) : libpng 1.6.43 : libtiff 4.5.1 : zlib 1.3 : libwebp 1.3.2 : libopenjp2 2.5.0
 Found AVX2
 Found AVX
 Found FMA
 Found SSE4.1
 Found OpenMP 201511
 Found libarchive 3.7.2 zlib/1.3 liblzma/5.4.5 bz2lib/1.0.8 liblz4/1.9.4 libzstd/1.5.5
 Found libcurl/8.5.0 OpenSSL/3.0.13 zlib/1.3 brotli/1.1.0 zstd/1.5.5 libidn2/2.3.7 libpsl/0.21.2 (+libidn2/2.3.7) libssh/0.10.6/openssl/zlib nghttp2/1.59.0 librtmp/2.3 OpenLDAP/2.6.7

The gscan2pdf version is 2.13.4.
Do you have any idea what's going on?
Thank you in advance for any help you can provide,
Pascal

Discussion

Pascal - 2025-10-21

Hello,
is nobody here?


Jeffrey Ratcliffe - 2025-10-21

Apologies for the lack of response.

gscan2pdf also offers a "Save as text" option. Does that do a better job?

Pascal - 2025-11-01

Hello Jeffrey,
thanks a lot for your reply. No worries :)
As it happens, I have just scanned a batch of documents and tried "Save as text". Indeed, the result is much better than for the same document saved as PDF. Very weird...

Jeffrey Ratcliffe - 2025-11-01

status: open --> closed

Jeffrey Ratcliffe - 2025-11-01

Glad you could find a solution


Pascal - 2025-11-01

Thanks, but that's not a solution for me. Ideally, when I select all the text in a PDF and paste it into another document, it would keep the same line feeds as the original...


Jeffrey Ratcliffe - 2025-11-02

OK. I misunderstood. But it is going to be difficult for me to influence how Okular formats text it places into the clipboard.


Pascal - 2025-11-02

Hello,
IMHO, the problem doesn't come from Okular.
Following my given example, if you look into the gscan2pdf OCR recognition tab:
https://2plz.fr/lutim/gallery#TggvwUqP/DKILVETY.png
Almost every word is "separated", which gives a line feed after each of them.
This is the text you get if you copy and paste from the generated PDF:

Conformément à l’article 12 du Règlement du Fonds, le Fonds a procédé à sa deuxième distribution. Ce deuxième remboursement de capital s'élève à un montant de 5.74 € par part, soit 5,74 % du nominal investi.

This is definitely not the same as the raw text output, which is close to the original.

Last edit: Pascal 2025-11-02

Pascal - 2025-11-05

Hello,
to be clearer, I think it is better to provide small test files, so here they are. I wrote a short text in Writer and exported it to PDF: Test_gscan2pdf.pdf
Then I converted it to a JPG file (with GIMP): Test_gscan2pdf.jpg
Then I imported it into gscan2pdf, ran OCR, and exported it to PDF again: Pascal06_Test_gscan2pdf.odt_2025-11-05.pdf
Now open the original PDF (Test_gscan2pdf.pdf), select the first paragraph and paste it here:

“A Hare one day ridiculed the short feet and slow pace of the Tortoise, who replied, laughing: “Though you be swift as the wind, I will beat you in a race.”

Finally, open the generated PDF (Pascal06_Test_gscan2pdf.odt_2025-11-05.pdf), select the first paragraph and paste it here:

“A Hare one day ridiculed the short feet and slow pace of the Tortoise, who replied, laughing: “Though you be swift as the wind, I will beat you in a race.”

IMHO, the two texts should be identical...

Pascal06_Test_gscan2pdf.odt_2025-11-05.pdf

Test_gscan2pdf.jpg

Test_gscan2pdf.pdf

Jeffrey Ratcliffe - 2025-11-07

OK. But it is Okular that is inserting the CR characters, not gscan2pdf.

The difference is the formatting. When Writer created the PDF, it created a single text box. Okular can see that it is all one text box, and gives you the text you expect. This was lost when converting to JPG. OCR created a box per word, in order to get the word positions correct. OCR does not give much hint of the fonts used, so these must be guessed.

It would be possible to embed the text in the PDF differently, but then the positions would be wrong.
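
To illustrate why the layout of the boxes matters, here is a minimal sketch of how a reader could rebuild lines from per-word boxes by grouping words whose vertical extents overlap. It is not Okular's or gscan2pdf's actual algorithm; the `WordBox` type, the `group_into_lines` function, the 50 % overlap threshold and the sample coordinates are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class WordBox:
    """One OCR word with its bounding box (hypothetical type, y grows downwards)."""
    text: str
    x0: float
    y0: float  # top edge
    x1: float
    y1: float  # bottom edge


def group_into_lines(words, overlap=0.5):
    """Group words into text lines by vertical overlap, then order left to right."""
    lines = []
    for w in sorted(words, key=lambda w: (w.y0, w.x0)):
        for line in lines:
            ref = line[0]  # compare against the first word of the line
            shared = min(ref.y1, w.y1) - max(ref.y0, w.y0)
            shortest = min(ref.y1 - ref.y0, w.y1 - w.y0)
            if shared >= overlap * shortest:
                line.append(w)
                break
        else:
            lines.append([w])  # no existing line overlaps enough: start a new one
    return [" ".join(w.text for w in sorted(line, key=lambda w: w.x0))
            for line in lines]


# Made-up coordinates: three words on one line, two on the next.
WORDS = [
    WordBox("Conformément", 10, 10, 110, 30),
    WordBox("à", 120, 12, 130, 30),
    WordBox("l'article", 140, 10, 200, 30),
    WordBox("Ce", 10, 40, 30, 60),
    WordBox("deuxième", 40, 41, 110, 60),
]

if __name__ == "__main__":
    # One box per line, as in the pasted text from the ticket:
    print("\n".join(w.text for w in WORDS))
    # Boxes regrouped by position:
    print("\n".join(group_into_lines(WORDS)))
```

With these sample boxes, joining one box per line gives five lines, while the grouping gives two. Whether a given PDF viewer applies this kind of regrouping when copying text is up to the viewer.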
