Hello everyone,
I use gscan2pdf in conjunction with Tesseract. I find the character recognition pretty good overall (though it won't beat ABBYY FineReader...). The problem is that, most of the time, a carriage return is added after each word. Here's an example:
Conformément
à l’article
12 du
Règlement du Fonds, le Fonds a procédé à sa deuxième
distribution.
Ce deuxième
remboursement
de capital
s'élève
à
un
montant
de
5.74
€
par part,
soit
5,74 % du nominal
investi.
However, if I run OCR with Tesseract directly from the command line:
tesseract test.tif test.txt -l fra
the result is what I expected, and it is excellent:
Conformément à l’article 12 du Règlement du Fonds, le Fonds a procédé à sa deuxième distribution.
Ce deuxième remboursement de capital s'élève à un montant de 5.74 € par part, soit 5,74 % du nominalinvesti.
Here is some information about my configuration:
pascal@pascal-Latitude-5580:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 24.04.3 LTS
Release: 24.04
Codename: noble
pascal@pascal-Latitude-5580:~$ tesseract --version
tesseract 5.3.4
leptonica-1.82.0
libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 2.1.5) : libpng 1.6.43 : libtiff 4.5.1 : zlib 1.3 : libwebp 1.3.2 : libopenjp2 2.5.0
Found AVX2
Found AVX
Found FMA
Found SSE4.1
Found OpenMP 201511
Found libarchive 3.7.2 zlib/1.3 liblzma/5.4.5 bz2lib/1.0.8 liblz4/1.9.4 libzstd/1.5.5
Found libcurl/8.5.0 OpenSSL/3.0.13 zlib/1.3 brotli/1.1.0 zstd/1.5.5 libidn2/2.3.7 libpsl/0.21.2 (+libidn2/2.3.7) libssh/0.10.6/openssl/zlib nghttp2/1.59.0 librtmp/2.3 OpenLDAP/2.6.7
The gscan2pdf version is 2.13.4.
Do you have any idea what's going on?
Thank you in advance for any help you can provide,
Pascal
Hello,
Is nobody here?
Apologies for the lack of response.
gscan2pdf also offers a "Save as text" option. Does that do a better job?
Hello Jeffrey,
Thanks a lot for your reply. No worries :)
As it happens, I just scanned a batch of documents and tried "Save as text". Indeed, the result is great compared with the same document saved as PDF. Very weird...
Glad you could find a solution
Thanks, but it's not a solution for me. Ideally, when I select all the text in a PDF and paste it into another document, it would keep the same line breaks...
OK. I misunderstood. But it is going to be difficult for me to influence how Okular formats text it places into the clipboard.
Hello,
IMHO, the problem doesn't come from Okular.
Going back to my example, if you look at the OCR tab in gscan2pdf:
https://2plz.fr/lutim/gallery#TggvwUqP/DKILVETY.png
Almost every word is in its own separate box, which produces a line feed after each one.
This is the text you get if you copy and paste from the generated PDF:
It is definitely not the same as the raw text output, which is close to the original.
Last edit: Pascal 2025-11-02
Hello,
To be clearer, I think it is better to provide small test files, so here they are. I wrote a short text in Writer and exported it to PDF: Test_gscan2pdf.pdf
Then I converted it to a JPG file (with GIMP): Test_gscan2pdf.jpg
Then I imported it into gscan2pdf, ran OCR, and exported it to PDF again: Pascal06_Test_gscan2pdf.odt_2025-11-05.pdf
Next, open the original PDF (Test_gscan2pdf.pdf), select the first paragraph, and paste it here:
Finally, open the generated PDF (Pascal06_Test_gscan2pdf.odt_2025-11-05.pdf), select the first paragraph, and paste it here:
IMHO, the two texts should be identical...
OK. But it is Okular that is inserting the CR characters, not gscan2pdf.
The difference is the formatting. When Writer created the PDF, it created a single text box. Okular can see that it is all one text box, and gives you the text you expect. This was lost when converting to JPG. OCR created a box per word, in order to get the word positions correct. OCR does not give much hint of the fonts used, so these must be guessed.
It would be possible to embed the text in the PDF differently, but then the positions would be wrong.
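To illustrate the "box per word" point, here is a minimal Python sketch. It assumes the OCR data is available in Tesseract's TSV output (for example from `tesseract test.tif test -l fra tsv`), where each word row carries its own position plus page, block, paragraph, and line numbers. The sample rows and coordinates are made up, and `lines_from_tsv` is a hypothetical helper, not anything gscan2pdf or Okular actually does. It only shows that the line structure can be rebuilt from word-level data when the line numbers are kept.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in Tesseract's TSV column layout (coordinates are invented).
SAMPLE_TSV = "\n".join([
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\t"
    "left\ttop\twidth\theight\tconf\ttext",
    "5\t1\t1\t1\t1\t1\t10\t10\t120\t20\t96\tConformément",
    "5\t1\t1\t1\t1\t2\t140\t10\t10\t20\t95\tà",
    "5\t1\t1\t1\t1\t3\t160\t10\t70\t20\t94\tl’article",
    "5\t1\t1\t1\t2\t1\t10\t40\t110\t20\t93\tdistribution.",
])


def lines_from_tsv(tsv_text):
    """Join word-level rows back into text lines using the TSV line ids."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    lines = defaultdict(list)
    for row in reader:
        text = (row.get("text") or "").strip()
        # Level 5 rows are individual words; skip structural rows and blanks.
        if row["level"] != "5" or not text:
            continue
        key = tuple(
            int(row[name])
            for name in ("page_num", "block_num", "par_num", "line_num")
        )
        lines[key].append((int(row["word_num"]), text))
    # Order lines by their ids, and words by their position within the line.
    return [
        " ".join(word for _, word in sorted(words))
        for _, words in sorted(lines.items())
    ]


if __name__ == "__main__":
    for line in lines_from_tsv(SAMPLE_TSV):
        print(line)
```

A PDF viewer that only sees separately positioned word boxes has no such line ids to rely on, so it has to guess line membership from geometry, which is where the extra line feeds can creep in.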