
#70 gscan2pdf adds a CR after each word when using Tesseract

Milestone: v1.0_(example)
Status: closed
Owner: nobody
Labels: tesseract (2)
Priority: 5
Updated: 2025-11-07
Created: 2025-09-17
Creator: Pascal
Private: No

Hello everyone,
I use gscan2pdf in conjunction with Tesseract. I find the character recognition to be pretty good overall (though it is not as good as ABBYY FineReader). The problem is that, most of the time, a carriage return is added after each word. Here's an example:

However, if I run OCR with Tesseract directly from the command line:
tesseract test.tif test.txt -l fra
the result is what I expected, and it is excellent:

Conformément à l’article 12 du Règlement du Fonds, le Fonds a procédé à sa deuxième distribution.
Ce deuxième remboursement de capital s'élève à un montant de 5.74 € par part, soit 5,74 % du nominal

investi.

Here is some information about my configuration:

pascal@pascal-Latitude-5580:~$ lsb_release -a 
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 24.04.3 LTS 
Release: 24.04 
Codename: noble
pascal@pascal-Latitude-5580:~$ tesseract --version
tesseract 5.3.4
 leptonica-1.82.0
  libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 2.1.5) : libpng 1.6.43 : libtiff 4.5.1 : zlib 1.3 : libwebp 1.3.2 : libopenjp2 2.5.0
 Found AVX2
 Found AVX
 Found FMA
 Found SSE4.1
 Found OpenMP 201511
 Found libarchive 3.7.2 zlib/1.3 liblzma/5.4.5 bz2lib/1.0.8 liblz4/1.9.4 libzstd/1.5.5
 Found libcurl/8.5.0 OpenSSL/3.0.13 zlib/1.3 brotli/1.1.0 zstd/1.5.5 libidn2/2.3.7 libpsl/0.21.2 (+libidn2/2.3.7) libssh/0.10.6/openssl/zlib nghttp2/1.59.0 librtmp/2.3 OpenLDAP/2.6.7

The gscan2pdf version is 2.13.4.
Do you have any idea what's going on?
Thank you in advance for any help you can provide,
Pascal

Discussion

  • Pascal

    Pascal - 2025-10-21

    Hello,
    is nobody here?

     
  • Jeffrey Ratcliffe

    Apologies for the lack of response.

    gscan2pdf also offers a "Save as text" option. Does that do a better job?

     
    • Pascal

      Pascal - 2025-11-01

      Hello Jeffrey,
      thanks a lot for your reply. No worries :)
      As it happens, I have just scanned a batch of documents and tried "Save as text". Indeed, the result is much better than for the same document saved as PDF. Very weird...

       
  • Jeffrey Ratcliffe

    • status: open --> closed
     
  • Jeffrey Ratcliffe

    Glad you could find a solution.

     
  • Pascal

    Pascal - 2025-11-01

    Thanks, but it's not a solution for me. Ideally, when I select all the text in a PDF and paste it into another document, it would keep the same number of line feeds.

     
  • Jeffrey Ratcliffe

    OK. I misunderstood. But it is going to be difficult for me to influence how Okular formats text it places into the clipboard.

     
  • Pascal

    Pascal - 2025-11-02

    Hello,
    IMHO, the problem doesn't come from Okular.
    Following my example, if you look at the OCR tab in gscan2pdf:
    https://2plz.fr/lutim/gallery#TggvwUqP/DKILVETY.png
    Almost every word is "separated", which gives a line feed after each one.
    This is the text you get if you copy and paste from the generated PDF:

    Conformément
    à l’article
    12 du
    Règlement du Fonds, le Fonds a procédé à sa deuxième
    distribution.
    Ce deuxième
    remboursement
    de capital
    s'élève
    à
    un
    montant
    de
    5.74
    €
    par part,
    soit
    5,74 % du nominal
    investi.
    

    It is definitely not the same as the raw text output, which is close to the original.

     

    Last edit: Pascal 2025-11-02
  • Pascal

    Pascal - 2025-11-05

    Hello,
    to be clearer, I think it is better to provide small test files, so here they are. I wrote a short text in Writer and exported it to PDF: Test_gscan2pdf.pdf
    Then I converted it to a JPG file (with GIMP): Test_gscan2pdf.jpg
    Then I imported it into gscan2pdf, ran OCR, and exported it to PDF again: Pascal06_Test_gscan2pdf.odt_2025-11-05.pdf
    Now open the original PDF (Test_gscan2pdf.pdf), select the first paragraph and paste it here:

    “A Hare one day ridiculed the short feet and slow pace of the Tortoise, who replied, laughing:
    “Though you be swift as the wind, I will beat you in a race.”
    

    Finally, open the generated PDF (Pascal06_Test_gscan2pdf.odt_2025-11-05.pdf), select the first paragraph and paste it here:

    “A Hare
     one
     day ridiculed
     the short
     feet and slow
     pace
     of the Tortoise,
     who
     replied, laughing:
    “Though you be swift
     as
     the wind, I will
     beat
     you
     in
     a
     race.”
    

    IMHO, the two texts should be identical...
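
A quick way to check that the two pasted texts differ only in their line breaks is to collapse all whitespace and compare. This is a minimal sketch in Python; the helper name `normalize_whitespace` is made up for illustration, and the two strings are simply the pasted samples from this post.

```python
def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace (spaces, line breaks) into one space."""
    return " ".join(text.split())


# Text pasted from the original PDF exported by Writer.
original = """“A Hare one day ridiculed the short feet and slow pace of the Tortoise, who replied, laughing:
“Though you be swift as the wind, I will beat you in a race.”"""

# Text pasted from the PDF generated after OCR.
ocr_copy = """“A Hare
 one
 day ridiculed
 the short
 feet and slow
 pace
 of the Tortoise,
 who
 replied, laughing:
“Though you be swift
 as
 the wind, I will
 beat
 you
 in
 a
 race.”"""

# True when the recognized words match and only the line breaks differ.
same_words = normalize_whitespace(original) == normalize_whitespace(ocr_copy)
print(same_words)
```

If the comparison holds, the recognition itself is fine and only the line structure differs. Note that this also removes the legitimate break between the two sentences, so it is a diagnostic rather than a fix.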

     
  • Jeffrey Ratcliffe

    OK. But it is Okular that is inserting the CR characters, not gscan2pdf.

    The difference is the formatting. When Writer created the PDF, it created a single text box. Okular can see that it is all one text box, and gives you the text you expect. This was lost when converting to JPG. OCR created a box per word, in order to get the word positions correct. OCR does not give much hint of the fonts used, so these must be guessed.

    It would be possible to embed the text in the PDF differently, but then the positions would be wrong.
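
To illustrate the one-box-per-word layout described above, here is a minimal Python sketch of how separate word boxes could be merged back into lines by comparing their vertical positions. It is not how gscan2pdf or Okular works internally; `WordBox`, `group_into_lines` and the coordinates are hypothetical and only serve the example.

```python
from dataclasses import dataclass


@dataclass
class WordBox:
    """One OCR word with its bounding box (hypothetical pixel coordinates)."""
    text: str
    x0: float  # left
    y0: float  # top
    x1: float  # right
    y1: float  # bottom


def group_into_lines(words: list[WordBox]) -> list[str]:
    """Merge word boxes into text lines.

    A word joins the current line when its vertical centre falls inside
    the vertical extent of the words already on that line.
    """
    lines: list[list[WordBox]] = []
    for word in sorted(words, key=lambda w: (w.y0 + w.y1) / 2):
        centre = (word.y0 + word.y1) / 2
        if lines:
            current = lines[-1]
            top = min(w.y0 for w in current)
            bottom = max(w.y1 for w in current)
            if top <= centre <= bottom:
                current.append(word)
                continue
        lines.append([word])
    # Within each line, order the words from left to right.
    return [
        " ".join(w.text for w in sorted(line, key=lambda w: w.x0))
        for line in lines
    ]


# Hypothetical boxes for the start of the test paragraph, given out of order.
boxes = [
    WordBox("Hare", 60, 10, 100, 30),
    WordBox("“A", 10, 11, 50, 30),
    WordBox("one", 110, 14, 150, 30),
    WordBox("“Though", 10, 40, 90, 60),
    WordBox("you", 100, 44, 140, 60),
]
lines = group_into_lines(boxes)
print("\n".join(lines))
```

A heuristic like this only works for simple single-column pages; skewed scans or multi-column layouts would need more careful handling.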

     
