#199 Enhancement: add new "pdf" as default option when using Tesseract OCR engine

Milestone: v1.0_(example)
Status: open
Owner: nobody
Labels: None
Priority: 5
Updated: 2015-08-04
Created: 2015-06-23
Creator: Wikinaut
Private: No

The latest Tesseract version (built from the sources, see https://code.google.com/p/tesseract-ocr/wiki/Compiling ) can directly create mixed-mode PDF output.

For details, see https://code.google.com/p/tesseract-ocr/wiki/FAQ#How_do_I_produce_searchable_PDF_output?

You only have to add "pdf" as an output format option. I have been using this for more than a year.

I suggest adding an option to gscan2pdf (default=on) for use with Tesseract, so that mixed-mode PDFs can be created directly.

Remark: Tesseract can create multiple file types in one go (e.g. text, hOCR, pdf)!

Let me know if you need more details, or contact the Tesseract staff through https://code.google.com/p/tesseract-ocr/issues/list but I don't think that this will be necessary.

Related

Bugs: #199

Discussion

  • Wikinaut

    Wikinaut - 2015-06-23

    To clarify: mixed-mode PDF == PDF containing the original input image AND the ocr-ed-text layer

     
  • Jeffrey Ratcliffe

    On 24 June 2015 at 00:17, Wikinaut wikinaut@users.sf.net wrote:

    [bugs:#199] Enhancement: add new "pdf" as default option when using
    Tesseract OCR engine

    By default, I personally use djvu because the files are smaller.

    What is the advantage of tesseract creating the PDF? How does
    tesseract select the image format and compression?

     


  • Wikinaut

    Wikinaut - 2015-06-24

    Tesseract is the OCR engine used by Google Books (according to information leaked on the internet), and it is actively maintained: I recently submitted some bug reports, and the bugs were fixed quickly.

    The mixed-mode PDF output is perfect: Tesseract uses LZW compression so that the original(!) input image compression is preserved. I pointed the developers to a problem with their "best compression mode" detection algorithm, which was fixed about a year ago.

    If you wish, I can send you a scanned image and the OCR-processed mixed-mode output.

     
  • Wikinaut

    Wikinaut - 2015-07-10

    Hi, I would now like to continue working towards a solution for this issue. Basically, in /lib/Gscan2pdf/Tesseract.pm line 237

    I added the option switch "tessedit_create_pdf" for creating not only hocr but also pdf (Tesseract can create hocr- and pdf-output in a single invocation).

    $cmd =
    "echo tessedit_create_hocr 1 > hocr.config;tessedit_create_pdf 1 > pdf.config;tesseract $tif $path$name -l $options{language} +pdf.config +hocr.config;rm hocr.config;rm pdf.config";
    }
    

    but I cannot find the created .pdf file in the /tmp directory.

    Perhaps you, as the program author, can assist (what did I do wrong?).

     
    • Wikinaut

      Wikinaut - 2015-08-23

      If you want Tesseract to generate mixed-mode PDF pages (without changing the current gscan2pdf workflow), change line 237 in /lib/Gscan2pdf/Tesseract.pm to

       "echo -e 'tessedit_create_hocr 1\ntessedit_create_pdf 1' > hocr.config;tesseract $tif $path$name -l $options{language} +hocr.config;rm hocr.config";
      

      When scanning and OCRing with gscan2pdf, the Tesseract-generated .pdf (and also .txt) files will remain undeleted in /tmp . Please simply compare this pdf output with output of gscan2pdf.

      I plan to extend this in the future so that tesseract-generated multi-page PDFs will be created at almost no additional cost (as already mentioned elsewhere, tesseract can generate hocr, txt and pdf in one run).
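
The config-file approach in the changed line above (one `name value` pair per line, passed to tesseract with a leading `+`) could be sketched as follows. This is only an illustration in Python, not gscan2pdf's Perl code: the helper name `write_tesseract_config`, the file names and the language are assumptions, and the sketch builds the command without running tesseract.

```python
import os
import tempfile
from typing import Dict


def write_tesseract_config(params: Dict[str, str], directory: str) -> str:
    """Write one 'name value' line per parameter and return the file path."""
    fd, path = tempfile.mkstemp(suffix=".config", dir=directory, text=True)
    with os.fdopen(fd, "w") as handle:
        for name, value in params.items():
            handle.write(f"{name} {value}\n")
    return path


workdir = tempfile.mkdtemp()
config_path = write_tesseract_config(
    {"tessedit_create_hocr": "1", "tessedit_create_pdf": "1"}, workdir
)

# Read the file back to show what tesseract would be given.
with open(config_path) as handle:
    config_text = handle.read()

# Mirrors the '+hocr.config' argument from the post; page.tif is a placeholder.
cmd = ["tesseract", "page.tif", os.path.join(workdir, "page"),
       "-l", "eng", "+" + config_path]
print(config_text, end="")
print(" ".join(cmd))

# Clean up the temporary config file, as the original command does with rm.
os.remove(config_path)
os.rmdir(workdir)
```

Writing both parameters into a single file keeps the one-invocation behaviour described in the post, and the temporary file is removed afterwards just as the shell line removes hocr.config.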

       
  • Jeffrey Ratcliffe

    Please post the complete patch of your current changes and I will take a look.

     
  • Wikinaut

    Wikinaut - 2015-07-13

    This is the complete patch. It is only necessary to add "tessedit_create_pdf 1" as parameter (or to add the keyword "pdf" to the command line).

    However, I could not find the created double-layer pdf file in the tmp directory.

     
  • Wikinaut

    Wikinaut - 2015-07-29

    RE: /lib/Gscan2pdf/Tesseract.pm line 236

    Suggestion 1:

    I suggest you change line 236 to:

    $cmd = "tesseract $tif $path$name -l $options{language} -c tessedit_create_hocr=1 -c tessedit_create_pdf=1";

    Rationale:

    1. tesseract allows multiple "-c parameter=value" configurations, see tesseract --help
    2. this avoids the creation and deletion step of the temporary file for the parameters in the present gscan2pdf code
    3. the new additional "-c tessedit_create_pdf=1" creates single-page mixed-mode pdf files in the tmp folder, in addition to txt and hocr files, which are also created.
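
As a rough illustration of the rationale above, the suggested command could be assembled as an argument list with one `-c name=value` pair per parameter, which needs no temporary config file. This is a sketch in Python rather than gscan2pdf's Perl; the function name `build_tesseract_cmd` and the example paths are hypothetical, and it assumes a tesseract build that accepts repeated `-c` options as described in point 1.

```python
from typing import Dict, List


def build_tesseract_cmd(image: str, output_base: str, language: str,
                        params: Dict[str, str]) -> List[str]:
    """Build a tesseract argument list with one '-c name=value' pair per parameter."""
    cmd = ["tesseract", image, output_base, "-l", language]
    for name, value in params.items():
        cmd += ["-c", f"{name}={value}"]
    return cmd


# Placeholder image and output base; only the command line is built here.
cmd = build_tesseract_cmd(
    "page.tif", "/tmp/page", "eng",
    {"tessedit_create_hocr": "1", "tessedit_create_pdf": "1"},
)
print(" ".join(cmd))
```

A list like this can be handed to `subprocess.run(cmd, check=True)` without going through a shell, which also sidesteps quoting problems with unusual file names.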

    Perhaps you can find the time to confirm my observation and patch. In my view it's more elegant to let tesseract do the work of composing the mixed-mode pdfs.

    Suggestion 2:

    It would now be nice to have a new third window (in addition to the scan and to the hocr output) in Gscan2pdf GUI for rendering the mixed-mode PDF, and from where the user could cut text directly.

     

    Last edit: Wikinaut 2015-07-29
  • Jeffrey Ratcliffe

    Rendering a PDF is definitely non-trivial. Evince does this, and the "View files on saving" option opens evince (or whatever you have set as your PDF viewer), from which you can copy text.

    Without trying your suggestion, my immediate thoughts are:

    1. Letting tesseract create the PDF gives no control over the scan compression, and LZW is typically worse than PNG for 1-bit scans, or JPG for 8-bit scans.
    2. There is no possibility to downscale the image between the OCR and save steps.
    3. I was planning to introduce a background/text image split with different resolutions. This would not be possible via tesseract.
    4. No possibility to check or edit the OCR output before saving.

    Having said that, I'll take a look as soon as I have got the next release out.

     
  • Wikinaut

    Wikinaut - 2015-07-29

    Summary:

    • Thanks for your reply.
    • I think we can stop the discussion now, because I learned from your post that Gscan2pdf offers many more options and more flexibility, and using tesseract as the mixed-mode PDF builder would mean sacrificing that flexibility.
    • I informed you about the possibility of new tesseract versions, which could be integrated at a later time into Gscan2pdf - if this is needed. Currently, there appears to be no real need.

    Just two things in reply to your previous post:

    1. Please do not even think about using lossy[sic] compression (e.g. JPEG) when composing/re-creating a mixed-mode PDF. Lossy compression always introduces coding artefacts - I had the discussion with the tesseract people, and I don't want to go into that discussion again.

    Thus, JPEG (and any other lossy compression) is not an option. I understand that final PDF file sizes may matter, and that your program allows the flexibility to fine-tune and optimise the resulting file sizes. This is great.

    I just wanted to mention and try to find a way to fully use the built-in capability of tesseract, when used with "-c tessedit_create_pdf=1".

    2. I noticed yesterday that the present version of Gscan2PDF really appears to create mixed-mode PDFs like I need them. Can you please confirm this explicitly?

    Regarding my personal use cases, the main purpose of Gscan2Pdf is the scanning and archiving of paper documents. The needed output is PDF image with OCR, and recent tesseract does it very well.

    So what I suggest is to leave your hocr window as it is, but perhaps to find a way to display the tesseract-generated single PDF pages in a third window (on openSUSE systems, we use okular; I currently have no idea whether it will be possible to show this in a "frame" inside Gscan2pdf). Then, at the last stage, the selected single pages and metadata need to be merged into the final PDF. The change is not a "must have", but an additional feature.

    I am fine with you closing this issue now, because in my view we have discussed it thoroughly, and there is no real need any more for using "tesseract -c tessedit_create_pdf", given that my observation is correct that Gscan2pdf already creates mixed-mode PDFs.

     
  • Wikinaut

    Wikinaut - 2015-07-29

    Update

    I found, that Gscan2pdf OCR generation of mixed-mode pdf is not preserving the original text layout, for example, a two-column format. Gscan2pdf is treating word-by-word, whereas "tesseract -c tessedit_create_pdf=1" creates the mixed-mode PDF based on an extensive analysis of the image.

    So in contradiction to what I said in my previous post above, there is a need to use tesseract for creating mixed-mode pdfs, because tesseract comes with built-in text area detection and tries to create text and pdf files while preserving the logical text flow.

    In that sense, it differs from Gscan2pdf and offers an additional value (Hybrid Page Layout Analysis). This is explained in the three scientific papers in section IEEE Copyright Materials and ACM Copyright Materials on page https://code.google.com/p/tesseract-ocr/wiki/Documentation .

    By the way, tesseract code moved from google code https://code.google.com/p/tesseract-ocr/ to github just a couple of days ago: https://github.com/tesseract-ocr/tesseract .

    I am still interested in bringing gscan2pdf and tesseract together, but I need a better understanding of how Gscan2pdf works behind the scenes (and will then perhaps, after testing, come up with a change request for an improved Gscan2pdf version).

     
  • Jeffrey Ratcliffe

    What do you mean by "I found, that Gscan2pdf OCR generation of mixed-mode pdf is not preserving the original text layout, for example, a two-column format. Gscan2pdf is treating word-by-word, whereas "tesseract -c tessedit_create_pdf=1" creates the mixed-mode PDF based on an extensive analysis of the image."

    ?

    gscan2pdf takes the hocr output from tesseract and uses it to place the text. If the tesseract output is word-accurate, then so should be the document produced by gscan2pdf.

    If you use djvu, instead of PDF, then this is always accurate. PDF output is complicated by the need to specify a font, and as tesseract does not provide font information in the hocr output, gscan2pdf has to guess the size.

     
  • Wikinaut

    Wikinaut - 2015-07-30

    I meant that gscan2pdf and tesseract differ in the following sense: when I retrieve the text from the PDF, I get different text layouts for text copied from gscan2pdf's and from tesseract's PDF output.

    Try to gscan2pdf a two-column newspaper, for example, and save as PDF. Now try to get the text from the PDF and reuse the text in e.g. kate or libreoffice.

    My observation is that gscan2pdf->pdf results in one word per line and does not preserve the original (block-wise, column-wise) text layout, whereas tesseract at least preserves the logical structure, so that a user can paste and reuse the text in an editor or mail program.

    Regarding djvu: neither I nor any of my friends use djvu at the moment. By the way, your program is not called gscan2djvu but gscan2pdf.

    Perhaps you can find the time to compare the text from the gscan2pdf PDF with the text from the tesseract PDF:

    Use a two- or three-column original e.g. from newspaper or magazine.

    Then pls. try to generate pdf via these two ways (assuming English as OCR language):

    1

    gscan2pdf

    -> save as x.PNG

    -> use "tesseract x.PNG x pdf" (this line is almost equivalent to "tesseract x.PNG x -c tessedit_create_pdf=1")

    Open x.pdf with okular or Acrobat Reader, select all, copy the text to a text editor, save as x.TXT

    compare the text layout with the results of

    2

    gscan2pdf

    -> save as x2.PDF

    Open x2.PDF with okular or Acrobat Reader, select all, copy the text to a text editor, save as x2.TXT

    Compare the layout of x.TXT and x2.TXT.
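
The final comparison step could be done with a plain line diff instead of by eye. A minimal sketch using Python's standard `difflib`; the helper name `layout_diff` is hypothetical, and the two sample strings are made up to mimic a block layout versus a one-word-per-line layout (in practice they would be the contents of x.TXT and x2.TXT).

```python
import difflib
from typing import List


def layout_diff(text_a: str, text_b: str) -> List[str]:
    """Return a unified diff of the two texts, compared line by line."""
    return list(difflib.unified_diff(
        text_a.splitlines(), text_b.splitlines(),
        fromfile="x.TXT", tofile="x2.TXT", lineterm="",
    ))


# Made-up sample text standing in for the two extracted files.
block_layout = "The quick brown fox\njumps over the dog"
word_per_line = "\n".join(block_layout.split())

diff = layout_diff(block_layout, word_per_line)
print("\n".join(diff))
```

An empty result means both extractions have identical line structure; any `-`/`+` lines show where the layouts diverge.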

     
  • Wikinaut

    Wikinaut - 2015-07-30

    In the meantime I found that "select all/copy text" gives incorrect results in both cases (gscan2pdf and tesseract). I mean that one cannot mark a two-column text and expect the text layout to be correct (my mistake, sorry).

    The correctly formatted ocr-ed text is in a text file with the extension txt in the tmp folder, produced by tesseract.

     
  • Wikinaut

    Wikinaut - 2015-08-03

    Hello Jeffrey, regarding this issue I would like to start working on a kind of additional option (not touching the present gscan2pdf workflow/code), and for now I am only asking you for some information on where to start.

    My idea is to

    • use the tesseract 3.0.x pdf option for OCRing page by page (selected pages only)
    • combine the single pages
    • use the gscan2pdf saving options

    In that way, I could use the gscan2pdf framework for scanning and selecting, but would use tesseract for page layout determination and OCR.

    An ad-hoc solution would be a kind of a "hook", which is called by gscan2pdf when the user has selected the scanned pages for saving. That's the point where tesseract-pdf would start working (inputs are single-page, lossless compressed or uncompressed images from the scan steps), page-by-page, and the single-page mixed-mode result needs then to be merged into a single multi-page, mixed-mode pdf.

    I am very interested in your thoughts, and assistance.

     

    Last edit: Wikinaut 2015-08-03
  • Jeffrey Ratcliffe

    Firstly to your comment before last, are you saying that tesseract's txt output is not the same as gscan2pdf's txt output after using tesseract? Please attach a test image which reproduces the problem.

    As far as using gscan2pdf to combine single-page PDFs created by tesseract is concerned, I don't see the advantage, as if you selected tesseract as the OCR engine in gscan2pdf, then you already use tesseract for page layout determination and OCR.

     
    • Wikinaut

      Wikinaut - 2015-08-23

      see https://sourceforge.net/p/gscan2pdf/bugs/199/#8100/678c where I added the correct (test) change for gscan2pdf so that tesseract generates also the PDF page.

       
  • Wikinaut

    Wikinaut - 2015-08-04

    I can deliver some examples in the next 48 hours, but probably not earlier, whereas you could generate the two text files very quickly on your side.

    As far as tesseract is concerned, use "tesseract image.png outfile -l eng hocr pdf" to create outfile.hocr, outfile.pdf and outfile.txt in a single run.
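
The single-run command above could be built programmatically like this. It is a sketch only: the helper names are hypothetical, nothing is executed, and the mapping from config name to output extension (`hocr` to `.hocr`, `pdf` to `.pdf`) is an assumption that may not hold for every tesseract version. The `.txt` file mentioned in the post is not modelled here.

```python
from typing import List, Sequence


def build_multi_output_cmd(image: str, output_base: str, language: str,
                           configs: Sequence[str]) -> List[str]:
    """Append the requested output configs (e.g. 'hocr', 'pdf') to one tesseract call."""
    return ["tesseract", image, output_base, "-l", language, *configs]


def requested_outputs(output_base: str, configs: Sequence[str]) -> List[str]:
    """Assumed output names: one file per config, extension equal to the config name."""
    return [f"{output_base}.{name}" for name in configs]


cmd = build_multi_output_cmd("image.png", "outfile", "eng", ["hocr", "pdf"])
outputs = requested_outputs("outfile", ["hocr", "pdf"])
print(" ".join(cmd))
print(outputs)
```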

     
