The latest Tesseract version (built from source, see https://code.google.com/p/tesseract-ocr/wiki/Compiling ) can directly create mixed-mode PDF output.
For details, see https://code.google.com/p/tesseract-ocr/wiki/FAQ#How_do_I_produce_searchable_PDF_output?
You only have to add "pdf" as an output format option. I have been using this for more than a year.
I suggest adding an option to gscan2pdf (default=on), for use with Tesseract, so that mixed-mode PDFs can be created directly.
Remark: Tesseract can create multiple file types in one go (e.g. text, hOCR, pdf)!
Let me know if you need more details, or contact the Tesseract staff through https://code.google.com/p/tesseract-ocr/issues/list but I don't think that this will be necessary.
To clarify: mixed-mode PDF == PDF containing the original input image AND the ocr-ed-text layer
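A minimal sketch of such an invocation, assuming a Tesseract build with PDF support and the stock "hocr" and "pdf" config files installed. The helper name tesseract_cmd is made up for illustration; it only prints the command line and does not run anything.

```shell
# Hypothetical helper: print (not run) a tesseract command line.
# Usage: tesseract_cmd IMAGE OUTPUT_BASE LANGUAGE [FORMAT...]
# Tesseract appends the extension (.hocr, .pdf, ...) to OUTPUT_BASE itself.
# File names containing spaces are not handled in this sketch.
tesseract_cmd() {
    img=$1
    base=$2
    lang=$3
    shift 3                     # the remaining arguments are output formats
    echo "tesseract $img $base -l $lang $*"
}
```

For example, `tesseract_cmd scan.png out eng hocr pdf` prints a command that requests hOCR and PDF output for the same input image in one run.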
On 24 June 2015 at 00:17, Wikinaut wikinaut@users.sf.net wrote:
By default, I personally use djvu because the files are smaller.
What is the advantage of tesseract creating the PDF? How does
tesseract select the image format and compression?
Related
Bugs: #199
Tesseract is the OCR engine used by Google Books (according to information leaked on the internet), and it is actively maintained. I recently submitted some bug reports, and the bugs were fixed quickly.
The mixed-mode PDF output is perfect. Tesseract uses LZW compression, so that the original(!) input image compression is preserved. I pointed the developers to a problem with their "best compression mode" detection algorithm, which was fixed about a year ago.
If you wish, I can send you a scanned image and the OCR-processed mixed-mode output.
Hi, I would now like to continue the work towards a solution for this issue. Basically, in /lib/Gscan2pdf/Tesseract.pm line 237
I added the option switch "tessedit_create_pdf" for creating not only hocr but also pdf (Tesseract can create hocr and pdf output in a single invocation),
but I cannot find the created .pdf file in the /tmp directory.
Perhaps you, as the program author, can assist (what did I do wrong?).
If you want to let Tesseract generate mixed-mode PDF pages (without changing the current gscan2pdf workflow), then
change line 237 in /lib/Gscan2pdf/Tesseract.pm to
When scanning and OCRing with gscan2pdf, the Tesseract-generated .pdf (and also .txt) files will remain undeleted in /tmp . Please simply compare this pdf output with output of gscan2pdf.
I plan to extend this in the future so that tesseract-generated multi-page PDFs will be created at almost no additional cost (as already mentioned elsewhere, tesseract can generate hocr, txt and pdf in one run).
Please post the complete patch of your current changes and I will take a look.
This is the complete patch. It is only necessary to add "tessedit_create_pdf 1" as parameter (or to add the keyword "pdf" to the command line).
However, I could not find the created double-layer pdf file in the tmp directory.
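One way to check which files tesseract actually wrote for a given output base is sketched below. The helper name list_tesseract_outputs is hypothetical, and the extension list (txt, hocr, pdf) is an assumption based on the output types mentioned in this thread. If gscan2pdf cleans up its temporary files after OCR, the check would have to run before that cleanup.

```shell
# Hypothetical helper: print every tesseract output file that exists for the
# given output base (the "$path$name" part of the command line).
list_tesseract_outputs() {
    base=$1
    for ext in txt hocr pdf; do
        if [ -f "$base.$ext" ]; then
            echo "$base.$ext"
        fi
    done
}
```

Calling `list_tesseract_outputs /tmp/somebase` (with whatever base name was passed to tesseract) lists the files that are present and prints nothing if none were written.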
RE: /lib/Gscan2pdf/Tesseract.pm line 236
Suggestion 1:
I suggest you change line 236 to:
$cmd = "tesseract $tif $path$name -l $options{language} -c tessedit_create_hocr=1 -c tessedit_create_pdf=1";
Rationale:
Perhaps you can find the time to confirm my observation and patch. In my view it's more elegant to let tesseract do the work of composing the mixed-mode pdfs.
Suggestion 2:
It would be nice to have a third window in the Gscan2pdf GUI (in addition to the scan and the hocr output) for rendering the mixed-mode PDF, from which the user could copy text directly.
Last edit: Wikinaut 2015-07-29
Rendering a PDF is definitely non-trivial. Evince does this, and the "View files on saving" option opens evince (or whatever you have set as your PDF viewer), from which you can copy text.
Without trying your suggestion, my immediate thoughts are:
Having said that, I'll take a look as soon as I have got the next release out.
Summary:
Just two things in reply to your previous post:
Thus, JPEG (and any other lossy compression) is not an option. I understand that final PDF file sizes may matter, and that your program allows the flexibility to fine-tune and optimise the resulting file sizes. This is great.
I just wanted to mention and try to find a way to fully use the built-in capability of tesseract, when used with "-c tessedit_create_pdf=1".
Regarding my personal use cases, the main purpose of Gscan2pdf is the scanning and archiving of paper documents. The required output is a PDF image with OCR text, and recent tesseract versions do this very well.
So what I suggest is to leave your hocr window as it is, but perhaps to find a way to display the tesseract-generated single PDF pages in a third window (on openSUSE systems we use okular; I currently have no idea whether it will be possible to show this in a "frame" inside Gscan2pdf). Then, at the last stage, the selected single pages and metadata need to be merged into the final PDF. The change is not a "must have", but an additional feature.
I agree to you closing this issue now, because in my view we have discussed it thoroughly, and there is no longer any real need to use "tesseract -c tessedit_create_pdf", given that my observation is correct that Gscan2pdf already creates mixed-mode PDFs.
Update
I found that Gscan2pdf's generation of mixed-mode PDFs from OCR does not preserve the original text layout, for example a two-column format. Gscan2pdf treats the text word by word, whereas "tesseract -c tessedit_create_pdf=1" creates the mixed-mode PDF based on an extensive analysis of the image.
So in contradiction to what I said in my previous post above, there is a need to use tesseract for creating mixed-mode pdfs, because tesseract comes with built-in text area detection and tries to create text and pdf files while preserving the logical text flow.
In that sense, it differs from Gscan2pdf and offers additional value (Hybrid Page Layout Analysis). This is explained in the three scientific papers in the sections "IEEE Copyright Materials" and "ACM Copyright Materials" on the page https://code.google.com/p/tesseract-ocr/wiki/Documentation .
By the way, tesseract code moved from google code https://code.google.com/p/tesseract-ocr/ to github just a couple of days ago: https://github.com/tesseract-ocr/tesseract .
I am still interested in bringing gscan2pdf and tesseract together, but I need a better understanding of how Gscan2pdf works behind the scenes (and will then perhaps, after testing, come up with a change request for an improved Gscan2pdf version).
What do you mean by "I found that Gscan2pdf's generation of mixed-mode PDFs from OCR does not preserve the original text layout, for example a two-column format. Gscan2pdf treats the text word by word, whereas "tesseract -c tessedit_create_pdf=1" creates the mixed-mode PDF based on an extensive analysis of the image."?
gscan2pdf takes the hocr output from tesseract and uses it to place the text. If the tesseract output is word-accurate, then so should be the document produced by gscan2pdf.
If you use djvu, instead of PDF, then this is always accurate. PDF output is complicated by the need to specify a font, and as tesseract does not provide font information in the hocr output, gscan2pdf has to guess the size.
I meant the following difference between gscan2pdf and tesseract: when retrieving the text from the PDF, I get different text layouts when comparing text copied from gscan2pdf's and tesseract's PDF outputs.
Try to gscan2pdf a two-column newspaper, for example, and save as PDF. Now try to get the text from the PDF and reuse the text in e.g. kate or libreoffice.
My observation is that gscan2pdf->pdf results in one word per line, and does not preserve the original (block-wise, column-wise) text layout, whereas tesseract at least preserves the logical structure, so that a user can paste and reuse the text in an editor or mail program.
Regarding djvu: neither I nor any of my friends use djvu at the moment. By the way, your program is not called gscan2djvu but gscan2pdf.
Perhaps you can find the time to compare the text from the gscan2pdf PDF with the text from the tesseract PDF:
Use a two- or three-column original e.g. from newspaper or magazine.
Then please try to generate the PDF in these two ways (assuming English as the OCR language):
1
gscan2pdf
-> save as x.PNG
-> use "tesseract x.PNG x pdf" (this line is almost equivalent to "tesseract x.PNG x -c tessedit_create_pdf=1")
Open x.pdf with okular or Acrobat Reader, select all, copy the text to a text editor, save as x.TXT
compare the text layout with the results of
2
gscan2pdf
-> save as x2.PDF
Open x2.PDF with okular or Acrobat Reader, select all, copy the text to a text editor, save as x2.TXT
Compare the layout of x.TXT and x2.TXT.
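To put a number on the "one word per line" observation, the two text files from the steps above could be compared with a small helper like the following. This is only a sketch; the function name words_per_line is made up, and the metric (average words per non-empty line) is my own choice, not something either program reports.

```shell
# Hypothetical helper: print the average number of words per non-empty line
# of a text file, with one decimal place.
words_per_line() {
    LC_ALL=C awk '
        NF > 0 { words += NF; lines++ }
        END {
            if (lines > 0) printf "%.1f\n", words / lines
            else print "0.0"
        }' "$1"
}
```

Running `words_per_line x.TXT` and `words_per_line x2.TXT` and comparing the two values gives a rough indication: a value close to 1.0 means the copied text arrived as one word per line.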
In the meantime I found that "select all/copy text" gives incorrect results in both cases (gscan2pdf and tesseract). That is, one cannot mark a two-column text and expect the text layout to be correct afterwards (this was my mistake, sorry).
The correctly formatted ocr-ed text is in a text file with the extension txt in the tmp folder, produced by tesseract.
Hello Jeffrey, regarding this issue I would like to start working on a kind of additional option (not touching the present gscan2pdf workflow/code), and for now I am only asking you for some pointers on where to start.
My idea is to
In that way, I could use the gscan2pdf framework for scanning and selecting, but would use tesseract for page layout determination and OCR.
An ad-hoc solution would be a kind of a "hook", which is called by gscan2pdf when the user has selected the scanned pages for saving. That's the point where tesseract-pdf would start working (inputs are single-page, lossless compressed or uncompressed images from the scan steps), page-by-page, and the single-page mixed-mode result needs then to be merged into a single multi-page, mixed-mode pdf.
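Outside of gscan2pdf, the page-by-page idea could look roughly like the sketch below. It is not gscan2pdf code: the function names run and ocr_and_merge, the DRY_RUN and OCR_LANG variables, and the use of pdfunite (from poppler-utils) for merging are all assumptions made for illustration; any PDF merging tool would do.

```shell
# Run a command, or only print it when DRY_RUN=1 (hypothetical switch).
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical hook: OCR each page image into a single-page PDF with
# tesseract, then merge the pages into one multi-page PDF with pdfunite.
# Usage: ocr_and_merge OUTPUT.pdf PAGE_IMAGE...
ocr_and_merge() {
    out=$1
    shift
    lang=${OCR_LANG:-eng}
    pages=""
    for img in "$@"; do
        base=${img%.*}          # strip the extension: page1.png -> page1
        run tesseract "$img" "$base" -l "$lang" pdf || return 1
        pages="$pages $base.pdf"
    done
    # $pages is deliberately unquoted so that it splits into one argument per
    # page; this sketch therefore does not support file names with spaces.
    run pdfunite $pages "$out"
}
```

With `DRY_RUN=1`, a call such as `ocr_and_merge all.pdf p1.png p2.png` only prints the tesseract and pdfunite commands it would run, which makes it easy to inspect the sequence before trying it on real scans.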
I am very interested in your thoughts, and assistance.
Last edit: Wikinaut 2015-08-03
Firstly to your comment before last, are you saying that tesseract's txt output is not the same as gscan2pdf's txt output after using tesseract? Please attach a test image which reproduces the problem.
As far as using gscan2pdf to combine single-page PDFs created by tesseract is concerned, I don't see the advantage, as if you selected tesseract as the OCR engine in gscan2pdf, then you already use tesseract for page layout determination and OCR.
See https://sourceforge.net/p/gscan2pdf/bugs/199/#8100/678c where I added the correct (test) change for gscan2pdf so that tesseract also generates the PDF page.
I can deliver some examples in the next 48 hours, but probably not earlier, whereas you could generate the two text files very quickly at your site.
As far as tesseract is concerned, use "tesseract image.png outfile -l eng hocr pdf" to create outfile.hocr, outfile.pdf and outfile.txt in a single run.