Re: [gscan2pdf-help] ocr and resolution
From: Richard L. <ri...@th...> - 2011-01-31 09:17:27
On Sun, Jan 30, 2011 at 09:55:13AM +0100, Rainer Dorsch wrote:
> On Saturday, 29 January 2011, Andy Fingerhut wrote:
> > On Jan 29, 2011, at 9:20 AM, Rainer Dorsch wrote:
[...]
> > > One thing I was surprised though is that it seems to me that the text
> > > embedded in the pdf is far off the location of the real text.
> > > Essentially the text embedded in the pdf is all in the top left
> > > corner. Is that intended for some reason or a OCR limitation?
> > There are multiple OCR engines that gscan2pdf can use. There is a
> > popup menu that lets you select among several (if they are installed
> > on your machine) when you start the steps to do OCR.
[...]
> >
> > For others, either the OCR engine itself is incapable of producing
> > text position data, or gscan2pdf cannot process and use that
> > information yet.
> > They are definitely all the same quality

I have found great differences in reliability in my tests. GOCR was far
less accurate than either Tesseract or Ocropus, which usually produce
very similar output for me. However, my tests were over six months ago
and may not be representative.

> I use gscan2pdf 0.31 and Tesseract 2.04, which produced in my tests better
> results than gocr 0.48.
[...]
> Does anybody have more details on how well Tesseract is supposed to work with
> gscan2pdf?

I would have continued with GOCR, in preference to Ocropus or Tesseract,
because I preferred to have the full OCR text in one block to edit
manually if required. Correcting the scattered separate blocks produced
by the other two engines is very labour-intensive, because you need to
click on each block in turn to check and edit it.

But, given the relatively poor performance of the GOCR engine in my
tests, I opted for Ocropus. I chose this rather than Tesseract because,
at the time I was testing, the output from Tesseract was shown in an
unreadably small font size (this may have been corrected since then).

With this choice, I abandoned any hope of making manual corrections
(which at least saves time, as I don't even stop to inspect the output!)

The advantage of the correctly positioned blocks is that some pdf
readers show the text under the mouse pointer. This allows cut and paste
from the document as displayed in the pdf reader.

HTH

richard