#138 Problems with tesseract ocr also ocropus

v1.0_(example)
closed-fixed
nobody
None
5
2014-01-27
2012-06-06
gizmobay
No

I used ocropus in the past, but since I upgraded to Ubuntu 12.04, ocropus is no longer available, so I can't use it with gscan2pdf. It appears this was the result of a bug in ocropus. The bug has since been fixed with the release of ocropus 0.5, but that version isn't in the repo yet. This is not really a gscan2pdf issue, but I figured I'd bring it up.

I switched to tesseract, but this gives me issues. Some of my pages give me this error:

utf8 "\x80" does not map to Unicode at /.../lib/Gscan2pdf.pm line 921, <> chunk 1.

As a result, when I try to save the resulting file as a djvu, it reports bad characters and hangs. I cancel the save in gscan2pdf and then delete the bad pages. The problem is that gscan2pdf doesn't completely cancel the save: when I try to save again, it just says it is doing process 1 of 2, and I assume process 1 is the previous save, since it continues to hang.

I'm using gscan2pdf 1.0.4.

Discussion

  • The UTF8 issue is due to a bug in Tesseract, and I have already created a patch to work around the problem in gscan2pdf:

    http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=670831

    You'll see the fix in the upcoming release 1.0.5.

    If you are not happy patching the source, another workaround would be to use cuneiform.

     
    • status: open --> closed-fixed
    • Group: --> v1.0_(example)
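
For readers hitting the same error, the general idea behind this kind of workaround can be sketched as follows. This is not the actual gscan2pdf patch (gscan2pdf is written in Perl, and the patch itself is in the Debian bug linked above). It is a minimal Python illustration, assuming the OCR output arrives as raw bytes that are mostly valid UTF-8 with an occasional stray byte such as 0x80. The function name decode_ocr_output is hypothetical.

```python
def decode_ocr_output(raw: bytes) -> str:
    """Decode OCR output as UTF-8, tolerating invalid bytes.

    Hypothetical helper for illustration only. Bytes that are not valid
    UTF-8 are replaced with U+FFFD instead of raising an error.
    """
    return raw.decode("utf-8", errors="replace")


# A lone 0x80 byte is a continuation byte with no lead byte, so it is
# not valid UTF-8 and a strict decode fails.
sample = b"scanned \x80 text"

try:
    sample.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# The tolerant decode keeps the rest of the page text intact.
cleaned = decode_ocr_output(sample)
```

Replacing the bad byte rather than dropping the page keeps the remaining recognized text usable, at the cost of a visible placeholder character where the OCR engine emitted the invalid byte.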