#138 Problems with tesseract ocr also ocropus


I used ocropus in the past, but since I upgraded to Ubuntu 12.04 it is no longer available, so I can't use it with gscan2pdf. It appears this was the result of a bug in ocropus. That bug has since been fixed with the release of ocropus 0.5, but the release isn't in the repo yet. This is not really a gscan2pdf issue, but I figured I'd bring it up.

I switched to tesseract, but this gives me issues. Some of my pages give me this error:

utf8 "\x80" does not map to Unicode at /.../lib/Gscan2pdf.pm line 921, <> chunk 1.

As a result, when I try to save the resulting file as a DjVu, it reports bad characters and hangs. I cancel the save in gscan2pdf and then delete the bad pages. The problem is that gscan2pdf doesn't completely cancel the save: when I try to save again, it just says it is doing process 1 of 2, and I assume process 1 is the previous save, since it continues to hang.

I'm using gscan2pdf 1.0.4.


  • Jeffrey Ratcliffe

    The UTF-8 issue is due to a bug in Tesseract, and I have already created a patch to work around the problem in gscan2pdf.


    You'll see the fix in the upcoming release 1.0.5.

    If you are not happy patching the source, another workaround would be to use cuneiform.
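
For readers who want to see what the error above means: the patch itself is not reproduced in this thread, so the following is only a rough Python sketch of the general idea, not the actual gscan2pdf fix (gscan2pdf is written in Perl). It assumes the Tesseract bug shows up as stray bytes such as 0x80 in otherwise UTF-8 output, as the error message suggests. The function name `decode_ocr_output` and the sample text are hypothetical.

```python
def decode_ocr_output(raw: bytes) -> str:
    """Decode OCR output as UTF-8, replacing invalid bytes with U+FFFD.

    Hypothetical helper for illustration only; not part of gscan2pdf.
    """
    return raw.decode("utf-8", errors="replace")


# Assumed sample: a lone 0x80 byte is not valid UTF-8 on its own.
sample = b"scanned text \x80 more text"

# Strict decoding fails, which mirrors the "does not map to Unicode" error.
try:
    sample.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Lenient decoding keeps the rest of the page text usable.
cleaned = decode_ocr_output(sample)
```

Replacing the bad byte rather than dropping the whole page keeps the remaining text, at the cost of a visible placeholder character where the invalid byte was.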

  • Jeffrey Ratcliffe

    • status: open --> closed-fixed
    • Group: --> v1.0_(example)
