I am busy scanning articles from an old Russian book. The layout is
simple. The book is a bit bigger than A5, so I can scan 2 pages at
once. All goes well: the pages are rotated, split into 2 pages, and
cleaned up with unpaper.
The problem I have is with the OCR (tesseract). The articles also use
English titles in their references, and sometimes English names within
the articles. How do I let tesseract know this? Also, how do I train
tesseract on its errors?
On 7 June 2013 15:54, Bastiaan Wakkie <bwakkie@...> wrote:
> The problem I have is with the ocr (tesseract). The articles use for
> reference also english titles and sometimes english names within the
> articles. How do I let tesseract know this? Also how do I train
> tesseract on errors?
For questions on how to train tesseract, you are better off asking on
the tesseract mailing list.
Otherwise, I know that cuneiform (which gscan2pdf also supports) has a
"Rus-Eng" language, which I assume does what you want. As I don't know
Russian, I have no idea how accurate it is.
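
On the mixed-language part of the question: tesseract 3.02 and later
accept several languages joined with "+" in the -l option (for example
"-l rus+eng"). Below is a minimal sketch of calling it that way from
Python, assuming the "rus" and "eng" language data files are installed.
The helper names and the file names "page.tif" and "page" are
hypothetical, chosen only for illustration, and I have not checked how
well the combination performs on this particular book.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, languages=("rus", "eng")):
    """Build a tesseract command line that loads several languages at once.

    Tesseract (3.02+) takes multiple languages as a "+"-joined list after -l.
    """
    return ["tesseract", image_path, output_base, "-l", "+".join(languages)]


def run_ocr(image_path, output_base, languages=("rus", "eng")):
    """Run tesseract on one page image and return the path of the text file.

    Assumes the tesseract binary is on PATH and that the traineddata files
    for every requested language are installed.
    """
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(
        build_tesseract_command(image_path, output_base, languages),
        check=True,
    )
    # tesseract appends ".txt" to the output base name by default.
    return output_base + ".txt"


if __name__ == "__main__":
    # Hypothetical file names; only prints the command, does not run OCR.
    print(" ".join(build_tesseract_command("page.tif", "page")))
```

The same thing on the command line would be
"tesseract page.tif page -l rus+eng". The order of the languages can
affect the result, so it may be worth trying "eng+rus" as well and
comparing the output on a few pages.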