I am trying to extract text from a scanned PDF using VietOCR. I downloaded the Arabic language package, but the output is still wrong: I get Arabic characters, but about 20% of the words contain errors.
How can I resolve this? I also tried writing a small Java tool with tess4j, but I got the same results.
Thank you for any help.
Mourinho
The quality of the image plays an important part in the quality of the output text. You may need to improve the scanning (300 DPI, grayscale or B/W, for example), preprocess the image (Improve Quality), tweak the Tesseract engine, and lastly, perform post-OCR corrections.
If the font does not resemble the supported fonts, you may need to consider training Tesseract to recognize that font.
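As a rough illustration of the preprocessing step, here is a minimal sketch that converts a page image to pure black and white using only the JDK. The class name `OcrPreprocess`, the method `binarize`, and the threshold value of 128 are assumptions made for this example, not part of VietOCR or Tess4J, and a fixed threshold is only one simple option; unevenly lit scans may need something adaptive.

```java
import java.awt.image.BufferedImage;

// Hypothetical helper for illustration only; not part of VietOCR or Tess4J.
final class OcrPreprocess {

    /**
     * Returns a black-and-white copy of the image: pixels whose luminance is
     * below the threshold (0-255) become black, all others become white.
     */
    static BufferedImage binarize(BufferedImage src, int threshold) {
        int w = src.getWidth();
        int h = src.getHeight();
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int rgb = src.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Integer approximation of the usual luminance weights.
                int luminance = (r * 299 + g * 587 + b * 114) / 1000;
                out.setRGB(x, y, luminance < threshold ? 0x000000 : 0xFFFFFF);
            }
        }
        return out;
    }
}
```

With Tess4J on the classpath, the cleaned image could then be passed to the engine, for example by creating a `Tesseract` instance, calling `setLanguage("ara")`, and then `doOCR(OcrPreprocess.binarize(page, 128))`. The threshold is something to experiment with per document rather than a recommended value.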
Last edit: Quan Nguyen 2014-04-20