Failing to read text from image, says Empty page!!

2012-11-22
2012-11-26
  • Quan Nguyen
    2012-11-22

    Any OCR engine would have difficulty handling CAPTCHA.

    You may get better results after converting the image to grayscale or thresholding it to monochrome, rescaling it to 300 DPI, and trying different page segmentation modes.
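
    The preprocessing suggested above could be sketched roughly as follows, using only the standard Java imaging classes. The class name OcrPreprocess, the fixed threshold of 128, and the assumed 96 DPI source resolution are illustrative assumptions, not part of Tess4J or Tesseract; a real CAPTCHA will likely need a tuned or adaptive threshold.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

// Hypothetical helper class for illustration; not part of any OCR library.
final class OcrPreprocess {

    /** Resamples the pixel grid by targetDpi / sourceDpi (for example 96 to 300 DPI). */
    static BufferedImage scaleToDpi(BufferedImage src, double sourceDpi, double targetDpi) {
        double factor = targetDpi / sourceDpi;
        int w = Math.max(1, (int) Math.round(src.getWidth() * factor));
        int h = Math.max(1, (int) Math.round(src.getHeight() * factor));
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = out.createGraphics();
        g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                           RenderingHints.VALUE_INTERPOLATION_BICUBIC);
        g.drawImage(src, 0, 0, w, h, null);
        g.dispose();
        return out;
    }

    /** Thresholds to a 1-bit black-and-white image using a fixed luma cutoff (0..255). */
    static BufferedImage binarize(BufferedImage src, int threshold) {
        BufferedImage out = new BufferedImage(src.getWidth(), src.getHeight(),
                                              BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < src.getHeight(); y++) {
            for (int x = 0; x < src.getWidth(); x++) {
                int rgb = src.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Integer approximation of perceptual luma (alpha is ignored).
                int luma = (299 * r + 587 * g + 114 * b) / 1000;
                out.setRGB(x, y, luma >= threshold ? 0xFFFFFFFF : 0xFF000000);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Synthetic stand-in for a real image: white background with a dark bar.
        BufferedImage src = new BufferedImage(40, 20, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = src.createGraphics();
        g.setColor(Color.WHITE);
        g.fillRect(0, 0, 40, 20);
        g.setColor(Color.BLACK);
        g.fillRect(5, 8, 30, 4);
        g.dispose();

        // Scale first, then threshold, so interpolation blur is removed by the cutoff.
        BufferedImage prepared = binarize(scaleToDpi(src, 96, 300), 128);
        System.out.println(prepared.getWidth() + " x " + prepared.getHeight());
    }
}
```

    Note that this only resamples the pixels; it does not write DPI metadata into the file. The resulting image can then be passed to the OCR engine in place of the original.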

     
    Last edit: Quan Nguyen 2012-11-22
  • Hi Quan,

    Could you please tell me what a page segmentation mode is, and point me to an example that uses one?
    From the code, I see that only one eng.traineddata file is available. Does it cover all font types (Arial, Times New Roman, etc.) and overlapping characters, including bold and italic?
    Also, please let me know how to create a traineddata file for other languages or fonts.

    Appreciate your help on this.

    Regards,
    Lakshman

     
  • Quan Nguyen
    2012-11-26

    Hi Lakshman,

    You can check the project's documentation for information about page segmentation modes (PSM); the mode names describe what each mode does.
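
    As a rough illustration, the modes documented for Tesseract 3.x can be mirrored in a small Java enum. The enum, the class name PsmExample, and the file names below are hypothetical and exist only for this sketch; the numeric values and the -psm command-line option follow the Tesseract 3.x documentation (later releases spell the option --psm and add further modes).

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; not part of Tess4J or Tesseract.
final class PsmExample {

    /** Page segmentation modes 0..10 as listed for Tesseract 3.x. */
    enum PageSegMode {
        OSD_ONLY(0, "Orientation and script detection (OSD) only"),
        AUTO_OSD(1, "Automatic page segmentation with OSD"),
        AUTO_ONLY(2, "Automatic page segmentation, but no OSD or OCR"),
        AUTO(3, "Fully automatic page segmentation, but no OSD (default)"),
        SINGLE_COLUMN(4, "Assume a single column of text of variable sizes"),
        SINGLE_BLOCK_VERT_TEXT(5, "Assume a single uniform block of vertically aligned text"),
        SINGLE_BLOCK(6, "Assume a single uniform block of text"),
        SINGLE_LINE(7, "Treat the image as a single text line"),
        SINGLE_WORD(8, "Treat the image as a single word"),
        CIRCLE_WORD(9, "Treat the image as a single word in a circle"),
        SINGLE_CHAR(10, "Treat the image as a single character");

        final int value;
        final String description;

        PageSegMode(int value, String description) {
            this.value = value;
            this.description = description;
        }
    }

    /** Builds (but does not run) a Tesseract 3.x command line for the given mode. */
    static List<String> commandFor(String image, String outputBase, PageSegMode mode) {
        return Arrays.asList("tesseract", image, outputBase,
                             "-l", "eng", "-psm", String.valueOf(mode.value));
    }

    public static void main(String[] args) {
        // Short images such as CAPTCHAs are usually worth trying as a single line or word.
        for (PageSegMode mode : new PageSegMode[] {
                PageSegMode.SINGLE_LINE, PageSegMode.SINGLE_WORD}) {
            // "captcha.png" and "out" are placeholder names.
            System.out.println(String.join(" ", commandFor("captcha.png", "out", mode))
                               + "   # " + mode.description);
        }
    }
}
```

    For a short image such as a CAPTCHA, the single-line and single-word modes are usually the first ones worth trying instead of the default automatic segmentation.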

    eng.traineddata covers basic fonts and styles. You can unpack the file or check the Tesseract Wiki for details about the language data and the training process.

    http://code.google.com/p/tesseract-ocr/

    Regards,
    Quan