Shaun Savage - 2014-12-19

Hi

I have DVDs with both traditional and simplified Chinese subtitles. I have installed both the chi_sim and chi_tra Tesseract data. I have noticed that both subtitle tracks use "zh" as the language identifier. I would like to decode both with high accuracy.

I have gotten one or the other to work (sort of). This led me into debugging the image extraction and OCR processing.

When I look at the raw subtitle images, they look like s**t. They look good on the screen because of the outline that is added. I have tried doing the subtitle and image extraction using ogmrip and two command-line programs. The poor image quality reduces the accuracy of the OCR for Chinese.

Questions:
1> Is there a way to detect or select traditional or simplified Chinese when selecting the subtitle track in the GUI? In the code it just detects "zh" and selects the "chi" Tesseract dataset.

2> Is there a way to do some image processing before the OCR stage?
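
To make question 2 concrete, here is a rough sketch of the kind of preprocessing I mean. It is not code from the program, and it is written in plain Python for illustration only. It assumes the extracted subtitle image is available as a grid of palette indices (DVD subpictures use a four-entry palette) and that the index of the glyph fill colour is known. The names binarize, upscale, preprocess and fill_index are made up for this example. The idea is to drop the outline and background, keep only the fill as black on white, and enlarge the result before handing it to the OCR.

```python
# Hypothetical sketch: "pixels" is a list of rows of palette indices (0-3).
# Which index is the glyph fill is an assumption; it differs between discs.

def binarize(pixels, fill_index):
    """Map glyph fill pixels to 0 (black) and everything else to 255 (white)."""
    return [[0 if p == fill_index else 255 for p in row] for row in pixels]


def upscale(pixels, factor):
    """Enlarge the image by an integer factor using nearest-neighbour copying."""
    out = []
    for row in pixels:
        # Repeat each pixel horizontally, then repeat the widened row vertically.
        wide = [p for p in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def preprocess(pixels, fill_index, factor=2):
    """Remove outline and background, then enlarge the image for the OCR stage."""
    return upscale(binarize(pixels, fill_index), factor)
```

Something like preprocess(image, fill_index=1, factor=3) would then run between extraction and OCR. Whether this actually helps with Chinese glyphs is exactly what I am asking about.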

I have the code so answers can reference the code.

shaun

1>