I have DVDs with both traditional and simplified Chinese subtitles. I have installed both the chi_sim and chi_tra Tesseract data. I have noticed that both subtitle tracks use "zh" as the language identifier. I would like to decode both with high accuracy.
I have gotten one or the other to work (sort of). This led me into debugging the image extraction and OCR processing.
When I look at the raw subtitle images, they look like s**t. They only look good on screen because of the outline that is added. I have tried doing the subtitle and image extraction using ogmrip and two command-line programs. The poor image quality reduces the accuracy of the OCR for Chinese.
Questions:
1> Is there a way to detect or select traditional or simplified Chinese when selecting the subtitle track in the GUI? In the code it just detects "zh" and selects the "chi" Tesseract dataset.
2> Is there a way to do some image processing before the OCR stage?
I have the source code, so answers can refer to it directly.
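To make the two questions concrete, here is a rough sketch of the kind of thing I have in mind. It is not taken from the project's code: the function names are made up for illustration, and it assumes the subtitle bitmap is available as rows of palette indices and that I know which index is the glyph fill (as opposed to the outline or background). Only the dataset names chi_sim and chi_tra are real.

```python
# Hypothetical helpers for illustration only; none of these names exist in the project.

# Tesseract data set names for the two Chinese scripts.
TESSERACT_CHINESE = {"simplified": "chi_sim", "traditional": "chi_tra"}


def pick_tesseract_lang(iso_code, script):
    """Resolve the ambiguous "zh" track code to a Tesseract data set name.

    `script` is assumed to come from the user (e.g. a GUI choice).
    """
    if iso_code != "zh":
        raise ValueError("this sketch only handles Chinese tracks")
    if script not in TESSERACT_CHINESE:
        raise ValueError("zh is ambiguous: choose 'simplified' or 'traditional'")
    return TESSERACT_CHINESE[script]


def binarize(pixels, fill_index):
    """Keep only the glyph fill: fill becomes black (0), all else white (255)."""
    return [[0 if p == fill_index else 255 for p in row] for row in pixels]


def upscale(pixels, factor):
    """Nearest-neighbour enlargement of a 2D list of pixel values."""
    out = []
    for row in pixels:
        wide = [p for p in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


if __name__ == "__main__":
    # Tiny fake subtitle: 0 = background, 1 = glyph fill, 2 = outline.
    raw = [[0, 2, 1, 2],
           [0, 2, 1, 2]]
    clean = upscale(binarize(raw, fill_index=1), factor=2)
    print(pick_tesseract_lang("zh", "traditional"), len(clean), len(clean[0]))
```

The idea is that dropping the outline and enlarging the glyphs happens before the image is handed to the OCR, and that the script choice is made explicitly instead of being guessed from "zh". Whether this actually improves accuracy on my discs is something I have not verified.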
shaun