#21 Bad at character aspect ratio when space between lines is small

Milestone: 4.3.0

Status: closed

Owner: cb4960

Labels: bug (17)

Updated: 2017-10-22

Created: 2017-08-30

Creator: Alex N

Private: No

When trying to OCR manga panels where the space between lines is very small, like this one:

https://i.imgur.com/8nfAmgQ.png

Capture2Text consistently fails to detect where characters start and end horizontally, and effectively merges the columns together. The output is really bad:

諦叩剛卿弧髑を

調叩幟卿伽謗を

Sometimes it gets the wrong number of vertical character segmentations entirely:

謂帥卿馨を

I don't know what it's doing internally, but if it's only making decisions once, maybe it could try multiple passes if it's unsure of what the characters are sized like, and pick the most "language-like" pass.

If I clip one line at a time it OCRs fine.
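The "multiple passes" idea above could be sketched roughly as follows. This is only an illustration, not how Capture2Text or Tesseract works internally: it assumes the candidate strings come from separate OCR passes run with different settings, and it approximates "language-like" by the share of kana in each result, on the assumption that real Japanese sentences contain plenty of kana while merged columns tend to come out as runs of rare kanji. The names language_likeness and pick_best_pass are hypothetical.

```python
import re
from typing import Iterable

# Assumption: hiragana and katakana (U+3040 to U+30FF) are a cheap proxy
# for "looks like real Japanese". A real scorer would more likely use a
# dictionary or a language model.
KANA = re.compile(r"[\u3040-\u30ff]")


def language_likeness(text: str) -> float:
    """Return the fraction of non-space characters that are kana."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if KANA.match(c)) / len(chars)


def pick_best_pass(candidates: Iterable[str]) -> str:
    """Return the candidate with the highest score (first one wins ties).

    Raises ValueError if no candidates are given.
    """
    return max(candidates, key=language_likeness)
```

For example, given one pass that returned 諦叩剛卿弧髑を and another that returned a kana-rich sentence, pick_best_pass would prefer the second. A kana ratio is a crude heuristic and would misjudge kanji-heavy text, so treat it as a placeholder for a better scoring function.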

Discussion

Alex N - 2017-08-30

An off-topic tangent on the notion of "trying multiple passes", i.e. varying uncertain parameters to look for better results:

I contribute to an interactive Japanese text parser called Spark Reader, and recently added a function that "fixes up" OCR output: it brute-forces replacements for characters that OCR is bad at and keeps the replacements that result in a simpler parse. I posted about it on a forum, and someone thought that SR was a manga reader instead of a parser, so it seems like people might not find or use something that fixes up OCR with parsing or morphological analysis unless it's built into an OCR program itself.

Capture2Text would probably benefit a lot from having similar functionality built in, but including a full parser would be overkill, and using parsing for something like that is already massive overkill. Maybe for something like the OCR manga reader it would make sense.
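The brute-force fix-up described in this comment can be sketched as below. This is not Spark Reader's code: the word list, the table of confusable characters, and the function names are all hypothetical, and "simpler parse" is approximated as "fewer segments when the text is split into dictionary words, with unknown characters counted one by one".

```python
from itertools import product

KANJI_POWER = "\u529b"  # the kanji that OCR often confuses with katakana "ka"
KATAKANA_KA = "\u30ab"  # katakana "ka"

# Hypothetical toy data, for illustration only.
WORDS = {KATAKANA_KA + "メラ", "を", "買う"}
CONFUSABLE = {KANJI_POWER: KATAKANA_KA, KATAKANA_KA: KANJI_POWER}


def parse_cost(text, words=WORDS):
    """Minimum number of segments needed to cover text.

    A segment is either a dictionary word or a single character, so a
    lower cost means the text splits into fewer, longer known words.
    """
    n = len(text)
    best = [0] + [n + 1] * n  # best[i] = minimum segments for text[:i]
    for i in range(n):
        # Fallback: consume one character as its own segment.
        best[i + 1] = min(best[i + 1], best[i] + 1)
        # Try every dictionary word of length >= 2 starting at i.
        for j in range(i + 2, n + 1):
            if text[i:j] in words:
                best[j] = min(best[j], best[i] + 1)
    return best[n]


def fix_up(text, confusable=CONFUSABLE, words=WORDS):
    """Try every combination of confusable replacements, keep the cheapest.

    The unmodified text is generated first, so it wins ties.
    """
    options = [(c, confusable[c]) if c in confusable else (c,) for c in text]
    candidates = ("".join(combo) for combo in product(*options))
    return min(candidates, key=lambda s: parse_cost(s, words))
```

The search is exponential in the number of confusable characters in the input, which is tolerable for a single speech bubble but would need pruning for longer text.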


Alex N - 2017-09-15

I ended up finding Tesseract 4 and hacking together an interface for it out of ShareX, ImageMagick, and a shell script. It basically never gets the text layout wrong.

ImageMagick options:

convert -alpha off -auto-level -filter Sinc -define filter:window=Hann -define filter:lobes=3 -distort resize 200% -sigmoidal-contrast 5x50% -unsharp 0x3 -distort resize 50% -set units PixelsPerInch -density 600

Tesseract 4 config:

preserve_interword_spaces 0
paragraph_text_based 0
textord_old_baselines 1
lstm_use_matrix 1
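For anyone wanting to reproduce this, here is a rough Python sketch of how the two pieces might be chained. It is not the original shell script, which was not posted. The file names, the jpn_vert language choice, and the function names are assumptions; the ImageMagick options and the Tesseract config lines are copied from above.

```python
import subprocess
from pathlib import Path

# Options copied from the ImageMagick command above.
IMAGEMAGICK_OPTIONS = [
    "-alpha", "off", "-auto-level",
    "-filter", "Sinc",
    "-define", "filter:window=Hann",
    "-define", "filter:lobes=3",
    "-distort", "resize", "200%",
    "-sigmoidal-contrast", "5x50%",
    "-unsharp", "0x3",
    "-distort", "resize", "50%",
    "-set", "units", "PixelsPerInch",
    "-density", "600",
]

# Config variables copied from the Tesseract 4 config above.
TESSERACT_CONFIG = """\
preserve_interword_spaces 0
paragraph_text_based 0
textord_old_baselines 1
lstm_use_matrix 1
"""


def build_commands(src, cleaned, config, lang="jpn_vert"):
    """Return the (convert, tesseract) argument lists without running them.

    Assumption: vertical Japanese text, hence the jpn_vert default.
    """
    convert_cmd = ["convert", str(src), *IMAGEMAGICK_OPTIONS, str(cleaned)]
    # "stdout" as the output base makes tesseract print the recognized text.
    tesseract_cmd = ["tesseract", str(cleaned), "stdout", "-l", lang, str(config)]
    return convert_cmd, tesseract_cmd


def ocr(src, workdir):
    """Preprocess src with ImageMagick, then OCR it with Tesseract."""
    workdir = Path(workdir)
    config = workdir / "manga.config"  # hypothetical file name
    config.write_text(TESSERACT_CONFIG, encoding="utf-8")
    convert_cmd, tesseract_cmd = build_commands(src, workdir / "cleaned.png", config)
    subprocess.run(convert_cmd, check=True)
    result = subprocess.run(
        tesseract_cmd, check=True, capture_output=True, text=True, encoding="utf-8"
    )
    return result.stdout
```

Running ocr requires both the convert and tesseract executables on the PATH, plus the traineddata for whichever language is passed as lang.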

Last edit: Alex N 2017-09-15


cb4960 - 2017-09-17

Here is a workaround for this issue:

1) Create a text file and name it something like my_tess_config.txt
2) Add the following line to this file:

textord_min_linesize 1.25

3) Open the Capture2Text Settings and navigate to the "OCR 1" tab.
4) In the "Tesseract Config File" option, select the file you created in step 1 (my_tess_config.txt).
5) Click OK.

Currently Capture2Text sets the "textord_min_linesize" option to 2.5 to help increase accuracy, but apparently it can lead to reduced accuracy in some cases. The above procedure sets it back to its default of 1.25. I'll look into a proper fix for the next release.

cb4960 - 2017-09-17

labels: --> bug

status: open --> accepted

assigned_to: Christopher Brochtrup

cb4960 - 2017-10-22

status: accepted --> closed

cb4960 - 2017-10-22

Fixed in 4.5.0.

For Japanese, Capture2Text now sets textord_min_linesize to 2.0 when multiple lines are detected; otherwise it uses 2.5.
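Expressed as code, the rule reads roughly like the sketch below. This is not the actual Capture2Text source; the function names and the way the language and line count are passed in are assumptions, and "otherwise" is read here as covering both single-line Japanese captures and all other languages.

```python
def choose_min_linesize(language: str, line_count: int) -> float:
    """Pick the textord_min_linesize value according to the rule above."""
    if language == "Japanese" and line_count > 1:
        return 2.0
    return 2.5


def config_line(language: str, line_count: int) -> str:
    """Format the setting as it would appear in a Tesseract config file."""
    return f"textord_min_linesize {choose_min_linesize(language, line_count)}"
```

For a three-line Japanese capture, config_line("Japanese", 3) gives the line "textord_min_linesize 2.0".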
