Multiple languages / remove line breaks in bulk

Brought to you by: nguyenq

Forum: Help

Creator: Evander v

Created: 2020-10-11

Updated: 2020-10-29

Evander v - 2020-10-11

Hi
I am using VietOCR and have had a few problems. I have a text mostly in English with some Greek so I have downloaded the language packs via the VietOCR menu. However when I run it even with both English and Greek selected, it parses the Greek text in English characters. I've tried Greek, Ancient Greek and Greek Script.

My other problem is I would like to run OCR in bulk and also remove the line breaks. Is there a way to do this? I've run the bulk process successfully otherwise. On single pages I can remove line breaks when I run the OCR in the program but is there a way to do this in bulk?

Thanks in advance
Evander

Quan Nguyen - 2020-10-12

Evander,

If you use the Java version, you'd enter either ell+eng or eng+ell in the OCR Language box. I suspect you would get the same results if you ran Tesseract at the command line. Can you verify?

OK, I've uploaded a version that provides support for pre- and post-processing to Bulk/Batch ops. Please try it out and provide feedback.

Thanks,
Quan

Evander v - 2020-10-12

Hi, Quan, thanks for your reply.
I am trying now to run the Java version. When I go to download the language data, it tells me I don't have write access to the tessdata folder. Looking at the folder settings, I have full access so I don't understand why it won't work. However, I downloaded and put the ell file in manually and it's now appearing in the program.

I was still getting the same results with eng+ell: the Greek text is coming out as English characters. I tested it with an image I made and it only seems to recognise Greek if the entire line of the text is in Greek. Greek words within English sentences are still coming out as English characters. Even if a whole sentence is in Greek letters in the image, it is parsed as English characters if there are English characters on the same line. Any ideas how to fix this? Unfortunately I don't have experience running Tesseract from the command line so I can't say if it's the same.

The post-processing for the batch ops works. Thank you!
Evander

Quan Nguyen - 2020-10-12

Tesseract Windows installer is available here: https://github.com/UB-Mannheim/tesseract/wiki

https://tesseract-ocr.github.io/tessdoc/Command-Line-Usage.html

If the CLI produces the same result, you may want to put in a ticket at https://github.com/tesseract-ocr/tesseract/issues or ask questions in https://groups.google.com/g/tesseract-ocr .
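For anyone unfamiliar with the command line, the check suggested above can be sketched as a small Python wrapper around the Tesseract CLI. This is only a sketch under stated assumptions: it assumes `tesseract` is on the PATH and that the `eng` and `ell` language data are installed; the helper names `build_command` and `run_ocr` and the file name `sample.png` are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(image, output_base, languages=("eng", "ell")):
    """Build the argument list for: tesseract IMAGE OUTPUTBASE -l eng+ell"""
    return ["tesseract", str(image), str(output_base), "-l", "+".join(languages)]


def run_ocr(image, output_base, languages=("eng", "ell")):
    """Run Tesseract on one image; Tesseract writes OUTPUTBASE.txt."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(build_command(image, output_base, languages), check=True)
    return Path(str(output_base) + ".txt")


# Hypothetical usage: OCR one mixed English/Greek test image.
# run_ocr("sample.png", "sample")
```

If the resulting text file shows the Greek words in Greek script, the difference lies in how VietOCR drives the engine (or in the engine version) rather than in the language data.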

Evander v - 2020-10-13

Hi Quan
I have managed to get it running from the command line. It correctly parses Greek text wherever it appears in the sample, which is great! My next question is whether it is possible to run batch OCR from the command line, as VietOCR is still giving me Greek characters only when the whole line is in Greek. An option for me would be to run the pages with Greek manually, but I'd rather avoid that if possible!
Thanks again

Quan Nguyen - 2020-10-13

Your Tesseract version is 5.0.0-alpha whereas the version used in VietOCR is 4.1.1 -- I wonder if that made the difference. Did you experience the same issue with both VietOCR Java and .NET versions?

You certainly can write a DOS batch or PowerShell script for your batch program if you're on Windows.
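A script in any language would do the same job. As a rough illustration, here is a minimal Python sketch of a batch run that also removes line breaks. It assumes `tesseract` is on the PATH with the `eng` and `ell` data installed; the function names and the list of image extensions are hypothetical choices, and the line-joining rule (join lines within a paragraph, keep blank lines between paragraphs) is an assumption that may differ from VietOCR's own post-processing.

```python
import re
import subprocess
from pathlib import Path

# Assumed set of image types to process; adjust as needed.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def join_lines(text):
    """Join the lines of each paragraph with spaces, keeping blank-line breaks."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    return "\n\n".join(" ".join(p.split()) for p in paragraphs)


def ocr_image(image, languages="eng+ell"):
    """Run Tesseract on one image and return the recognised text.

    Passing "stdout" as the output base makes Tesseract print the text
    instead of writing a file.
    """
    result = subprocess.run(
        ["tesseract", str(image), "stdout", "-l", languages],
        check=True, capture_output=True, encoding="utf-8",
    )
    return result.stdout


def batch_ocr(folder, languages="eng+ell"):
    """OCR every image in a folder, writing a .txt file next to each image."""
    written = []
    for image in sorted(Path(folder).iterdir()):
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        out = image.with_suffix(".txt")
        out.write_text(join_lines(ocr_image(image, languages)), encoding="utf-8")
        written.append(out)
    return written


# Hypothetical usage: batch_ocr("scans")
```

Note that this sketch does not rejoin words hyphenated across a line break; those would need an extra rule.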

Evander v - 2020-10-17

Hi Quan, that may be the difference. Yes, both the Java and .NET versions only parse Greek if the entire line is in Greek characters.

Ah, I don't have enough experience to script such a thing, but it's useful to know it can be done. I'll ask others I know who have experience with running Tesseract from the command line.

Evander

Quan Nguyen - 2020-10-18

Evander,

I put VietOCR-6.0.0-Beta out there for you to try. It uses the Tesseract 5.0.0-alpha library. Let me know if it works for you.

Thanks.

Evander v - 2020-10-29

Hi Quan
Thank you so much. Now the VietOCR program is working perfectly for my needs.

Evander
