Anonymous - 2014-02-16
Hello and thank you very much for Subtitle Workshop's new version :)
I'm trying to make an OCR script for French and I've been running into a few problems:
1) First, it seems that the SWOCR WordChars in my script is not taken into account at all:
a) The complete French WordChars should look like this (the '\' before '>' are not included in my ocr script):
<SWOCR WordChars=
"0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_çÇœŒéàèùâêîôûäëïöüÉÀÈÙÂÊÎÔÛÄËÏÖÜ">
Thinking that it wasn't taken into account because it was more than 80 characters, I then reduced it to:
<SWOCR WordChars="0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_çÇéÉàèùëïüâêîôû">
To no avail.
For instance, the replacement of the string 'aaa' by 'ççç' (it's a test) gives:
ççç
as a result.
2) Now when entering hex codes it works:
but if I want to replace 'Ca' by 'Ça' (a very frequent OCR mistake in French) it, of course, doesn't work (for the same reasons as in 1)).
3) Besides, if I want to replace the string by the relevant hex codes, it doesn't work either.
For instance, the following line in your Default.ocr script doesn't work. It replaces ' "" ' by ' \x22 ', not by ' " '
Putting an escape '\' before '"' doesn't work either, as in
I would be very grateful if you could help me with these problems and tell me if
Thank you very much in advance.
Hi there.
First of all, your problem is that your OCR script file is saved with Unicode encoding. It needs to be saved with ANSI encoding. I don't know what program you're editing your OCR script file with, but it saves it as Unicode. You need to be very careful with this and make sure you save the file as ANSI. I'd recommend using Notepad++.
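If you'd rather check and convert the file with a script than rely on an editor's "Save as" dialog, here is a minimal Python sketch. It is only an illustration and makes two assumptions that are not from this thread: that "ANSI" means the Western European Windows code page (cp1252), and that a file without a byte-order mark is either UTF-8 or already ANSI. The function names are made up for the example.

```python
import codecs
from typing import Optional


def detect_bom(data: bytes) -> Optional[str]:
    """Return the codec implied by a byte-order mark, or None if there is none."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"  # this codec strips the BOM while decoding
    if data.startswith(codecs.BOM_UTF16_LE) or data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16"     # this codec reads the BOM to pick the byte order
    return None


def to_ansi(data: bytes, ansi: str = "cp1252") -> bytes:
    """Re-encode script bytes as ANSI (assumed here to be cp1252)."""
    enc = detect_bom(data)
    if enc is None:
        try:
            # No BOM: accept the bytes as UTF-8 only if they decode cleanly.
            text = data.decode("utf-8")
        except UnicodeDecodeError:
            # Not valid UTF-8, so assume the file is already ANSI.
            return data
    else:
        text = data.decode(enc)
    return text.encode(ansi)
```

To use it, read the script with open(path, "rb"), pass the bytes to to_ansi, and write the result back with open(path, "wb"). Keep a backup of the original file first.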
Once you're sure your file is saved as ANSI, this will work like a charm:
About the other issues.
1) This line
is indeed wrong and doesn't work. In order to work properly, it should be like this:
2) The WordChars variable doesn't have a limited length. It IS taken into account; the real cause is the encoding problem I explained above.
3) You can't replace strings with hex codes. That's just how the RegEx library that Subtitle Workshop uses works (but I'm not 100% sure, I'll have to double-check on that).
So no need to recompile the code :)
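To see why the encoding alone is enough to break accented characters in both WordChars and the replacement strings, here is a small Python sketch. It assumes (this is my assumption, not something confirmed in this thread) that the script bytes end up being interpreted as cp1252 even though the file was saved as UTF-8. The helper name is hypothetical.

```python
def as_seen_by_ansi(text: str) -> str:
    """Show how UTF-8 encoded text looks when its bytes are read as cp1252."""
    # errors="replace" covers the few byte values cp1252 leaves undefined.
    return text.encode("utf-8").decode("cp1252", errors="replace")


# Plain ASCII survives unchanged, which is why unaccented rules keep working.
print(as_seen_by_ansi("aaa"))
# Each accented letter turns into two unrelated characters.
print(as_seen_by_ansi("ç"))
print(as_seen_by_ansi("Ça"))
```

Under that assumption, a WordChars list saved as UTF-8 no longer contains 'ç' or 'Ç' at all from the program's point of view, no matter how long or short the list is.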
Thank you very much, Mr. Spiridonov, for taking the time to answer and help me. I was using Sublime Text but wasn't aware of the original saving format. Now everything is in Unicode :)
I'm testing my script right now, checking the replacements one by one, and there is something happening that shouldn't.
My original script (French.ocr), the only one I selected in my parameters, doesn't correct the I/l OCR mistakes for the moment because English/French corrections on this shouldn't be the same and are sometimes contradictory.
Now what happens is that though my script doesn't include I (uppercase I) / l (lowercase l) corrections at all, when I check OCR errors, it still takes the Default.ocr replacements into account, while it shouldn't.
I tried to rename my script to Default.ocr (after renaming the old one to _Default.ocr), but it doesn't change anything: the replacements of the original Default.ocr are taken into account anyway.
I tried to restart SW, select the script before I open the subtitles file to be sure, but it didn't help.
Is there something I'm missing here (probably :) )?
Thank you again for your help.
Subtitle Workshop uses only the currently selected OCR script; it can't use both your French.ocr and the Default.ocr files. That's actually a feature I want to implement, but I haven't yet.
Are you sure you haven't copied the I/l OCR script code into your French.ocr by mistake?
Can you attach your French.ocr file so that I could check it out?
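In the meantime, a quick way to check for accidentally copied rules is to list the lines that two scripts have in common. The Python sketch below is only an illustration: it compares lines as plain text and assumes nothing about the .ocr syntax, and the function name is made up for the example.

```python
from typing import List


def shared_rules(script_a: str, script_b: str) -> List[str]:
    """Return the non-empty lines of script_a that also appear in script_b."""
    lines_b = {line.strip() for line in script_b.splitlines() if line.strip()}
    seen = set()
    shared = []
    for line in script_a.splitlines():
        stripped = line.strip()
        # Keep the order of script_a and report each shared line once.
        if stripped and stripped in lines_b and stripped not in seen:
            seen.add(stripped)
            shared.append(stripped)
    return shared
```

Read French.ocr and Default.ocr into two strings (using the same encoding for both) and pass them in. Header and other structural lines will naturally show up as shared too, so only the rule lines in the output need a closer look.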