
French OCR Script

Help
Anonymous
2014-01-26
2014-03-04
  • Anonymous

    Anonymous - 2014-01-26

    Hello and thank you very much for Subtitle Workshop's new version :)

    I'm trying to make an OCR script for French and I've been running into a few problems:

    1) First, it seems that the SWOCR WordChars in my script is not taken into account at all:

    a) The complete French WordChars should look like this (the '\' before '>' is not included in my OCR script):

    <SWOCR WordChars=
    "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_çÇœŒéàèùâêîôûäëïöüÉÀÈÙÂÊÎÔÛÄËÏÖÜ">

    Thinking that it wasn't taken into account because it was more than 80 characters, I then reduced it to:

    <SWOCR WordChars="0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_çÇéÉàèùëïüâêîôû">

    To no avail.

    For instance, the replacement of the string 'aaa' by 'ççç' (it's a test) gives:
    ççç
    as a result.
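To illustrate what a word-character set is for, here is a minimal Python sketch. It assumes that WordChars simply lists the characters that count as part of a word (the role `\w` plays in a regular expression). The names `WORD_CHARS`, `find_words`, and `find_words_ascii` are hypothetical, and Python's `re` module is not the engine Subtitle Workshop uses.

```python
import re

# The full French set quoted above (assumption: it only needs to list word characters).
WORD_CHARS = (
    "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_"
    "çÇœŒéàèùâêîôûäëïöüÉÀÈÙÂÊÎÔÛÄËÏÖÜ"
)

# Build a character class from the set; re.escape guards any special characters.
_WORD = re.compile("[" + re.escape(WORD_CHARS) + "]+")
_ASCII_WORD = re.compile("[A-Za-z0-9_]+")


def find_words(text: str) -> list:
    """Split text into words using the French word-character set."""
    return _WORD.findall(text)


def find_words_ascii(text: str) -> list:
    """Same, but with an ASCII-only set: accented letters break words apart."""
    return _ASCII_WORD.findall(text)


print(find_words("Ça va, garçon ?"))        # ['Ça', 'va', 'garçon']
print(find_words_ascii("Ça va, garçon ?"))  # ['a', 'va', 'gar', 'on']
```

With the ASCII-only set, 'ç' and 'Ç' act as word boundaries, which is the kind of behavior a complete WordChars line is meant to avoid.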

    2) Entering hex codes, on the other hand, does work:

    <ERROR UseREOnlyToFind="True" Find="\w(\xE0)\w" ReplaceBy="a"> (Find 'à') 
    <ERROR UseREOnlyToFind="True" Find="\w(\xE7)\w" ReplaceBy="c"> (Find 'ç') 
    <ERROR UseREOnlyToFind="True" Find="\w(\xC7)\w" ReplaceBy="C"> (Find 'Ç')
    

    but if I want to replace 'Ca' with 'Ça' (a very frequent OCR mistake in French), it of course doesn't work (for the same reasons as in 1)).
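The hex codes used above (E0, E7, C7) are the single-byte values of 'à', 'ç', and 'Ç' in the Western European ANSI code page. A small Python sketch for looking up such codes follows. It assumes "ANSI" means Windows-1252 here, and `hex_escape` is a hypothetical helper, not part of Subtitle Workshop.

```python
def hex_escape(ch: str, encoding: str = "cp1252") -> str:
    """Return the \\xNN escape for a character that fits in one byte of the code page."""
    raw = ch.encode(encoding)
    if len(raw) != 1:
        raise ValueError("%r is not a single byte in %s" % (ch, encoding))
    return "\\x%02X" % raw[0]


for ch in "àçÇ":
    print(ch, hex_escape(ch))
# à \xE0
# ç \xE7
# Ç \xC7
```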

    3) Besides, if I want to use hex codes in the replacement string, it doesn't work either.

    For instance, the following line in your Default.ocr script doesn't work: it replaces ' "" ' with ' \x22 ', not with ' " '.

    <ERROR UseREOnlyToFind="False" Find="\x22{2,}" ReplaceBy="\x22">
    

    Putting an escape '\' before '"' doesn't work either, as in:

    <ERROR UseREOnlyToFind="False" Find="\x22{2,}" ReplaceBy="\"">
    

    I would be very grateful if you could help me with these problems and tell me:

    • whether the WordChars variable has a limited length
    • why it isn't taken into account
    • how to replace strings with hex codes properly (in case I have to)
    • whether I have to modify RegExpr.pas, OCRScripts.pas, or both, in case I have to revise your code and recompile it (I would prefer not :) )

    Thank you very much in advance.

     
  • Andrey Spiridonov

    Hi there.

    First of all, your problem is that your OCR script file is saved with a Unicode encoding; it needs to be saved with ANSI encoding. I don't know which program you're editing the file with, but it saves it as Unicode, so be very careful to save the file as ANSI every time. I'd recommend using Notepad++.
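If the editor cannot be trusted to save as ANSI, the file can also be re-encoded afterwards. The following Python sketch shows one way to do that. It assumes "ANSI" means Windows-1252 (the usual Western European code page) and that input without a byte order mark is UTF-8; `to_ansi` is a hypothetical helper.

```python
import codecs


def to_ansi(raw: bytes) -> bytes:
    """Re-encode script bytes from UTF-8 or UTF-16 to Windows-1252.

    Input that is not valid UTF-8 is assumed to be ANSI already and is
    returned unchanged.
    """
    if raw.startswith(codecs.BOM_UTF8):
        text = raw[len(codecs.BOM_UTF8):].decode("utf-8")
    elif raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        text = raw.decode("utf-16")  # the codec consumes the byte order mark
    else:
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError:
            return raw  # assumption: already a single-byte ANSI file
    # Raises UnicodeEncodeError if a character has no Windows-1252 equivalent.
    return text.encode("cp1252")


# In UTF-8, 'Ç' takes two bytes; in Windows-1252 it is the single byte 0xC7.
print(to_ansi("Ça".encode("utf-8")))  # b'\xc7a'
```

To convert a file, read it in binary mode, pass the bytes through `to_ansi`, and write the result back in binary mode, keeping a backup of the original.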

    Once you're sure your file is saved as ANSI, this will work like a charm:

    <ERROR UseREOnlyToFind="False" Find="Ca" ReplaceBy="Ça">
    

    About the other issues.

    1) This line

    <ERROR UseREOnlyToFind="False" Find="\x22{2,}" ReplaceBy="\x22">
    

    is indeed wrong and doesn't work. In order to work properly, it should be like this:

    <ERROR UseREOnlyToFind="False" Find="\x22{2,}" ReplaceBy=""">
    

    2) The WordChars variable doesn't have a limited length. It IS taken into account; the encoding problem I explained above is what made it look otherwise.

    3) As far as I know, you can't use hex codes in the replacement string; that's a limitation of the RegEx library Subtitle Workshop uses (I'm not 100% sure, though, and will have to double-check).

    So no need to recompile the code :)
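For readers who want to see what the `\x22{2,}` rule is meant to do, here is a minimal Python sketch of the same idea: the hex escape appears only in the search pattern, and the replacement is a literal double quote. Python's `re` module stands in for the engine Subtitle Workshop uses, so this is an analogy rather than the program's own behavior; `collapse_quotes` is a hypothetical name.

```python
import re


def collapse_quotes(text: str) -> str:
    """Replace every run of two or more double quotes with a single one."""
    # \x22 is the double-quote character; {2,} matches two or more in a row.
    return re.sub(r"\x22{2,}", '"', text)


print(collapse_quotes('He said ""hello"" twice'))  # He said "hello" twice
```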

     
  • Anonymous

    Anonymous - 2014-02-16

    Thank you very much, Mr. Spiridonov, for taking the time to answer and help me. I was using Sublime Text but wasn't aware of the original saving format. Now everything is in Unicode :)

    I'm testing my script right now, checking the replacements one by one, and there is something happening that shouldn't.

    My original script (French.ocr), the only one I selected in my parameters, doesn't correct the I/l OCR mistakes for the moment because English/French corrections on this shouldn't be the same and are sometimes contradictory.

    What happens is that, although my script contains no I (uppercase I) / l (lowercase l) corrections at all, checking for OCR errors still applies the Default.ocr replacements, which it shouldn't.

    I tried renaming my script to Default.ocr (and the old one to _Default.ocr), but it doesn't change anything: the replacements of the original Default.ocr are applied anyway.

    I also tried restarting SW and selecting the script before opening the subtitle file, to be sure, but it didn't help.

    Is there something I'm missing here (probably :) )?

    Thank you again for your help.

     
  • Andrey Spiridonov

    Subtitle Workshop uses only the currently selected OCR script; it can't use both your French.ocr and the Default.ocr files. Using several scripts at once is a feature I want to implement, but I haven't yet.

    Are you sure you haven't copied the I/l OCR script code into your French.ocr by mistake?

    Can you attach your French.ocr file so that I could check it out?

     
