Problem training a new font

2012-07-08
2013-04-25
  • Philip Devine

    Philip Devine - 2012-07-08

    I went through the tutorial for training Tesseract 3.x and got to the point where I had a training file, but when I tried running tesseract on the TIFF file, it gave the error "Assert failed: in file ../classify/adaptmatch.cpp line 512".

    I may have done something wrong along the way, though. One reason I think so is that, to train a new English font, I would expect to run combine with the regular eng file and my new tr file, but I don't have a plain eng.tr file. Here are the steps and commands I used to train it:

    Generated a box file and corrected the errors by hand.
    Then I created a font_properties.txt file containing just "dia 0 0 0 0 0".
    Then I ran the generate unicharset executable on my box file.
    Then I ran "tesseract eng.dia.exp0.tiff eng.dia.exp0 nobatch box.train", which had a few failures, but I deleted them. No fatal errors.
    Then I ran "mftraining -F font_properties.txt -U unicharset -O dia.unicharset eng.dia.exp0.tr"
    Then I ran "cntraining eng.dia.exp0.tr"
    Then I tried two things. I tried "combine_tessdata eng.", which failed because there were no eng files, which is not surprising. So I ran "combine_tessdata dia.", which ran fine and output the offsets. The odd thing was that offset 0 was -1, while offset 1 was 108, and the rest were -1. I thought offset 0 should be a non-negative number.
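
    For reference, the command sequence above can be written down as a small Python sketch. It only builds and prints the command lines; it does not run anything. The helper name `training_commands` and its parameters are made up for illustration, `unicharset_extractor` is an assumption about which tool the "generate unicharset" step refers to, and the output-unicharset flag is written as `-O` (capital letter O).

```python
import shlex


def training_commands(image_lang, font, out_lang, exp=0, image_ext="tiff"):
    """Build the training command lines described in the post (hypothetical helper).

    image_lang/font/exp form the base name, e.g. eng.dia.exp0.
    out_lang is the prefix used for the combined data, e.g. dia.
    """
    base = f"{image_lang}.{font}.exp{exp}"
    return [
        # Assumption: the "generate unicharset" step is unicharset_extractor.
        ["unicharset_extractor", f"{base}.box"],
        # Produce the .tr file from the image and the hand-corrected box file.
        ["tesseract", f"{base}.{image_ext}", base, "nobatch", "box.train"],
        ["mftraining", "-F", "font_properties.txt", "-U", "unicharset",
         "-O", f"{out_lang}.unicharset", f"{base}.tr"],
        ["cntraining", f"{base}.tr"],
        # The trailing dot is part of the prefix argument.
        ["combine_tessdata", f"{out_lang}."],
    ]


if __name__ == "__main__":
    for cmd in training_commands("eng", "dia", "dia"):
        print(shlex.join(cmd))
```

    Keeping the base name in one place makes it harder for the file names to drift apart between steps.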

    After that, I tried running tesseract with -l dia, and it failed with the error above.

    If you can see where I went wrong here, it would be great if you could let me know! Thanks.

     
  • Peter Edmond

    Peter Edmond - 2012-07-09

    It'd be nice if the errors were a bit more helpful, but it works well when it's set up correctly.

    I suspect that you've forgotten to rename the files that were created during the training process. In your case, with the chosen language being 'eng', rename:

    inttemp -> eng.inttemp
    Microfeat -> eng.Microfeat
    normproto -> eng.normproto
    pffmtable -> eng.pffmtable
    unicharset -> eng.unicharset

    http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3#Putting_it_all_together
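
    As a rough sketch of that renaming step in Python: the helper name `prefix_training_files` is hypothetical, and the file list is simply the five names listed above. Files that are not present are skipped rather than treated as errors.

```python
from pathlib import Path

# The five training outputs named in the list above.
TRAINING_OUTPUTS = ["inttemp", "Microfeat", "normproto", "pffmtable", "unicharset"]


def prefix_training_files(directory, lang):
    """Rename e.g. inttemp -> <lang>.inttemp inside `directory`.

    Returns the new file names of everything that was renamed.
    """
    directory = Path(directory)
    renamed = []
    for name in TRAINING_OUTPUTS:
        src = directory / name
        if src.exists():
            dst = directory / f"{lang}.{name}"
            src.rename(dst)
            renamed.append(dst.name)
    return renamed
```

    For example, `prefix_training_files(".", "eng")` would rename whichever of those files exist in the current directory before combine_tessdata is run.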

    Let us know how it goes :)

     
  • Peter Edmond

    Peter Edmond - 2012-07-09

    Obviously, if you want the language to be dia, then you prefix the file names with dia instead.

     
  • Philip Devine

    Philip Devine - 2012-07-09

    Thanks for the response! I actually found that out last night after searching a bit more, and it made a more complete traineddata file, but I still get the same error. I think my problem is that I am naming it "eng.dia.exp0", because technically the language is still "eng" but my new font is "dia". Should I just create a whole new language even though I am only really training a font? Thanks for the help so far!

     
  • Peter Edmond

    Peter Edmond - 2012-07-09

    I guess the question is how you're going to use the trained data. If it's a font, then make sure you use the font_properties file as part of your training, and the language is eng. Having multiple fonts within the 'eng' language is fine. You then run combine_tessdata. (Remember that the language includes not only the OCR information, but also language information such as dictionary words.)

    If, as in my case, you know that a specific bit of text is always going to be in your chosen font, then by all means set it up as a separate language. You can specify the language that you wish to use. In Java:

    Tesseract instance = Tesseract.getInstance();
    instance.setLanguage("dia");

    The training process is very sensitive to the exact format of the file names (and the errors aren't very helpful), so I suggest you decide on either dia or eng as the language that you wish to use, and stick with it.
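
    To catch naming slips early, a check along these lines might help. It is only a sketch: the pattern assumes the lang.fontname.expN naming seen in this thread (for example eng.dia.exp0.tiff), and the helper name `parse_training_name` is hypothetical.

```python
import re

# Assumed naming scheme: <lang>.<fontname>.exp<num>.<ext>
TRAINING_NAME = re.compile(
    r"(?P<lang>[A-Za-z_]+)\.(?P<font>[A-Za-z0-9_]+)\.exp(?P<num>\d+)\.(?P<ext>tiff?|box|tr)"
)


def parse_training_name(filename):
    """Return the name parts as a dict, or None if the name does not fit."""
    match = TRAINING_NAME.fullmatch(filename)
    return match.groupdict() if match else None
```

    With this, `parse_training_name("eng.dia.exp0.tiff")` yields the language, font, experiment number and extension, while a name missing the language part, such as "dia.exp0.tiff", yields None.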

     
  • Philip Devine

    Philip Devine - 2012-07-09

    How would I go about combining eng with my font? I see that combine_tessdata takes in tr files, but I don't have a tr file for eng, only the trained data. I created the font properties file, and that part seemed to go through OK. Any idea what I'm missing as far as combining this with eng?

     
  • Peter Edmond

    Peter Edmond - 2012-07-09

    Nope - annoyingly I can't find it at the moment. If I find a download then I'll post a link.

    Of course, you can always go for the dia approach?

     
