I went through the tutorial for training Tesseract 3.x and got to the point where I had a training file, but when I tried running tesseract on the TIFF file, it gave the error "Assert failed: in file ../classify/adaptmatch.cpp line 512".
I may have done something wrong along the way, though. One reason I think so: to train a new English font, I would expect to run the combine step with the regular eng file plus my new tr file, but I don't have a plain eng.tr file. Here are the steps and commands I used to train it:
Generated box file, edited errors by hand.
Then I created a font_properties.txt file with just "dia 0 0 0 0 0".
Then I ran the generate unicharset executable on my box file.
Then I ran "tesseract eng.dia.exp0.tiff eng.dia.exp0 nobatch box.train", which reported a few failures and deleted those entries, but there were no fatal errors.
Then I ran "mftraining -F font_properties.txt -U unicharset -O dia.unicharset eng.dia.exp0.tr"
Then I ran "cntraining eng.dia.exp0.tr"
Then I tried two things. First I tried "combine_tessdata eng.", which failed because there were no eng files; not surprising. So I ran "combine_tessdata dia.", which ran fine and output the offsets. The weird thing was that offset 0 was -1, offset 1 was 108, and the rest were -1. I thought offset 0 should be a non-negative number.
After that, I tried running tesseract with -l dia, and it fails with the error above.
If you see where I went wrong here, it would be great if you could let me know! Thanks
It'd be nice if the errors were a bit more helpful, but it works well when it's set up correctly.
I suspect that you forgot to rename the files that you created as part of the training process. In your case, with the chosen language being 'eng', rename:
inttemp -> eng.inttemp
Microfeat -> eng.Microfeat
normproto -> eng.normproto
pffmtable -> eng.pffmtable
unicharset -> eng.unicharset
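If it helps, the renaming can be scripted. This is only a rough sketch: prefix_training_files is a made-up helper name, and it assumes the five files listed above sit unprefixed in one directory. Pass whichever language code you settle on as the first argument.

```shell
# Hypothetical helper: prefix the training outputs in a directory with a
# language code, e.g. inttemp -> dia.inttemp. Files that are not present
# are skipped silently.
prefix_training_files() {
    lang=$1
    dir=${2:-.}
    for name in inttemp Microfeat normproto pffmtable unicharset; do
        if [ -f "$dir/$name" ]; then
            mv "$dir/$name" "$dir/${lang}.$name"
        fi
    done
}

# Example: prefix_training_files eng .
```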
Let us know how it goes:)
Obviously if you want the language to be dia, then you prefix the file names with dia.
Will probably help as well:)
Thanks for the response! I actually found that out last night after searching a bit more, and it made a more complete traineddata file, but I still get the same error. I think my problem is that I am naming it "eng.dia.exp0", because technically the language is still "eng" but my new font is "dia". Should I just create a whole new language even though I am only really training a font? Thanks for the help so far!
I guess the question is how you're going to use the trained data. If it's a font, then make sure you use the font_properties file as part of your training, and the language is eng. Having multiple fonts within the 'eng' language is fine. You then run combine_tessdata. (Remember that the language includes not only the OCR information, but also language information such as dictionary words etc.)
If, as in my case, you know that a specific bit of text is always going to be in your chosen font, then by all means set it up as a separate language. You can specify the language that you wish to use. In Java:
Tesseract instance = Tesseract.getInstance();
instance.setLanguage("dia"); // must match the prefix of dia.traineddata in tessdata
The training process is very sensitive to the correct format of the file names (and the errors aren't that helpful), so I suggest deciding on either dia or eng as the language that you wish to use, and going with that.
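Since the file names matter so much, a quick check before running combine_tessdata can save some guessing. Again only a sketch: check_training_files is a made-up helper, and the four file names are my assumption about the minimum that combine_tessdata needs for 3.x. As far as I understand it, an offset of -1 in the combine_tessdata output means that component was not found under the given prefix.

```shell
# Hypothetical pre-flight check: report which of the prefixed training files
# are missing from a directory. The list of four names is an assumption.
# Returns 0 if all are present, 1 otherwise.
check_training_files() {
    lang=$1
    dir=${2:-.}
    missing=0
    for name in unicharset inttemp pffmtable normproto; do
        if [ ! -f "$dir/${lang}.$name" ]; then
            echo "missing: ${lang}.$name"
            missing=1
        fi
    done
    return "$missing"
}

# Example: check_training_files dia . && combine_tessdata dia.
```

If it prints nothing, the names at least line up with the prefix you pass to combine_tessdata.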
How would I go about combining eng with my font? I see that combine_tessdata takes in tr files, but I don't have a tr file for eng, only trained data. I created the font properties file, and that part seemed to go through ok. Any idea what I'm missing as far as combining this with eng?
The .tr/tiff files for the eng language are available as a download somewhere. I haven't got a copy on this PC, I'm afraid, but I'm sure you can find them if you search for a download.
Nope - annoyingly I can't find it at the moment. If I find a download then I'll post a link.
Of course, you can always go for the dia approach?