I want to compare an original text file with its OCR'ed result and get a percentage of
similarity. Here, diff and its ilk do not help in any way - even if one asks them to
ignore whitespace. Is anyone aware of an algorithm, script, or even a better place to
ask this question?
I was thinking of counting letters and then comparing the counts, but that's too crude to be useful.
Basically, I want to compare the results of tweaking parameters, both internal and
external to tesseract, and to be able to SCRIPT this comparison process. Even visually
(and manually) reading and comparing the resulting files is problematic because of
the changes in whitespace, etc.
Can anyone suggest a reference or link to what this is "officially" called?
Yeah, I think you might get some use from checking out the Damerau-Levenshtein distance algorithm (http://en.wikipedia.org/wiki/Levenshtein_distance). It compares two strings and gives you the number of insertions, deletions, substitutions, etc. that would be needed to make them equal. It's used in spell checkers and the like, but it's a general algorithm you can tune to your needs. ;)
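To make that concrete, here's a minimal, dependency-free Python sketch of plain Levenshtein distance (not the full Damerau variant, which also counts transpositions), turned into a percentage. Collapsing whitespace first means OCR layout changes don't count against the score. The function names are just illustrative, not from any particular library:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Percentage similarity; whitespace is collapsed so layout changes don't count."""
    a, b = " ".join(a.split()), " ".join(b.split())
    if not a and not b:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / max(len(a), len(b)))
```

Reading the original and the OCR output from files and printing `similarity(original, ocr)` gives you exactly the kind of scriptable percentage you're after. Note the DP loop is O(len(a) * len(b)), so for whole books you'd want an optimized C implementation rather than pure Python.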
Hmmm, now if I could just see a response from someone on how to build tesseract as a DLL, I'd be totally into writing pretty much exactly what you're talking about, as that's my next step.
Take a look at dwdiff. It might be good enough for your purpose. It even prints out some stats:
old: 1662 words 1597 96% common 10 0% deleted 55 3% changed
new: 1666 words 1597 95% common 13 0% inserted 56 3% changed
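If you'd rather compute comparable word-level statistics from a script without shelling out to dwdiff, Python's standard-library difflib can approximate them. This is a sketch, assuming a simple whitespace word split; dwdiff's exact tokenization and matching rules differ, so the numbers won't match it exactly:

```python
import difflib

def word_stats(old_text: str, new_text: str) -> dict:
    """Word-level diff statistics, roughly analogous to dwdiff's summary lines."""
    old, new = old_text.split(), new_text.split()
    sm = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    # Total length of all matching word runs = words common to both texts.
    common = sum(size for _, _, size in sm.get_matching_blocks())
    return {
        "old_words": len(old),
        "new_words": len(new),
        "common": common,
        "deleted": len(old) - common,   # words only in the original
        "inserted": len(new) - common,  # words only in the OCR output
        "percent_common": 100.0 * common / max(len(old), 1),
    }
```

Because it works on words rather than characters, this is forgiving of whitespace changes by construction, and `percent_common` gives you a single number to track while tweaking tesseract parameters.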