Q: Get % difference between 2 text files

  • I want to compare an original text file with its OCR'ed result and get a percentage of
    similarity. Here, diff and its ilk do not help in any way - even if one asks them to
    ignore whitespace. Is anyone aware of an algorithm, script, or even a better place to
    ask this question?

    I was thinking of counting letters and then comparing differences but that's too

    Basically, I'm after comparing the results of tweaking parameters both internal and
    external to tesseract and being able to SCRIPT this comparison process. Even visually
    (and manually) reading and comparing such resulting files is problematic because of
    the changes in whitespace, etc.

    Can anyone suggest a reference or link to what this is "officially" called?


    • Meaflux

      Yeah, I think you might get some use from checking out the Levenshtein distance algorithm (http://en.wikipedia.org/wiki/Levenshtein_distance) or its Damerau-Levenshtein variant. It compares two strings and gives you the minimum number of insertions, deletions, and substitutions (plus swaps of adjacent characters in the Damerau variant) needed to turn one string into the other. Divide that number by the length of the longer text and you have a rough percentage difference. It's used in spell checkers and the like, but it is a general algorithm you can tune to your needs. ;)

      Hmmm, now if I could just see a response from someone on how to make tesseract into a dll I'd be totally into writing pretty much exactly what you're talking about, as that's my next step.
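
The edit-distance idea above can be turned into a scriptable percentage roughly as follows. This is only a minimal sketch: it uses plain Levenshtein distance (no transpositions, so not the Damerau variant), collapses whitespace before comparing, and divides by the length of the longer text. The function names `levenshtein`, `normalize`, and `similarity_percent` are made up for illustration, and the pure-Python loop is O(n*m), so it may be slow on large files.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalize(text: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, newlines) to one space."""
    return " ".join(text.split())


def similarity_percent(original: str, ocr: str) -> float:
    """Similarity as 100 * (1 - distance / length of the longer text)."""
    a, b = normalize(original), normalize(ocr)
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0  # two empty texts are treated as identical
    return 100.0 * (1 - levenshtein(a, b) / longest)
```

To compare two files, read both into strings and pass them in, for example `similarity_percent(open("original.txt").read(), open("ocr.txt").read())`, where the file names are placeholders.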


    • Roger Luethi
      Take a look at dwdiff [1]. It might be good enough for your purpose. It even prints out some stats [2]:

      old: 1662 words  1597 96% common  10 0% deleted  55 3% changed
      new: 1666 words  1597 95% common  13 0% inserted  56 3% changed

      [1] http://os.ghalkes.nl/dwdiff.html
      [2] http://www.linux.com/article.pl?sid=06/09/21/1913234
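
If dwdiff is not available, a word-level "percent common" figure similar in spirit to the stats above can be approximated with Python's standard `difflib` module. This is a sketch under assumptions, not dwdiff's actual algorithm, so the numbers will not necessarily match its output; `word_stats` is a hypothetical helper name. Splitting on whitespace makes the comparison insensitive to spacing and line-break changes.

```python
from difflib import SequenceMatcher


def word_stats(old_text: str, new_text: str) -> dict:
    """Count words shared by both texts, in order, and express them as percentages."""
    old_words, new_words = old_text.split(), new_text.split()
    # autojunk=False disables difflib's heuristic that ignores very frequent items
    matcher = SequenceMatcher(None, old_words, new_words, autojunk=False)
    common = sum(block.size for block in matcher.get_matching_blocks())
    return {
        "old_words": len(old_words),
        "new_words": len(new_words),
        "common": common,
        "old_common_pct": 100.0 * common / len(old_words) if old_words else 100.0,
        "new_common_pct": 100.0 * common / len(new_words) if new_words else 100.0,
    }
```

Calling `word_stats(original, ocr)` on the two file contents returns the word counts and the share of each file that survives unchanged, which is easy to log while tweaking tesseract parameters.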