Jaro internals

2006-10-31
2013-04-25
  • ReverendSam
    ReverendSam
    2006-10-31

    Posted here to open discussion

    Sam

    I think I've found an issue with your code in that what you have implemented for the Jaro distance metric (Jaro.java / Jaro.cs) does not correspond to the forumula you have listed for Jaro on your webpage (reference http://www.dcs.shef.ac.uk/~sam/stringmetrics.html#jarowinkler\).

    (The formula you're trying to implement is attached as the GIF taken from your webpage.)

    In your code, you calculate the final Jaro distance as below:

    //calculate jaro metric
    transpositions /= 2;

    double tmp1;
    tmp1 = commonMatches / (3.0 * firstWord.Length) + commonMatches / (3.0 * secondWord.Length) +
              (commonMatches - transpositions) / (3.0 * commonMatches);
    return tmp1;

    However, the last term in the equation [(|s'| - Ts',t') / (2 |s'|)] does not seem to be captured with the above code. Even though you're dividing the transpositions value in half, it's the whole numerator the needs to be divided by two--not just the transpositions term. I checked the SecondString library and it makes the same error… Unless the error is in the equation you have listed (I can't know for sure since I haven't been able to locate the original Jaro papers).

    The SecondString java source for Jaro is below. Note that it divides the # of transpositions in half as well at a prior point in the code.

    double dist =
                             ( common1.length()/((double)str1.length()) +
                                     common2.length()/((double)str2.length()) +
                                     (common1.length()-transpositions)/((double)common1.length()) ) / 3.0;

    Can you shed some light on this?

     
    • ReverendSam
      ReverendSam
      2006-10-31

      The Jaro and Jaro Winkler versions are somewhat of a mess, as although the original paper lists the formula as described on my website, it has 'evolved' over time.

      If you compare the implementations inside various common databases for example they all return minorly discrepiant results.

      There are as far as I know at least three different implementations of the Jaro and Jaro-Winkler algorythms - The one selected in the code is the most commonly reported and seemingly used, I guess second String also made the same choice?

      Ideally it would be nice to implement all versions of the code and add each into the library but I was unsure if this would be even more confusing. If you provide a corrected version of the code with a new name say JaroOriginal in .NET and java I would be happy to add it into the repository.