#14 jaro halflen should be max(len1,len2) / 2 - 1

closed-fixed
None
5
2011-12-14
2011-07-07
Sok Ann Yap
No

In stringcmp.py, jaro halflen is currently set to:

halflen = max(len1,len2) / 2 + 1

According to http://en.wikipedia.org/wiki/Jaro–Winkler_distance, it should be:

halflen = max(len1,len2) / 2 - 1

With the test case "chunkumwong" and "ckwong", both python-Levenshtein and lingpipe return 0.4797979797979798, while febrl returns 0.7373737373737372. Changing the plus to a minus makes febrl return the same score as the other two.
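The difference can be reproduced with a minimal sketch of plain Jaro similarity (no Winkler prefix boost). This is not the febrl code: the function name `jaro` and the `window_offset` parameter are hypothetical, introduced only to switch between the "- 1" and "+ 1" variants, and `//` is used so the integer division also holds under Python 3.

```python
def jaro(s1, s2, window_offset=-1):
    """Plain Jaro similarity (illustrative sketch, not febrl's implementation).

    window_offset is the term added to max(len1, len2) // 2:
    -1 follows the Wikipedia definition, +1 mimics the reported behaviour.
    """
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters count as matching only if they are at most halflen apart.
    halflen = max(max(len1, len2) // 2 + window_offset, 0)

    matched2 = [False] * len2   # positions in s2 already paired up
    chars1 = []                 # matched characters, in s1 order
    for i, ch in enumerate(s1):
        lo = max(0, i - halflen)
        hi = min(len2, i + halflen + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched2[j] = True
                chars1.append(ch)
                break

    m = len(chars1)
    if m == 0:
        return 0.0

    # Matched characters in s2 order; half the mismatches are transpositions.
    chars2 = [s2[j] for j in range(len2) if matched2[j]]
    t = sum(a != b for a, b in zip(chars1, chars2)) // 2

    return (m / len1 + m / len2 + (m - t) / m) / 3.0


print(jaro("chunkumwong", "ckwong", window_offset=-1))  # about 0.4798
print(jaro("chunkumwong", "ckwong", window_offset=+1))  # about 0.7374
```

With the "- 1" window (4 for these strings) only "c", "n" and "k" fall inside the window, giving 3 matches and 1 transposition. With the "+ 1" window (6) all six characters of "ckwong" match, with 2 transpositions, which accounts for the higher score.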

Discussion

  • Sok Ann Yap
    2011-07-12

    Just found this line in comparison.py:

    halflen = max(len1,len2) / 2 -1 # Or + 1 ?? PC 3/11/2006

    It looks like comparison.py and stringcmp.py each have their own implementation of Jaro-Winkler...

     
  • Peter Christen
    2011-12-14

    • status: open --> closed-fixed
     
  • Peter Christen
    2011-12-14

    This has been fixed, thanks.

     
  • Peter Christen
    2011-12-14

    • assigned_to: nobody --> christenp