
#14 jaro halflen should be max(len1,len2) / 2 - 1

closed-fixed
None
5
2011-12-14
2011-07-07
Sok Ann Yap
No

In stringcmp.py, jaro halflen is currently set to:

halflen = max(len1,len2) / 2 + 1

According to http://en.wikipedia.org/wiki/Jaro–Winkler_distance, it should be:

halflen = max(len1,len2) / 2 - 1

With the test case "chunkumwong" and "ckwong", both python-Levenshtein and lingpipe return 0.4797979797979798, while febrl returns 0.7373737373737372. Changing the plus to a minus makes febrl return the same score as the other two.
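To make the effect of the window size concrete, here is a minimal sketch of a plain Jaro similarity in which only the sign of the halflen term varies. It is not febrl's actual code from stringcmp.py: the function name `jaro` and the `offset` parameter are made up for illustration, and it uses `//` because the `/` in the lines quoted above only behaves as integer division under Python 2.

```python
def jaro(s1, s2, offset=-1):
    """Plain Jaro similarity (no Winkler prefix bonus).

    offset is a hypothetical knob for this sketch only:
    -1 gives halflen = max(len1, len2) // 2 - 1 (the proposed fix),
    +1 gives halflen = max(len1, len2) // 2 + 1 (the reported behaviour).
    """
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    halflen = max(max(len1, len2) // 2 + offset, 0)

    # Mark characters that agree and lie within halflen positions of each other.
    matched1 = [False] * len1
    matched2 = [False] * len2
    for i, ch in enumerate(s1):
        lo = max(0, i - halflen)
        hi = min(len2, i + halflen + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                break

    common = sum(matched1)
    if common == 0:
        return 0.0

    # Matched characters that appear in a different order count as
    # half a transposition each.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) / 2

    return (common / len1 + common / len2
            + (common - transpositions) / common) / 3


print("%.4f" % jaro("chunkumwong", "ckwong", offset=-1))  # 0.4798
print("%.4f" % jaro("chunkumwong", "ckwong", offset=+1))  # 0.7374
```

With the minus sign the window is 4, so only "c", "n" and "k" are matched (one transposition). With the plus sign the window grows to 6, "w", "o" and "g" are matched as well, and the score rises to the value febrl reports.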

Discussion

  • Sok Ann Yap

    Sok Ann Yap - 2011-07-07

    Note that python-Levenshtein also has a bug in its halflen calculation:

    https://github.com/miohtama/python-Levenshtein/issues/1

     
  • Sok Ann Yap

    Sok Ann Yap - 2011-07-12

    Just found this line in comparison.py:

    halflen = max(len1,len2) / 2 -1 # Or + 1 ?? PC 3/11/2006

    It looks like comparison.py and stringcmp.py each have their own implementation of Jaro-Winkler...

     
  • Peter Christen

    Peter Christen - 2011-12-14
    • status: open --> closed-fixed
     
  • Peter Christen

    Peter Christen - 2011-12-14

    This has been fixed, thanks.

     
  • Peter Christen

    Peter Christen - 2011-12-14
    • assigned_to: nobody --> christenp
     
