
Good metric to perform fuzzy matching without considering word order

ellsworth
2014-03-06
2014-12-07
  • ellsworth

    ellsworth - 2014-03-06

    I'm looking for a good metric (cosine, Chapman, Jaccard, Jaro, Dice, etc.) for fuzzy matching of strings that does not take word order into account. I'm also open to using a combination of metrics.

    For example:

    'john rambo' == 'jovn rambo'
    'john rambo' == 'rambo jovn'
    'john rambo' == 'john rambo x'
    'john rambo the vietnam veteran' == 'john rambo the vietnam us veteran'
    

    but

    'john kerry' != 'john rambo'
    

    I'm aiming to detect similar strings that differ by a typo, a single added letter, or an added word. For the added-word case, the strings being compared should be long enough that they can still reasonably be called similar with an extra word in one of them.

     
  • mpkorstanje

    mpkorstanje - 2014-12-07

    Hi.

    What you want sounds like the Simon White metric. We have it over at the GitHub repository.

    https://github.com/nickmancol/simmetrics

    On GitHub, in my own fork, I've also got an experimental build that lets you use the word-2gram tokenizer with Dice, which gives a very comparable result.
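
To make the suggestion concrete, here is a minimal Python sketch of the letter-pair idea usually associated with Simon White: split each string into words, collect the adjacent character pairs of every word, and take the Dice coefficient of the two multisets of pairs. This is not the SimMetrics code; the function names `letter_pairs` and `pair_dice` are made up for illustration, and lowercasing plus whitespace splitting are assumptions about preprocessing.

```python
from collections import Counter


def letter_pairs(text):
    """Multiset of adjacent character pairs, collected word by word.

    Pairs never span a space, so the order of the words does not matter.
    Single-character words contribute no pairs.
    """
    pairs = Counter()
    for word in text.lower().split():
        for i in range(len(word) - 1):
            pairs[word[i:i + 2]] += 1
    return pairs


def pair_dice(a, b):
    """Dice coefficient over the letter-pair multisets of two strings."""
    pairs_a, pairs_b = letter_pairs(a), letter_pairs(b)
    total = sum(pairs_a.values()) + sum(pairs_b.values())
    if total == 0:
        return 0.0  # neither string has a word of two or more characters
    shared = sum((pairs_a & pairs_b).values())  # multiset intersection
    return 2.0 * shared / total


if __name__ == "__main__":
    reference = "john rambo"
    for candidate in ("jovn rambo", "rambo jovn", "john rambo x", "john kerry"):
        print(f"{candidate!r}: {pair_dice(reference, candidate):.3f}")
```

On the examples from the question, this sketch scores 'jovn rambo' and 'rambo jovn' identically (about 0.71 against 'john rambo'), while 'john kerry' scores about 0.43, so a cutoff somewhere between those values would separate them. Note that a one-letter extra word such as 'x' adds no pairs and is therefore ignored entirely; any threshold would need tuning on real data.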

     
