Good metric to perform fuzzy matching without considering word order

ellsworth
2014-03-06
2014-12-07
  • ellsworth
    2014-03-06

    I'm looking for a good metric (cosine, Chapman, Jaccard, Jaro, Dice, etc.) for fuzzy string matching that ignores word order. I'm also open to using a combination of metrics.

    For example:

    'john rambo' == 'jovn rambo'
    'john rambo' == 'rambo jovn'
    'john rambo' == 'john rambo x'
    'john rambo the vietnam veteran' == 'john rambo the vietnam us veteran'
    

    but

    'john kerry' != 'john rambo'
    

    I'm aiming to detect similar strings when there is a typo, or when a single letter or word has been added. (For an added word, the strings being compared should be long enough that they can still reasonably be called similar despite the extra word.)

     
  • mpkorstanje
    2014-12-07

    Hi.

    What you want sounds like Simon White's metric. We have it over at the GitHub repository.

    https://github.com/nickmancol/simmetrics

    On GitHub, in my own fork, I've also got an experimental build that lets you use the word-2gram tokenizer with Dice, which gives a very comparable result.
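
For readers who want to see why this kind of metric ignores word order, here is a minimal Python sketch of the idea usually attributed to Simon White: split each string into words, collect the adjacent letter pairs of every word, and take the Dice coefficient of the two pair collections. This is not the SimMetrics code; the function names `letter_pairs` and `white_similarity` are made up for illustration, and lowercasing plus whitespace splitting are assumptions of this sketch.

```python
from collections import Counter


def letter_pairs(text):
    """Count adjacent letter pairs inside each word (hypothetical helper).

    Pairs never span a word boundary, so the order of the words
    does not affect the result.
    """
    pairs = Counter()
    for word in text.lower().split():
        for i in range(len(word) - 1):
            pairs[word[i:i + 2]] += 1
    return pairs


def white_similarity(a, b):
    """Dice coefficient over the letter pairs of both strings, in [0, 1]."""
    pairs_a, pairs_b = letter_pairs(a), letter_pairs(b)
    total = sum(pairs_a.values()) + sum(pairs_b.values())
    if total == 0:
        # Neither string has a word of two or more letters;
        # fall back to a plain comparison (an assumption of this sketch).
        return 1.0 if a.strip().lower() == b.strip().lower() else 0.0
    # Counter & Counter keeps the minimum count of each shared pair.
    shared = sum((pairs_a & pairs_b).values())
    return 2.0 * shared / total


if __name__ == "__main__":
    reference = "john rambo"
    for candidate in ("jovn rambo", "rambo jovn", "john rambo x", "john kerry"):
        print(candidate, round(white_similarity(reference, candidate), 3))
```

With the strings from the question, 'jovn rambo' and 'rambo jovn' get the same score (5 shared pairs out of 7 + 7, about 0.71), while 'john kerry' scores about 0.43, so a threshold somewhere between the two would separate them. Note that a one-letter word such as 'x' contributes no pairs at all in this sketch, so 'john rambo x' scores 1.0.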