I'm looking for a good metric (cosine, Chapman, Jaccard, Jaro, Dice, etc.) for fuzzy matching of strings that ignores word order. I'm open to using a combination of metrics as well.
For example:
'john rambo' == 'jovn rambo'
'john rambo' == 'rambo jovn'
'john rambo' == 'john rambo x'
'john rambo the vietnam veteran' == 'john rambo the vietnam us veteran'
but
'john kerry' != 'john rambo'
I'm aiming to detect similar strings that differ by a typo, a single added letter, or an added word (for the added word, the strings being compared should be long enough that they can still reasonably be called similar despite the extra word in one of them).
Hi.
What you want sounds like the Simon White metric. We have it over at the GitHub repository:
https://github.com/nickmancol/simmetrics
On GitHub, in my own fork, I've also got an experimental build that lets you use the word-2gram tokenizer with Dice, which gives a very comparable result.
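
To make the idea concrete, here is a rough Python sketch of the approach usually attributed to Simon White: split each string into words, take the character pairs (2-grams) inside each word, and compute a Dice coefficient over the two collections of pairs. This is only an illustration, not the SimMetrics code or its API; the function names `word_bigrams` and `bigram_dice` are made up for this post, and lower-casing the input is my own assumption.

```python
from collections import Counter


def word_bigrams(text):
    """Return a multiset of character 2-grams, taken word by word.

    Words are split on whitespace, so no 2-gram spans a word boundary
    and the order of the words does not affect the result.
    """
    pairs = Counter()
    for word in text.lower().split():
        for i in range(len(word) - 1):
            pairs[word[i:i + 2]] += 1
    return pairs


def bigram_dice(a, b):
    """Dice coefficient over the per-word character 2-grams of a and b."""
    pairs_a, pairs_b = word_bigrams(a), word_bigrams(b)
    total = sum(pairs_a.values()) + sum(pairs_b.values())
    if total == 0:
        # Neither string has a word of two or more characters;
        # fall back to a plain comparison (an assumption of this sketch).
        return 1.0 if a.strip().lower() == b.strip().lower() else 0.0
    # Counter intersection keeps the minimum count of each shared pair.
    shared = sum((pairs_a & pairs_b).values())
    return 2.0 * shared / total


if __name__ == "__main__":
    reference = "john rambo"
    for other in ("jovn rambo", "rambo jovn", "john rambo x", "john kerry"):
        print(f"{reference!r} vs {other!r}: {bigram_dice(reference, other):.2f}")
```

On the examples from the question this scores 'john rambo' against 'jovn rambo' and 'rambo jovn' at about 0.71, the two 'vietnam veteran' strings at about 0.98, and 'john rambo' against 'john kerry' at about 0.43, so a cut-off somewhere around 0.6 would separate them (that threshold is a guess you would need to tune on your own data). Note that a one-letter word such as the trailing 'x' produces no pairs at all, so 'john rambo x' scores 1.0 here.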