I'm looking for a good metric (cosine, chapman, jaccard, jaro, dice, etc.) for fuzzy matching of strings that ignores word order. I am open to using a combination of metrics as well.
'john rambo' == 'jovn rambo'
'john rambo' == 'rambo jovn'
'john rambo' == 'john rambo x'
'john rambo the vietnam veteran' == 'john rambo the vietnam us veteran'
'john kerry' != 'john rambo'
I'm aiming to detect similar strings when there is a typo, or when a single letter or word has been added (for an added word, the strings being compared should be long enough that they can still reasonably be called similar despite the extra word in one of them).
What you want sounds like the Simon White metric. We have it over at the GitHub repository.
On GitHub, in my own fork, I've also got an experimental build that allows you to use the word-2gram tokenizer with Dice, which gives a very comparable result.
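To give a feel for why this kind of metric fits the examples in the question, here is a minimal Python sketch. It assumes the metric meant is the one from Simon White's "How to Strike a Match" article: a Dice coefficient over the character pairs of each word. The function names `word_bigrams` and `bigram_dice` are made up for this illustration and are not the library's API.

```python
from collections import Counter


def word_bigrams(text):
    """Multiset of adjacent character pairs, collected word by word.

    Pairs never span a space, so the order of the words does not matter.
    Hypothetical helper for illustration only.
    """
    pairs = Counter()
    for word in text.lower().split():
        for i in range(len(word) - 1):
            pairs[word[i:i + 2]] += 1
    return pairs


def bigram_dice(a, b):
    """Dice coefficient over the per-word character pairs of a and b."""
    pairs_a, pairs_b = word_bigrams(a), word_bigrams(b)
    total = sum(pairs_a.values()) + sum(pairs_b.values())
    if total == 0:
        # Neither string has a word of two or more characters;
        # fall back to a plain comparison (an assumption of this sketch).
        return 1.0 if a.strip().lower() == b.strip().lower() else 0.0
    # Counter & Counter is multiset intersection (minimum of the counts).
    shared = sum((pairs_a & pairs_b).values())
    return 2.0 * shared / total


print(bigram_dice('john rambo', 'jovn rambo'))
print(bigram_dice('john rambo', 'rambo jovn'))
print(bigram_dice('john rambo', 'john kerry'))
```

With this sketch, 'john rambo' against 'jovn rambo' and against 'rambo jovn' both score 5/7 (about 0.71), while 'john rambo' against 'john kerry' scores 3/7 (about 0.43), so a cut-off somewhere around 0.6 would separate the examples in the question. That threshold is only a guess to tune on your own data. Note also that single-character words such as the trailing 'x' contribute no pairs, so they are ignored entirely.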