Please excuse my English, I'm from Germany...
I'm looking for a string similarity metric to find duplicate records in a database...
String1: firma maier & co KG
String2: KG maier firma & co KG
I need these two strings to score a match of at least 90 percent.
The strings are very similar: only the word order is different, and the second string contains one extra "KG".
Best regards, Tom
If the tokens are always identical but their order is irrelevant, then a cosine metric may be a good idea: terms are matched exactly, and the distance from a perfect match comes only from terms that are present in one string and not the other.
The metric could also be modified to ignore duplicate terms such as the repeated "KG" (this actually simplifies the algorithm).
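
To make the suggestion concrete, here is a minimal sketch of a token-based cosine similarity in Python. It is only an illustration, not the implementation of any particular library: the function name cosine_similarity, the ignore_duplicates flag, and the choice to lowercase and split on whitespace are all assumptions made for this example.

```python
import math
from collections import Counter


def cosine_similarity(a, b, ignore_duplicates=False):
    """Cosine similarity between two strings treated as bags of tokens.

    Tokenization (lowercase, split on whitespace) is an assumption
    made for this sketch; word order has no effect on the result.
    """
    tokens_a = a.lower().split()
    tokens_b = b.lower().split()
    if ignore_duplicates:
        # Treat each string as a set of terms, so a repeated "KG" counts once.
        tokens_a, tokens_b = set(tokens_a), set(tokens_b)

    count_a, count_b = Counter(tokens_a), Counter(tokens_b)

    # Dot product over the terms of the first string; a Counter returns 0
    # for terms that are missing from the second string.
    dot = sum(count_a[t] * count_b[t] for t in count_a)
    norm_sq_a = sum(c * c for c in count_a.values())
    norm_sq_b = sum(c * c for c in count_b.values())
    if norm_sq_a == 0 or norm_sq_b == 0:
        return 0.0  # at least one string has no tokens
    return dot / math.sqrt(norm_sq_a * norm_sq_b)


s1 = "firma maier & co KG"
s2 = "KG maier firma & co KG"

print(cosine_similarity(s1, s2))                          # about 0.949
print(cosine_similarity(s1, s2, ignore_duplicates=True))  # 1.0
```

With term counts, the extra "KG" lowers the score to 6 / sqrt(5 * 8), about 0.949, which is still above the 90 percent threshold. When duplicates are ignored, both strings reduce to the same five terms and the score is exactly 1.0.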