Deduper is a simple command-line Java tool for removing duplicate customer records. Duplicate customer records can be very tricky to detect. They suffer from problems such
as abbreviated addresses, typos, and multiple possible representations of the same name and address.
For example:
Both mean the same thing.
Similarly, in the example below both refer to the same thing, but there is a typo and also an abbreviation in place:
Even with powerful computers, it is difficult to identify these duplicates. Deduper uses a modified blocking, nearest-neighbor-based clustering approach to identify possible duplicates.
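The README does not spell out how the modified algorithm works, so the following is only a minimal sketch of plain blocking combined with a nearest-neighbor check, not Deduper's actual code. It assumes that blocksize is the length of an address prefix used as the bucketing key and that radius is a maximum Levenshtein edit distance. The class and method names (BlockingSketch, blockKey, candidatePairs) are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration of blocking + edit-distance matching.
// This is NOT Deduper's implementation; key choice and distance are assumptions.
class BlockingSketch {

    // Classic Levenshtein edit distance, computed with two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumed blocking key: postal code plus the first blockSize
    // alphanumeric characters of the lower-cased address.
    static String blockKey(String postalCode, String address, int blockSize) {
        String norm = address.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]", "");
        return postalCode + "|" + norm.substring(0, Math.min(blockSize, norm.length()));
    }

    // Each record is {customer_id, postalCode, concatenatedAddress}.
    // Returns index pairs that share a block and differ by at most radius edits.
    static List<int[]> candidatePairs(List<String[]> records, int blockSize, int radius) {
        Map<String, List<Integer>> blocks = new HashMap<>();
        for (int i = 0; i < records.size(); i++) {
            String[] r = records.get(i);
            blocks.computeIfAbsent(blockKey(r[1], r[2], blockSize), k -> new ArrayList<>()).add(i);
        }
        List<int[]> pairs = new ArrayList<>();
        for (List<Integer> block : blocks.values()) {
            for (int x = 0; x < block.size(); x++) {
                for (int y = x + 1; y < block.size(); y++) {
                    int i = block.get(x);
                    int j = block.get(y);
                    if (editDistance(records.get(i)[2], records.get(j)[2]) <= radius) {
                        pairs.add(new int[] {i, j});
                    }
                }
            }
        }
        return pairs;
    }
}
```

Blocking keeps the number of pairwise comparisons small, because only records in the same bucket are compared. In this plain form, a typo inside the key prefix would place two duplicates in different buckets, and the sketch makes no attempt to handle that case.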
venki@venki-Studio-1535:~/javaworkspace/deduper/build$ java -jar deduper.jar
Deduper
===================================
USAGE : java -Xmx2G -jar deduper.jar <customer_csv> <blocksize> <radius>
customer_csv : csv file, format: customer_id|postalCode|concatenatedAddress
blocksize : used for bucketing, default: 6
radius : allowed differences between addresses, default: 5
The customer_csv file must contain three fields, separated by a pipe ('|'):
customer_id|postalCode|concatenatedAddress
customer_id: current unique identifier of the customer record
postalCode: postal code of the customer address
concatenatedAddress: concatenated address fields; for readability, use "\t" to separate the address fields.
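As an illustration of the input format above, the following sketch builds one input line from separate address fields. It assumes that "\t" means a literal tab character. Because the README does not say how a pipe inside a value should be escaped, the sketch simply rejects such values. The class and method names are hypothetical and not part of Deduper.

```java
import java.util.List;

// Hypothetical helper for producing customer_csv lines; not part of Deduper.
class InputLineSketch {

    // Builds "customer_id|postalCode|concatenatedAddress" with the
    // address fields joined by a tab (assumed meaning of "\t").
    static String toInputLine(String customerId, String postalCode, List<String> addressFields) {
        String address = String.join("\t", addressFields);
        // Escaping rules are not documented, so values containing the
        // field separator are rejected rather than guessed at.
        if (customerId.contains("|") || postalCode.contains("|") || address.contains("|")) {
            throw new IllegalArgumentException("field values must not contain '|'");
        }
        return customerId + "|" + postalCode + "|" + address;
    }
}
```

For instance, calling toInputLine with the id C001, the postal code 94105, and the fields "12 Main Street" and "San Francisco" yields a single line with the two address fields separated by a tab.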
java -jar deduper.jar customer_merge.csv 6 4
The output contains only possible duplicates and is saved in a file named clusters_<input_file>, for example clusters_customer_merge.csv.
The fields of the output file are new_customer_id|old_customer_id|address
new_customer_id: possible duplicates are assigned the same new_customer_id.
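To review the result, the output lines can be grouped by their first field. The sketch below assumes the output file has no header row and that every line has exactly the three pipe-separated fields described above; neither is confirmed by the README. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical reader for clusters_<input_file>; not part of Deduper.
class ClusterReaderSketch {

    // Maps each new_customer_id to the old_customer_id values that share it,
    // keeping the order in which the ids first appear.
    static Map<String, List<String>> groupByCluster(List<String> lines) {
        Map<String, List<String>> clusters = new LinkedHashMap<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // ignore blank lines
            }
            // Limit of 3 keeps any further '|' inside the address field intact.
            String[] parts = line.split("\\|", 3);
            if (parts.length < 3) {
                continue; // skip lines that do not match the assumed format
            }
            clusters.computeIfAbsent(parts[0], k -> new ArrayList<>()).add(parts[1]);
        }
        return clusters;
    }
}
```

The lines can be loaded with java.nio.file.Files.readAllLines(Paths.get("clusters_customer_merge.csv")) and passed to groupByCluster. Each map entry then lists the original customer ids that Deduper flagged as possible duplicates of one another.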