deduper Wiki

A simple tool to merge duplicates in customer records

Brought to you by: vumaasha

Deduper

Deduper is a simple command-line Java tool for removing duplicate customer records. Duplicate customer records can be very tricky to detect. They suffer from problems such
as abbreviated addresses, typos, and multiple possible representations of the same name and address.

For example:

John Street 23
John st. 23

Both refer to the same address.

Similarly, in the example below both names refer to the same thing, but there is a typo as well as an abbreviation:

Alphan Majar
Alp. Major

Even with powerful computers, it is difficult to identify these duplicates. Deduper uses a modified blocking, nearest-neighbor-based clustering approach to identify possible duplicates.
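To make the idea more concrete, here is a minimal sketch of radius-based clustering inside a single block. It is not Deduper's actual code: the use of Levenshtein edit distance, the lower-casing step, the greedy "first representative within radius" rule, and the class and method names are all assumptions made for illustration. It also assumes the records have already been split into blocks, and it does not model the blocksize parameter.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; not Deduper's actual implementation.
class RadiusClusteringSketch {

    // Levenshtein edit distance: the minimum number of single-character
    // insertions, deletions and substitutions needed to turn a into b.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j characters of b
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i characters of a into ""
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Greedy clustering inside one block (assumed strategy): each address
    // joins the first cluster whose first member is within `radius` edits
    // (ignoring case); otherwise it starts a new cluster.
    static List<List<String>> cluster(List<String> block, int radius) {
        List<List<String>> clusters = new ArrayList<>();
        for (String address : block) {
            List<String> home = null;
            for (List<String> candidate : clusters) {
                String representative = candidate.get(0).toLowerCase();
                if (levenshtein(representative, address.toLowerCase()) <= radius) {
                    home = candidate;
                    break;
                }
            }
            if (home == null) {
                home = new ArrayList<>();
                clusters.add(home);
            }
            home.add(address);
        }
        return clusters;
    }

    public static void main(String[] args) {
        // Hypothetical block of addresses sharing one bucket.
        List<String> block = List.of("John Street 23", "John st. 23", "Elm Avenue 7");
        System.out.println(cluster(block, 5));
    }
}
```

Under these assumptions, a larger radius tolerates more typos and abbreviations but also raises the risk of merging records that are genuinely different.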

Usage

 venki@venki-Studio-1535:~/javaworkspace/deduper/build$ java -jar deduper.jar 
 Deduper
 ===================================
 USAGE : java -Xmx2G -jar deduper.jar <customer_csv> <blocksize> <radius>

 customer_csv : csv file, format: customer_id|postalCode|concatenatedAddress
 blocksize    : used for bucketing, default: 6
 radius       : allowed differences between addresses, default: 5

The customer_csv file must contain three fields, separated by the pipe character ('|'):

customer_id|postalCode|concatenatedAddress

customer_id: current unique identifier of the customer record

postalCode: postal code of the customer address

concatenatedAddress: concatenated address fields; for readability, use "\t" to separate the individual address fields.
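The input format above can be produced with a few lines of Java. This is only a sketch: the class and method names are hypothetical, the sample values are made up, and it assumes that "\t" means a real tab character and that no field may itself contain a pipe.

```java
import java.util.List;

// Illustrative helper for building one line of the input file.
class CustomerRowSketch {

    // Joins the address fields with tabs and builds
    // customer_id|postalCode|concatenatedAddress.
    static String toRow(String customerId, String postalCode, List<String> addressFields) {
        String address = String.join("\t", addressFields);
        String row = customerId + "|" + postalCode + "|" + address;
        // Assumption: a pipe inside a field would corrupt the row, so reject it.
        long pipes = row.chars().filter(ch -> ch == '|').count();
        if (pipes != 2) {
            throw new IllegalArgumentException("Fields must not contain '|': " + row);
        }
        return row;
    }

    public static void main(String[] args) {
        // Made-up sample record.
        System.out.println(toRow("42", "10115", List.of("John Street 23", "Berlin")));
    }
}
```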

Example

java -jar deduper.jar customer_merge.csv 6 4

Output

The output contains only possible duplicates. It is saved in a file called clusters_<input_file>, e.g. clusters_customer_merge.csv

The fields of the output file are new_customer_id|old_customer_id|address

new_customer_id: possible duplicates are assigned the same new_customer_id.
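One way to consume the output file is to group the old customer ids by the new id in the first column. The sketch below assumes the file has no header row and that only the first two pipes act as separators; the class and method names and the sample ids in the usage note are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative reader for the clusters_<input_file> output.
class ClusterOutputSketch {

    // Maps each id in the first column to the old customer ids that share it,
    // preserving the order in which ids first appear.
    static Map<String, List<String>> groupByNewId(List<String> lines) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            // Limit of 3 keeps any further pipes inside the address column.
            String[] parts = line.split("\\|", 3);
            if (parts.length < 3) {
                throw new IllegalArgumentException("Expected 3 pipe-separated fields: " + line);
            }
            groups.computeIfAbsent(parts[0], key -> new ArrayList<>()).add(parts[1]);
        }
        return groups;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java ClusterOutputSketch clusters_customer_merge.csv
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        groupByNewId(lines).forEach((newId, oldIds) ->
                System.out.println(newId + " <- " + oldIds));
    }
}
```

For instance, the made-up lines `1|17|John Street 23` and `1|42|John st. 23` would be grouped under id `1` with old ids `17` and `42`, which can then be reviewed before any records are actually merged.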