Contents of DirtyERDatasets/RealDatasets (10 items):

Name           Modified
cora           2015-05-28
cddb           2015-05-28
restaurant     2015-05-28
census         2015-05-28
dbpedia        2015-03-02
movies         2015-03-01
dblp-scholar   2015-03-01
dblp-acm       2015-03-01
amazon-gp      2015-03-01
abt-buy        2015-03-01

We provide several datasets as benchmarks for testing the performance of our framework. They are divided into three categories.

i) The first category comprises 6 real-world datasets that pertain to Clean-Clean ER. 
ii) The second category comprises 6 real-world datasets that pertain to Dirty ER. Each of them corresponds to one of the Clean-Clean ER datasets and was created by merging the individual duplicate-free entity collections into a single dirty one.
iii) The third category comprises 7 synthetic census datasets that pertain to Dirty ER. 
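
The merging step described in category ii) can be illustrated with a minimal sketch. It assumes that entities are identified by their position in a list and that the second collection's ids are shifted by the size of the first; the README does not state how ids are actually assigned, so this convention, the class name DirtyMergeSketch, and both method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration; not part of the framework.
final class DirtyMergeSketch {

    // Concatenates two duplicate-free collections into one dirty collection.
    // Positions in the returned list serve as the new entity ids (assumption).
    static <T> List<T> merge(List<T> first, List<T> second) {
        List<T> merged = new ArrayList<>(first);
        merged.addAll(second);
        return merged;
    }

    // Rewrites each Clean-Clean pair (id in first, id in second) so that the
    // second id points into the merged list: it is shifted by first.size().
    static Set<List<Integer>> shiftGroundTruth(Set<List<Integer>> pairs, int offset) {
        Set<List<Integer>> shifted = new HashSet<>();
        for (List<Integer> pair : pairs) {
            shifted.add(Arrays.asList(pair.get(0), pair.get(1) + offset));
        }
        return shifted;
    }
}
```

Under this convention, a match between entity 1 of the first collection and entity 0 of the second becomes the pair (1, 2) when the first collection holds two entities.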

The size of the datasets varies from a few thousand entities to several million. Their exact technical characteristics can be found in the attached Excel file. In every case, the entity profiles are in the form of a List<EntityProfile> Java object, while the ground truth is in the form of a HashSet<IdDuplicates> Java object.
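
As a rough sketch of how such objects might be read, the helper below assumes the files were written with standard Java serialization (ObjectOutputStream); the README does not say so explicitly. The class name SerializedObjectIO and its methods are hypothetical, and the demo round-trips a plain list of strings so that it runs without the framework on the classpath.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; not part of the framework.
final class SerializedObjectIO {

    // Reads a single serialized object from the given file.
    static Object load(String path) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(new FileInputStream(path)))) {
            return in.readObject();
        }
    }

    // Writes a single serializable object to the given file.
    static void store(Object obj, String path) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(new FileOutputStream(path)))) {
            out.writeObject(obj);
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in data: strings instead of EntityProfile instances.
        File tmp = File.createTempFile("profiles", ".ser");
        tmp.deleteOnExit();
        List<String> original = new ArrayList<>(Arrays.asList("profile-1", "profile-2"));
        store(original, tmp.getPath());

        @SuppressWarnings("unchecked")
        List<String> restored = (List<String>) load(tmp.getPath());
        System.out.println(restored.size() + " objects restored");
    }
}
```

With the framework's EntityProfile and IdDuplicates classes on the classpath, the result of load would instead be cast to List<EntityProfile> or HashSet<IdDuplicates>. The cast is unchecked, so a mismatch between the file and the expected type only surfaces when the elements are accessed.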

You may freely use the datasets and code for research purposes, provided that you acknowledge the authors with the following reference: 

George Papadakis, Georgia Koutrika, Themis Palpanas, and Wolfgang Nejdl. Meta-Blocking: Taking Entity Resolution to the Next Level. In IEEE Transactions on Knowledge and Data Engineering (TKDE), volume 26, number 8, pp. 1946-1960 (2014).
Source: README.txt, updated 2016-09-09