CleanCleanERDatasets

Name                  Modified
DBPedia               2015-03-03
MoviesUpdated         2014-07-01
AmazonGoogleProducts  2014-07-01
DblpAcm               2014-07-01
DblpGoogleScholar     2014-07-01
AbtBuy                2014-07-01

We provide several datasets as benchmarks for testing the performance of our framework. They are divided into three categories.

i) The first category comprises 6 real-world datasets that pertain to Clean-Clean ER. 
ii) The second category comprises 6 real-world datasets that pertain to Dirty ER. Each of them corresponds to one of the Clean-Clean ER datasets and was created by merging the individual duplicate-free entity collections into a single dirty one.
iii) The third category comprises 7 synthetic census datasets that pertain to Dirty ER. 

The size of the datasets varies from a few thousand entities to several million. Their exact technical characteristics can be found in the attached Excel file. In every case, the entity profiles are in the form of a List<EntityProfile> Java object, while the ground truth is in the form of a HashSet<IdDuplicates> Java object.
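
The following is a minimal sketch of how such objects could be read back, assuming the files were written with standard Java serialization (ObjectOutputStream); this README does not state the on-disk format, so treat that as an assumption. The class DatasetLoader and its methods loadSerialized and storeSerialized are hypothetical helpers, not part of the framework. To stay self-contained, the demo round-trips stand-in String collections; with the real files, the framework's EntityProfile and IdDuplicates classes would have to be on the classpath and the result cast to List<EntityProfile> or HashSet<IdDuplicates>.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of the BlockingFramework code base.
public class DatasetLoader {

    // Reads a single object from a file, assuming standard Java serialization.
    public static Object loadSerialized(Path file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            return in.readObject();
        }
    }

    // Writes a single serializable object to a file (used here only for the demo).
    public static void storeSerialized(Object object, Path file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            out.writeObject(object);
        }
    }

    @SuppressWarnings("unchecked")
    public static void main(String[] args) throws Exception {
        // Stand-ins for List<EntityProfile> and HashSet<IdDuplicates>.
        List<String> profiles = new ArrayList<>(Arrays.asList("profile-0", "profile-1"));
        Set<String> groundTruth = new HashSet<>(Arrays.asList("0-1"));

        Path profilesFile = Files.createTempFile("profiles", ".ser");
        Path groundTruthFile = Files.createTempFile("groundTruth", ".ser");
        try {
            storeSerialized(profiles, profilesFile);
            storeSerialized(groundTruth, groundTruthFile);

            // With the real datasets, these casts would target
            // List<EntityProfile> and HashSet<IdDuplicates> instead.
            List<String> loadedProfiles = (List<String>) loadSerialized(profilesFile);
            Set<String> loadedGroundTruth = (Set<String>) loadSerialized(groundTruthFile);

            System.out.println("Entity profiles: " + loadedProfiles.size());
            System.out.println("Duplicate pairs: " + loadedGroundTruth.size());
        } finally {
            Files.deleteIfExists(profilesFile);
            Files.deleteIfExists(groundTruthFile);
        }
    }
}
```

Deserialization resolves classes by their fully qualified names, so the real files can only be read when the same class definitions that were used to write them are available at run time.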

You may freely use the datasets and code for research purposes, provided that you acknowledge the authors with the following reference: 

George Papadakis, Georgia Koutrika, Themis Palpanas, and Wolfgang Nejdl. Meta-Blocking: Taking Entity Resolution to the Next Level. In IEEE Transactions on Knowledge and Data Engineering (TKDE), volume 26, number 8, pp. 1946-1960 (2014).
Source: README.txt, updated 2016-09-09