Name Modified Size
imdbProfilesNEW 2021-10-07 988.5 kB
dblpProfiles 2021-10-07 596.9 kB
buyProfiles 2021-10-07 170.5 kB
dblpAcmIdDuplicates 2021-10-07 31.3 kB
amazonWalmartIdDuplicates 2021-10-07 12.1 kB
amazonProfiles2 2021-10-07 4.6 MB
amazonProfiles 2021-10-07 1.9 MB
amazonGpIdDuplicates 2021-10-07 15.6 kB
acmProfiles 2021-10-07 537.6 kB
abtProfiles 2021-10-07 402.4 kB
abtBuyIdDuplicates 2021-10-07 15.2 kB
walmartProfiles 2021-10-07 524.4 kB
tvdbProfiles 2021-10-07 1.8 MB
tmdbTvdbIdDuplicates 2021-10-07 15.4 kB
tmdbProfiles 2021-10-07 1.6 MB
scholarProfiles 2021-10-07 12.3 MB
restaurantsIdDuplicates 2021-10-07 1.4 kB
restaurant2Profiles 2021-10-07 341.0 kB
restaurant1Profiles 2021-10-07 51.0 kB
moviesIdDuplicates 2021-10-07 320.2 kB
imdbTvdbIdDuplicates 2021-10-07 15.1 kB
imdbTmdbIdDuplicates 2021-10-07 27.7 kB
dbpediaProfiles 2021-10-07 30.3 MB
imdbProfiles 2021-10-07 6.6 MB
gpProfiles 2021-10-07 1.2 MB
dblpProfiles2 2021-10-07 572.7 kB
dblpScholarIdDuplicates 2021-10-07 32.4 kB
Totals: 27 Items, 65.0 MB
We provide several datasets as benchmarks for testing the performance of our framework. They are divided into three categories.

i) The first category comprises 6 real-world datasets that pertain to Clean-Clean ER. 
ii) The second category comprises 6 real-world datasets that pertain to Dirty ER. Each of them corresponds to one of the Clean-Clean ER datasets and was created by merging the individual duplicate-free entity collections into a single dirty one.
iii) The third category comprises 7 synthetic census datasets that pertain to Dirty ER. 
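
The merging step in category ii) can be sketched as follows. This is only an illustration under stated assumptions, not the code used to build the datasets: the class DirtyCollectionMerger and its methods are hypothetical, and the idea that a ground-truth pair is re-indexed by offsetting the second id with the size of the first collection is an assumption that this README does not confirm.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the framework.
class DirtyCollectionMerger {

    // Concatenates two duplicate-free collections into a single dirty one.
    // Elements of the second collection are appended after those of the first.
    static <T> List<T> merge(List<T> first, List<T> second) {
        List<T> merged = new ArrayList<>(first.size() + second.size());
        merged.addAll(first);
        merged.addAll(second);
        return merged;
    }

    // Maps a Clean-Clean ground-truth pair (position in first, position in second)
    // to positions in the merged list. ASSUMPTION: positions are 0-based list
    // indices and the second collection starts at offset firstSize.
    static int[] remapPair(int idInFirst, int idInSecond, int firstSize) {
        return new int[] {idInFirst, idInSecond + firstSize};
    }
}
```

Under these assumptions, a pair that links entity 2 of the first collection to entity 5 of the second becomes the pair (2, 5 + n) in the dirty collection, where n is the size of the first collection.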

The size of the datasets varies from a few thousand entities to several million. Their exact technical characteristics can be found in the attached Excel file. In every case, the entity profiles are in the form of a List<EntityProfile> Java object, while the ground truth is in the form of a HashSet<IdDuplicates> Java object.
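
A minimal sketch of reading such files, assuming they were written with standard Java serialization (ObjectOutputStream), which this README does not state explicitly. The class SerializedDatasetLoader is hypothetical and not part of the framework. For the real datasets, the framework's EntityProfile and IdDuplicates classes must be on the classpath, otherwise readObject fails with a ClassNotFoundException.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.util.Collection;

// Hypothetical helper, not part of the framework.
class SerializedDatasetLoader {

    // Reads one Java-serialized object from the given file.
    // ASSUMPTION: the file holds a single object written by ObjectOutputStream.
    static Object load(String path) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(new FileInputStream(path)))) {
            return in.readObject();
        }
    }

    // Returns the number of elements if the object is a collection, else -1.
    static int sizeOf(Object dataset) {
        return (dataset instanceof Collection) ? ((Collection<?>) dataset).size() : -1;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: SerializedDatasetLoader <profilesFile> <duplicatesFile>");
            return;
        }
        Object profiles = load(args[0]);    // expected: List<EntityProfile>
        Object duplicates = load(args[1]);  // expected: HashSet<IdDuplicates>
        System.out.println("entity profiles: " + sizeOf(profiles));
        System.out.println("duplicate pairs: " + sizeOf(duplicates));
    }
}
```

For example, passing abtProfiles and abtBuyIdDuplicates as the two arguments would print the number of loaded profiles and ground-truth pairs, provided the assumptions above hold. Once the framework classes are available, the returned objects can be cast to List<EntityProfile> and HashSet<IdDuplicates>.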

You may freely use the datasets and code for research purposes, provided that you acknowledge the authors with the following reference: 

George Papadakis, Georgia Koutrika, Themis Palpanas, and Wolfgang Nejdl. Meta-Blocking: Taking Entity Resolution to the Next Level. In IEEE Transactions on Knowledge and Data Engineering (TKDE), volume 26, number 8, pp. 1946-1960 (2014).
Source: README.txt, updated 2016-09-09