| Name | Modified | Size | Downloads / Week | 
|---|---|---|---|
| 2MIdDuplicates | 2015-03-02 | 24.0 MB | |
| 10KIdDuplicates | 2015-03-02 | 122.0 kB | |
| 50KIdDuplicates | 2015-03-02 | 603.1 kB | |
| 100KIdDuplicates | 2015-03-02 | 1.2 MB | |
| 200KIdDuplicates | 2015-03-02 | 2.4 MB | |
| 300KIdDuplicates | 2015-03-02 | 3.6 MB | |
| 1MIdDuplicates | 2015-03-02 | 12.0 MB | |
| Totals: 7 Items | | 44.0 MB | 0 |
We provide several datasets as benchmarks for testing the performance of our framework. They are divided into three categories. i) The first category comprises 6 real-world datasets that pertain to Clean-Clean ER. ii) The second category comprises 6 real-world datasets that pertain to Dirty ER. Each of them corresponds to one of the Clean-Clean ER datasets and was created by merging the individual duplicate-free entity collections into a single dirty one. iii) The third category comprises 7 synthetic census datasets that pertain to Dirty ER.

Dataset sizes range from a few thousand entities to several million. Their exact technical characteristics can be found in the attached Excel file. In every case, the entity profiles are in the form of a List<EntityProfile> Java object, while the ground truth is in the form of a HashSet<IdDuplicates> Java object.

You may freely use the datasets and code for research purposes, provided that you acknowledge the authors with the following reference: George Papadakis, Georgia Koutrika, Themis Palpanas, and Wolfgang Nejdl. Meta-Blocking: Taking Entity Resolution to the Next Level. In IEEE Transactions on Knowledge and Data Engineering (TKDE), volume 26, number 8, pp. 1946-1960 (2014).
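As a rough illustration of how the entity profiles and the ground truth might be read back into memory, the sketch below uses standard Java serialization (`ObjectInputStream`). It assumes that each file holds a single object written with `ObjectOutputStream`, which the description above suggests but does not state outright. The class name `SerializedDatasetLoader` and both method names are hypothetical and not part of the framework. Loading the real files also requires the framework's `EntityProfile` and `IdDuplicates` classes on the classpath; without them `readObject` fails with a `ClassNotFoundException`.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

// Hypothetical helper, not part of the framework.
final class SerializedDatasetLoader {

    // Reads the single Java-serialized object stored in the given file.
    // Assumption: the file was produced with ObjectOutputStream.writeObject.
    static Object loadSerializedObject(String path)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(new FileInputStream(path)))) {
            return in.readObject();
        }
    }

    // Writes one object with Java serialization; handy for round-trip checks.
    static void storeSerializedObject(Object object, String path)
            throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(new FileOutputStream(path)))) {
            out.writeObject(object);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.out.println("usage: java SerializedDatasetLoader <file>");
            return;
        }
        // With the framework classes on the classpath, the result could be
        // cast to List<EntityProfile> or HashSet<IdDuplicates> as appropriate.
        Object dataset = loadSerializedObject(args[0]);
        System.out.println("Loaded an instance of " + dataset.getClass().getName());
    }
}
```

With the framework on the classpath, a ground-truth file such as `10KIdDuplicates` could then be loaded with `(HashSet<IdDuplicates>) SerializedDatasetLoader.loadSerializedObject("10KIdDuplicates")`. The cast is unchecked, so the compiler will emit a warning.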