Meta-blocking Efficiency

A framework for blocking-based Entity Resolution.

Brought to you by: gap2

Meta-blocking Efficiency

Forum: General Discussion

Creator: Gerard

Created: 2014-10-15

Updated: 2014-10-16

Gerard - 2014-10-15

Hi George,

I recently came across a research report by Tobias Ammann of the University of Zurich. URL at
http://www.ifi.uzh.ch/dbtg/teaching/thesesarch/ReportAmmannFA.pdf.

The title of the report is "Applying Meta-Blocking to Improve Efficiency in Entity Resolution". The abstract for the report is as follows:

This report compares two implementations of meta-blocking in terms of runtime and memory usage, and measures the accuracy of meta-blocking using a subset of the Musicbrainz database. We find that the implementation using a reversed index is more efficient than the naive implementation. Furthermore, we find that the dataset in its current form is unsuitable for meta-blocking, due to incomplete records and the presence of high-frequency tokens, which cause both implementations to approach O(n2) runtime and memory consumption (n being the number of records).

The last sentence in the above abstract caught my attention. The author seems to imply that meta-blocking may not be useful if the dataset is sparse or has high-frequency tokens.

Can you provide your insights into meta-blocking efficiency with real-world datasets please. I thought removing the oversized blocks fixes the high-frequency tokens issue.

Thank you.

Gerard
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

gpapadis - 2014-10-16

Hi Gerard,

apparently the authors of the report did not apply Block Purging before using Meta-blocking. Block Purging is supposed to remove the high frequency terms and is indispensable in workflows involving meta-blocking. You can check the original paper for the run-time of all meta-blocking techniques over a large blocking graph (http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6487505&tag=1). In a couple of months, I will release two new methods that accelerate the run-time of meta-blocking by 3 to 4 times, of course at a small cost in recall.

Regarding the sparseness of the dataset, meta-blocking relies heavily on redundant comparisons between entities. The more sparse a dataset is, the higher is the impact of meta-blocking on its recall. However, my impression is that sparse datasets yield much fewer comparisons than non-sparse ones and, thus, do not need meta-blocking. Applying Block Purging and Comparison Propagation should suffice.

Hope this helps.

Best regards,
George

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Log in to post a comment.