Menu

Meta-blocking Efficiency

Gerard
2014-10-15
2014-10-16
  • Gerard

    Gerard - 2014-10-15

    Hi George,

    I recently came across a research report by Tobias Ammann of the University of Zurich. URL at
    http://www.ifi.uzh.ch/dbtg/teaching/thesesarch/ReportAmmannFA.pdf.

    The title of the report is "Applying Meta-Blocking to Improve Efficiency in Entity Resolution". The abstract for the report is as follows:

    This report compares two implementations of meta-blocking in
    terms of runtime and memory usage, and measures the accuracy of
    meta-blocking using a subset of the Musicbrainz database. We find
    that the implementation using a reversed index is more efficient than
    the naive implementation. Furthermore, we find that the dataset in
    its current form is unsuitable for meta-blocking, due to incomplete
    records and the presence of high-frequency tokens, which cause both
    implementations to approach O(n2) runtime and memory consumption
    (n being the number of records).
    

    The last sentence in the above abstract caught my attention. The author seems to imply that meta-blocking may not be useful if the dataset is sparse or has high-frequency tokens.

    Can you provide your insights into meta-blocking efficiency with real-world datasets please. I thought removing the oversized blocks fixes the high-frequency tokens issue.

    Thank you.

    Gerard

     
  • gpapadis

    gpapadis - 2014-10-16

    Hi Gerard,

    apparently the authors of the report did not apply Block Purging before using Meta-blocking. Block Purging is supposed to remove the high frequency terms and is indispensable in workflows involving meta-blocking. You can check the original paper for the run-time of all meta-blocking techniques over a large blocking graph (http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6487505&tag=1). In a couple of months, I will release two new methods that accelerate the run-time of meta-blocking by 3 to 4 times, of course at a small cost in recall.

    Regarding the sparseness of the dataset, meta-blocking relies heavily on redundant comparisons between entities. The more sparse a dataset is, the higher is the impact of meta-blocking on its recall. However, my impression is that sparse datasets yield much fewer comparisons than non-sparse ones and, thus, do not need meta-blocking. Applying Block Purging and Comparison Propagation should suffice.

    Hope this helps.

    Best regards,
    George

     

Log in to post a comment.

Want the latest updates on software, tech news, and AI?
Get latest updates about software, tech news, and AI from SourceForge directly in your inbox once a month.