Hi George,
I recently came across a research report by Tobias Ammann of the University of Zurich, available at
http://www.ifi.uzh.ch/dbtg/teaching/thesesarch/ReportAmmannFA.pdf
The title of the report is "Applying Meta-Blocking to Improve Efficiency in Entity Resolution". The abstract for the report is as follows:
The last sentence in the above abstract caught my attention. The author seems to imply that meta-blocking may not be useful if the dataset is sparse or has high-frequency tokens.
Could you please share your insights into the efficiency of meta-blocking on real-world datasets? I thought that removing the oversized blocks fixes the high-frequency token issue.
Thank you.
Gerard
Hi Gerard,
Apparently, the author of the report did not apply Block Purging before using meta-blocking. Block Purging is supposed to remove the high-frequency tokens and is indispensable in workflows involving meta-blocking. You can check the original paper for the run-time of all meta-blocking techniques over a large blocking graph (http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6487505&tag=1). In a couple of months, I will release two new methods that speed up meta-blocking by 3 to 4 times, of course at a small cost in recall.
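To illustrate why purging matters, here is a minimal Python sketch of token blocking followed by a simplified form of Block Purging. It is only an illustration under stated assumptions: the function names, the toy entities and the fixed `max_block_size` threshold are all hypothetical, and a real implementation would normally derive the threshold from the block collection rather than hard-code it.

```python
from collections import defaultdict


def token_blocking(entities):
    """Build one block per token: token -> set of entity ids."""
    blocks = defaultdict(set)
    for entity_id, text in entities.items():
        for token in text.lower().split():
            blocks[token].add(entity_id)
    # A block holding a single entity yields no comparisons, so drop it.
    return {tok: ids for tok, ids in blocks.items() if len(ids) > 1}


def purge_blocks(blocks, max_block_size):
    """Drop oversized blocks (hypothetical fixed threshold, for illustration)."""
    return {tok: ids for tok, ids in blocks.items() if len(ids) <= max_block_size}


def count_comparisons(blocks):
    """Total pairwise comparisons, counting redundant ones across blocks."""
    return sum(len(ids) * (len(ids) - 1) // 2 for ids in blocks.values())


# Toy data: the token "the" occurs in every description.
entities = {
    1: "the apple iphone 5",
    2: "the iphone 5 by apple",
    3: "the samsung galaxy s4",
    4: "the galaxy s4 samsung",
    5: "the nokia lumia",
}

blocks = token_blocking(entities)
purged = purge_blocks(blocks, max_block_size=3)

print(count_comparisons(blocks), count_comparisons(purged))
```

In this toy collection the single block for "the" accounts for most of the comparisons, so removing it shrinks the blocking graph considerably before any meta-blocking step runs.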
Regarding the sparseness of the dataset, meta-blocking relies heavily on redundant comparisons between entities. The sparser a dataset is, the higher the impact of meta-blocking on its recall. However, my impression is that sparse datasets yield far fewer comparisons than non-sparse ones and, thus, do not need meta-blocking. Applying Block Purging and Comparison Propagation should suffice.
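The reliance on redundancy can be seen in a small Python sketch. It assumes one common configuration, namely weighting each pair of entities by the number of blocks they share and discarding the pairs whose weight is below the mean; the function names and the toy blocks are hypothetical and only serve as an illustration. Collapsing the repeated pairs into distinct ones is what Comparison Propagation achieves, although here it is done naively by enumerating all pairs.

```python
from collections import Counter
from itertools import combinations


def blocking_graph(blocks):
    """Weight every distinct entity pair by the number of blocks it shares."""
    weights = Counter()
    for ids in blocks.values():
        for pair in combinations(sorted(ids), 2):
            weights[pair] += 1
    return weights


def prune_below_mean(weights):
    """Keep only the pairs whose weight reaches the mean edge weight."""
    if not weights:
        return {}
    mean = sum(weights.values()) / len(weights)
    return {pair: w for pair, w in weights.items() if w >= mean}


# Toy block collection (token -> entity ids), assumed to be already purged.
blocks = {
    "apple": {1, 2},
    "iphone": {1, 2, 3},
    "5": {1, 2},
    "case": {3, 4},
    "galaxy": {4, 5},
    "s4": {4, 5},
}

weights = blocking_graph(blocks)   # distinct pairs only
kept = prune_below_mean(weights)   # pairs that survive pruning

print(dict(weights))
print(kept)
```

If every pair shared just one block, as tends to happen in a sparse dataset, all weights would be equal and this kind of pruning would have nothing to distinguish likely matches from unlikely ones, while the list of distinct pairs would already be short.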
Hope this helps.
Best regards,
George