Hi George,
The MetaBlockingOnDisk.java class we worked on last week failed on a large Dirty ER dataset of 800K records: the JVM ran out of memory. Raising the -Xmx parameter to 7GB did not fix the problem.
The program uses Meta-Blocking with WEP and the EJS weighting scheme. The ComparisonBasedPurging step worked fine; the failure occurs in the EdgePruning step:
EdgePruning ep = new EdgePruning(scheme);
ep.applyProcessing(blocks);
Any thoughts on how to fix the problem?
Thanks.
Gerard
Hi Gerard,
This is a common problem with WEP and WNP. The reason is that both methods retain a large number of comparisons over large datasets. There are two ways to address it:
1) If you just want to count the retained comparisons and the identified duplicates, use the classes MetaBlocking.OnTheFlyEdgePruning and MetaBlocking.OnTheFlyNodePruning, respectively.
2) If you want to execute the retained comparisons with an entity matching method, use MetaBlocking.EdgePruningIntegratedMatching and MetaBlocking.NodePruningIntegratedMatching, respectively. Currently, they both employ Jaccard similarity, but you can replace it with another entity matching technique.
None of these classes stores any comparisons in memory, because that is practically impossible for large datasets. The only way to store the retained comparisons is to write them to disk, which will be very time-consuming. I have no code for this, because I never had to store so many comparisons explicitly.
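To illustrate the general idea of pruning without holding comparisons in memory, here is a minimal, self-contained sketch. It is not the code of the classes named above: the class OnTheFlyPruningSketch, its Edge type, and the pruneAndMatch method are hypothetical names invented for this example. It assumes the pruning threshold is the mean edge weight and that the edges can be iterated twice (in practice they would be recomputed from the blocks on each pass rather than read from a list). Each retained comparison is counted and handed to a pluggable matcher right away, with Jaccard similarity over token sets as one possible matcher.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.BiPredicate;

// Hypothetical sketch: none of these names come from the Meta-Blocking code base.
final class OnTheFlyPruningSketch {

    /** One weighted comparison between two entity ids. */
    static final class Edge {
        final int left;
        final int right;
        final double weight;

        Edge(int left, int right, double weight) {
            this.left = left;
            this.right = right;
            this.weight = weight;
        }
    }

    /** Jaccard similarity of two token sets: intersection size / union size. */
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0; // convention chosen for this sketch
        }
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        int unionSize = a.size() + b.size() - common.size();
        return (double) common.size() / unionSize;
    }

    /**
     * Makes two passes over the edges and keeps only counters in memory.
     * Returns {retained comparisons, comparisons accepted by the matcher}.
     */
    static long[] pruneAndMatch(Iterable<Edge> edges, BiPredicate<Integer, Integer> matcher) {
        double weightSum = 0.0;
        long edgeCount = 0;
        for (Edge e : edges) { // pass 1: mean weight (assumed pruning threshold)
            weightSum += e.weight;
            edgeCount++;
        }
        if (edgeCount == 0) {
            return new long[] {0, 0};
        }
        double threshold = weightSum / edgeCount;

        long retained = 0;
        long matched = 0;
        for (Edge e : edges) { // pass 2: prune, then match immediately
            if (e.weight >= threshold) {
                retained++;
                if (matcher.test(e.left, e.right)) {
                    matched++;
                }
            }
        }
        return new long[] {retained, matched};
    }

    public static void main(String[] args) {
        // Toy entity profiles, represented as token sets indexed by entity id.
        List<Set<String>> tokens = new ArrayList<>();
        tokens.add(new HashSet<>(Arrays.asList("john", "smith", "nyc")));
        tokens.add(new HashSet<>(Arrays.asList("john", "smith", "ny")));
        tokens.add(new HashSet<>(Arrays.asList("mary", "jones")));

        List<Edge> edges = Arrays.asList(
                new Edge(0, 1, 0.9), new Edge(0, 2, 0.1), new Edge(1, 2, 0.2));

        long[] counts = pruneAndMatch(edges,
                (l, r) -> jaccard(tokens.get(l), tokens.get(r)) >= 0.5);
        System.out.println("retained=" + counts[0] + ", matched=" + counts[1]);
    }
}
```

Swapping in another entity matching technique only means passing a different BiPredicate; the pruning loop itself stays unchanged.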
Hope this helps.
Best regards,
George
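The reply above notes that the disk is the only place to keep the retained comparisons and that no code exists for it. Purely as an illustration of what that could look like, the following sketch streams each retained comparison to a binary file as two 4-byte entity ids and later reads the file back sequentially. Everything here is an assumption made for the example: the class ComparisonSpillSketch, its Writer, the 8-bytes-per-pair file layout, and the use of int ids are not taken from the Meta-Blocking code base.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: spills retained comparisons to disk instead of the heap.
final class ComparisonSpillSketch {

    /** Appends (leftId, rightId) pairs to a file, 8 bytes per comparison. */
    static final class Writer implements AutoCloseable {
        private final DataOutputStream out;

        Writer(Path file) throws IOException {
            // Buffering keeps the number of system calls low for millions of small writes.
            this.out = new DataOutputStream(
                    new BufferedOutputStream(Files.newOutputStream(file)));
        }

        void write(int leftId, int rightId) throws IOException {
            out.writeInt(leftId);
            out.writeInt(rightId);
        }

        @Override
        public void close() throws IOException {
            out.close();
        }
    }

    /** Streams the file back pair by pair and returns how many pairs it holds. */
    static long countPairs(Path file) throws IOException {
        long pairs = 0;
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            while (true) {
                try {
                    in.readInt(); // left id
                } catch (EOFException endOfFile) {
                    return pairs; // clean end: no partial pair pending
                }
                in.readInt(); // right id; a matcher would consume the pair here
                pairs++;
            }
        }
    }

    /** Writes n dummy comparisons to a temporary file and reads them back. */
    static long roundTrip(int n) {
        try {
            Path file = Files.createTempFile("retained-comparisons", ".bin");
            try {
                try (Writer writer = new Writer(file)) {
                    for (int i = 0; i < n; i++) {
                        writer.write(i, i + 1);
                    }
                }
                return countPairs(file);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("pairs read back: " + roundTrip(1000));
    }
}
```

At 8 bytes per pair the heap footprint stays constant, but the I/O cost grows with the number of retained comparisons, which is consistent with the warning that this route is time-consuming.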
Hi George,
Thanks again. Helps quite a bit.
Quick question: would Top-K Edges or K-Nearest Entities be an option, assuming these approaches do not have the same memory constraints as WEP and WNP and their effectiveness is comparable?
Of the four meta-blocking approaches, which are the most effective? Are there any guidelines on which ones to use for Dirty ER?
Best,
Gerard
Last edit: Anonymous 2014-10-23
Hi George,
I reviewed your paper "Meta-Blocking: Taking Entity Resolution to the Next Level" and Section 4.2 addresses my previous question.
Best,
Gerard
Hi Gerard,
It is true that Top-K Edges and K-Nearest Entities have much lower memory requirements than the other two meta-blocking methods. In my experience, K-Nearest Entities in conjunction with the EJS weighting scheme usually performs best, or very close to the best approach. This applies to both Clean-Clean and Dirty ER.
Best regards,
George
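One way to see why a k-nearest style of pruning needs little memory: if only the k highest-weighted neighbors of an entity are kept, a bounded min-heap of size k is enough, no matter how many candidate neighbors stream past. The sketch below shows that data structure only. The class TopKNeighborsSketch and its methods are hypothetical names for this example, and it makes no claim about how K-Nearest Entities is actually implemented in the Meta-Blocking code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch: keeps the k highest-weighted neighbors of one entity.
final class TopKNeighborsSketch {

    /** A candidate neighbor and the weight of the edge leading to it. */
    static final class Neighbor {
        final int id;
        final double weight;

        Neighbor(int id, double weight) {
            this.id = id;
            this.weight = weight;
        }
    }

    private final int k;
    // Min-heap on weight: the weakest retained neighbor sits at the head.
    private final PriorityQueue<Neighbor> heap;

    TopKNeighborsSketch(int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        this.k = k;
        this.heap = new PriorityQueue<Neighbor>(
                k, Comparator.comparingDouble((Neighbor n) -> n.weight));
    }

    /** Offers one candidate; memory use never exceeds k entries. */
    void offer(int neighborId, double weight) {
        if (heap.size() < k) {
            heap.add(new Neighbor(neighborId, weight));
        } else if (weight > heap.peek().weight) {
            heap.poll(); // evict the weakest retained neighbor
            heap.add(new Neighbor(neighborId, weight));
        }
    }

    int size() {
        return heap.size();
    }

    /** Retained neighbor ids, ordered from highest to lowest weight. */
    List<Integer> retainedIds() {
        List<Neighbor> sorted = new ArrayList<>(heap);
        sorted.sort(Comparator.comparingDouble((Neighbor n) -> n.weight).reversed());
        List<Integer> ids = new ArrayList<>();
        for (Neighbor n : sorted) {
            ids.add(n.id);
        }
        return ids;
    }

    public static void main(String[] args) {
        TopKNeighborsSketch nearest = new TopKNeighborsSketch(2);
        nearest.offer(1, 0.2);
        nearest.offer(2, 0.9);
        nearest.offer(3, 0.5);
        nearest.offer(4, 0.1);
        System.out.println("retained neighbor ids: " + nearest.retainedIds());
    }
}
```

With one such heap per entity, memory grows with k times the number of entities rather than with the total number of comparisons in the blocks.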