BlockingFramework / Discussion / General Discussion: JVM out of memory on EdgePruning step with a large dataset of 800,000 records

Gerard - 2014-10-22

Hi George,

The MetaBlockingOnDisk.java class we worked on last week failed on a large Dirty ER dataset of 800K records: the JVM ran out of memory. Raising the -Xmx parameter to 7GB did not fix the problem.

The program uses Meta-Blocking with the WEP pruning method and the EJS weighting scheme. The ComparisonBasedPurging step worked fine; the failure occurs in the EdgePruning step:

EdgePruning ep = new EdgePruning(scheme);
ep.applyProcessing(blocks);

Any thoughts on how to fix the problem?

Thanks.

Gerard

Anonymous - 2014-10-23

Hi Gerard,

This is a common problem with WEP and WNP. The reason is that both methods retain a large number of comparisons over large datasets. There are two ways to address it:

1) If you just want to count the retained comparisons and the identified duplicates, use the classes MetaBlocking.OnTheFlyEdgePruning and MetaBlocking.OnTheFlyNodePruning, respectively.

2) If you want to execute the retained comparisons with an entity matching method, use MetaBlocking.EdgePruningIntegratedMatching and MetaBlocking.NodePruningIntegratedMatching, respectively. Currently, they both employ the Jaccard similarity, but you can replace it with another entity matching technique.

None of these classes stores any comparison in memory, because that is practically impossible in the context of large datasets. The only way to store the retained comparisons is to use the disk, which will be very time-consuming. I have no code for this, because I never had to store so many comparisons explicitly.
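The idea of counting retained comparisons without storing them can be sketched roughly as follows. This is only an illustration, not the code of MetaBlocking.OnTheFlyEdgePruning: the class OnTheFlyWepSketch, the EdgeVisitor interface and all method names are hypothetical, blocks are assumed to be plain arrays of entity ids for Dirty ER with each entity appearing at most once per block, the plain Jaccard scheme stands in for EJS to keep it short, and an edge is assumed to be retained when its weight is at least the mean edge weight. The edges are enumerated twice (once for the mean, once for the count), so memory stays proportional to the number of entities rather than the number of comparisons.

```java
import java.util.ArrayList;
import java.util.List;

class OnTheFlyWepSketch {

    /** Receives each distinct edge (i < j) with its weight; nothing is stored. */
    interface EdgeVisitor {
        void visit(int i, int j, double weight);
    }

    /** blocksOf[e] lists the ids of the blocks that contain entity e. */
    static int[][] buildEntityIndex(int[][] blocks, int numEntities) {
        List<List<Integer>> index = new ArrayList<>();
        for (int e = 0; e < numEntities; e++) index.add(new ArrayList<>());
        for (int b = 0; b < blocks.length; b++)
            for (int e : blocks[b]) index.get(e).add(b);
        int[][] blocksOf = new int[numEntities][];
        for (int e = 0; e < numEntities; e++)
            blocksOf[e] = index.get(e).stream().mapToInt(Integer::intValue).toArray();
        return blocksOf;
    }

    /** Enumerates every edge of the blocking graph once, weighted with Jaccard. */
    static void forEachEdge(int[][] blocks, int[][] blocksOf, EdgeVisitor visitor) {
        int n = blocksOf.length;
        int[] common = new int[n];   // shared-block counters, reset after each entity
        int[] touched = new int[n];  // neighbours seen for the current entity
        for (int i = 0; i < n; i++) {
            int numTouched = 0;
            for (int b : blocksOf[i]) {
                for (int j : blocks[b]) {
                    if (j <= i) continue;  // visit each pair only from its smaller id
                    if (common[j]++ == 0) touched[numTouched++] = j;
                }
            }
            for (int t = 0; t < numTouched; t++) {
                int j = touched[t];
                double union = blocksOf[i].length + blocksOf[j].length - common[j];
                visitor.visit(i, j, common[j] / union);
                common[j] = 0;
            }
        }
    }

    /** Pass 1 computes the mean edge weight, pass 2 counts edges at or above it. */
    static long countRetained(int[][] blocks, int numEntities) {
        int[][] blocksOf = buildEntityIndex(blocks, numEntities);
        double[] sum = {0.0};
        long[] edges = {0};
        forEachEdge(blocks, blocksOf, (i, j, w) -> { sum[0] += w; edges[0]++; });
        if (edges[0] == 0) return 0;
        double mean = sum[0] / edges[0];
        long[] retained = {0};
        forEachEdge(blocks, blocksOf, (i, j, w) -> { if (w >= mean) retained[0]++; });
        return retained[0];
    }

    public static void main(String[] args) {
        int[][] blocks = { {0, 1, 2}, {0, 1}, {2, 3} };
        System.out.println(countRetained(blocks, 4)); // prints 1
    }
}
```

In the toy input, only the pair (0, 1) shares both of its blocks, so it is the only edge above the mean. Replacing the counting lambda in the second pass with a call to a matching function would give the "integrated matching" variant, still without keeping the comparisons in memory.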

Hope this helps.

Best regards,
George

Anonymous - 2014-10-23

Hi George,

Thanks again. Helps quite a bit.

Quick question: would Top-K Edges or K-Nearest Entities be an option, assuming these approaches do not have the same memory constraints as WEP and WNP and their effectiveness is comparable?

Of the 4 meta-blocking approaches, which are the most effective? Are there any guidelines on which approaches to use for Dirty ER?

Best,

Gerard

Last edit: Anonymous 2014-10-23

Anonymous - 2014-10-23

Hi George,

I reviewed your paper "Meta-Blocking: Taking Entity Resolution to the Next Level" and Section 4.2 addresses my previous question.

Best,

Gerard

Anonymous - 2014-10-24

Hi Gerard,

It is true that Top-K Edges and K-Nearest Entities have much lower memory requirements than the other two meta-blocking methods. From my experience, K-Nearest Entities in conjunction with the EJS weighting scheme usually performs best (or very close to the best approach). This applies to both Clean-Clean and Dirty ER.
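One way to see why a top-K method needs far less memory: the retained edges can be kept in a bounded min-heap, so at most K edges are alive at any time no matter how many are examined. The sketch below is an illustration under that assumption and not the framework's implementation; TopKEdgesSketch, Edge and TopK are hypothetical names, and K is simply passed in by the caller. K-Nearest Entities could be sketched the same way with one such small heap per entity.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

class TopKEdgesSketch {

    /** A weighted edge between two entity ids (hypothetical helper type). */
    static final class Edge {
        final int i, j;
        final double weight;
        Edge(int i, int j, double weight) { this.i = i; this.j = j; this.weight = weight; }
    }

    /** Keeps the k heaviest edges offered so far in a min-heap of size at most k. */
    static final class TopK {
        private final int k;
        private final PriorityQueue<Edge> heap =
                new PriorityQueue<>(Comparator.comparingDouble((Edge e) -> e.weight));

        TopK(int k) { this.k = k; }

        void offer(int i, int j, double weight) {
            if (k <= 0) return;
            if (heap.size() < k) {
                heap.add(new Edge(i, j, weight));
            } else if (weight > heap.peek().weight) {
                heap.poll();                       // evict the lightest retained edge
                heap.add(new Edge(i, j, weight));
            }
        }

        int size() { return heap.size(); }

        /** Weight of the lightest retained edge, or NaN if nothing is retained. */
        double minRetainedWeight() {
            return heap.isEmpty() ? Double.NaN : heap.peek().weight;
        }
    }

    public static void main(String[] args) {
        TopK top = new TopK(2);
        top.offer(0, 1, 1.0);
        top.offer(0, 2, 0.25);
        top.offer(1, 2, 0.25);
        top.offer(2, 3, 0.5);
        // Four edges were offered, but only the two heaviest are held in memory.
        System.out.println(top.size() + " " + top.minRetainedWeight()); // prints 2 0.5
    }
}
```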

Best regards,
George
