JVM out of memory on EdgePruning step with a large dataset of 800,000 records

Gerard
2014-10-22
2014-10-24
  • Gerard

    Gerard - 2014-10-22

    Hi George,

    The MetaBlockingOnDisk.java class we worked on last week failed on a large Dirty ER dataset of 800K records: the JVM ran out of memory. Raising the -Xmx parameter to 7GB did not fix the problem.

    The program uses Meta-Blocking with the WEP pruning method and the EJS weighting scheme. The ComparisonBasedPurging step worked fine; the failure occurs in the EdgePruning step.

                EdgePruning ep = new EdgePruning(scheme);
                ep.applyProcessing(blocks);
    

    Any thoughts on how to fix the problem?

    Thanks.

    Gerard

     
  • Anonymous

    Anonymous - 2014-10-23

    Hi Gerard,

    this is a common problem with WEP and WNP. The reason is that both methods retain a large number of comparisons over large datasets. There are two ways to address it:

    1) If you just want to count the retained comparisons and the identified duplicates, use the classes MetaBlocking.OnTheFlyEdgePruning and MetaBlocking.OnTheFlyNodePruning, respectively.

    2) If you want to execute the retained comparisons with an entity matching method, use MetaBlocking.EdgePruningIntegratedMatching and MetaBlocking.NodePruningIntegratedMatching, respectively. Currently, they both employ the Jaccard similarity, but you can replace it with another entity matching technique.

    None of these classes stores any comparison in memory, because that is practically impossible with large datasets. The only way to store the retained comparisons is to write them to disk, which is very time-consuming. I have no code for this, because I never had to store so many comparisons explicitly.

    Hope this helps.

    Best regards,
    George
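
A minimal Java sketch of the on-the-fly idea behind option 1, for readers who want to see why it avoids the out-of-memory error. This is not the code of MetaBlocking.OnTheFlyEdgePruning: the class `OnTheFlyWepSketch`, its methods, and the `int[][]` block representation are hypothetical, and the edge weight is the plain Jaccard similarity of the two entities' block lists rather than EJS. It assumes Dirty ER, entity ids from 0 to n-1, and each entity appearing at most once per block. The edges are regenerated in two passes (one to compute the mean weight, one to count the edges at or above it), so no comparison is ever stored.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not part of the meta-blocking library.
final class OnTheFlyWepSketch {

    /** Callback invoked once per distinct edge (i < j). */
    interface EdgeVisitor {
        void visit(int i, int j, double weight);
    }

    /** For each entity id, the ids of the blocks that contain it. */
    static List<List<Integer>> entityIndex(int[][] blocks, int noOfEntities) {
        List<List<Integer>> index = new ArrayList<>();
        for (int i = 0; i < noOfEntities; i++) {
            index.add(new ArrayList<>());
        }
        for (int b = 0; b < blocks.length; b++) {
            for (int e : blocks[b]) {
                index.get(e).add(b);
            }
        }
        return index;
    }

    /** Generates every distinct edge exactly once without storing it. */
    static void forEachEdge(int[][] blocks, int n, EdgeVisitor visitor) {
        List<List<Integer>> index = entityIndex(blocks, n);
        int[] common = new int[n];               // shared-block counters, reused per entity
        List<Integer> neighbors = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            neighbors.clear();
            for (int b : index.get(i)) {
                for (int j : blocks[b]) {
                    if (j <= i) continue;        // visit each pair once, skip self
                    if (common[j]++ == 0) neighbors.add(j);
                }
            }
            for (int j : neighbors) {
                int shared = common[j];
                // Jaccard similarity of the two block lists (assumption: stands in for EJS).
                double weight = (double) shared
                        / (index.get(i).size() + index.get(j).size() - shared);
                visitor.visit(i, j, weight);
                common[j] = 0;                   // reset for the next entity
            }
        }
    }

    /** Counts the edges whose weight is at least the mean edge weight. */
    static long countRetained(int[][] blocks, int n) {
        double[] sum = {0.0};
        long[] edges = {0};
        forEachEdge(blocks, n, (i, j, w) -> { sum[0] += w; edges[0]++; });
        if (edges[0] == 0) {
            return 0;
        }
        double mean = sum[0] / edges[0];
        long[] retained = {0};
        forEachEdge(blocks, n, (i, j, w) -> { if (w >= mean) retained[0]++; });
        return retained[0];
    }
}
```

For the three blocks {0,1,2}, {0,1} and {2,3}, `countRetained(blocks, 4)` visits four edges and retains only the edge between entities 0 and 1. Memory use grows with the block collection plus one counter per entity, not with the number of retained comparisons; the price is that every edge weight is computed twice.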

     
  • Anonymous

    Anonymous - 2014-10-23

    Hi George,

    Thanks again. Helps quite a bit.

    Quick question: is using Top-K Edges or K-Nearest Entities an option, assuming these approaches do not have the same memory constraints as WEP and WNP and their effectiveness is comparable?

    Of the four meta-blocking approaches, which are the most effective? Are there any guidelines on which approach to use for Dirty ER?

    Best,

    Gerard

     

    Last edit: Anonymous 2014-10-23
  • Anonymous

    Anonymous - 2014-10-23

    Hi George,

    I reviewed your paper "Meta-Blocking: Taking Entity Resolution to the Next Level" and Section 4.2 addresses my previous question.

    Best,

    Gerard

     
  • Anonymous

    Anonymous - 2014-10-24

    Hi Gerard,

    it is true that Top-K Edges and K-Nearest Entities have much lower memory requirements than the other two meta-blocking methods. From my experience, K-Nearest Entities in conjunction with the EJS weighting scheme usually performs the best (or very close to the best approach). This applies both to Clean-Clean and Dirty ER.

    Best regards,
    George
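
A rough Java sketch of why a top-k method can get by with little memory: if only the k highest-weighted comparisons of an entity are kept, a bounded min-heap never holds more than k items, however many candidates are seen. This is only an illustration under that assumption, not the library's K-Nearest Entities implementation; the class `BoundedTopKSketch`, the `Neighbor` holder, and the array-based input are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration, not part of the meta-blocking library.
final class BoundedTopKSketch {

    /** One weighted comparison with a neighboring entity. */
    static final class Neighbor {
        final int id;
        final double weight;

        Neighbor(int id, double weight) {
            this.id = id;
            this.weight = weight;
        }
    }

    /** Returns the ids of the k heaviest neighbors, heaviest first. */
    static List<Integer> topK(int[] ids, double[] weights, int k) {
        List<Integer> result = new ArrayList<>();
        if (k <= 0) {
            return result;
        }
        // Min-heap on weight: the head is the weakest neighbor kept so far.
        PriorityQueue<Neighbor> heap =
                new PriorityQueue<>((a, b) -> Double.compare(a.weight, b.weight));
        for (int n = 0; n < ids.length; n++) {
            if (heap.size() < k) {
                heap.add(new Neighbor(ids[n], weights[n]));
            } else if (weights[n] > heap.peek().weight) {
                heap.poll();                     // evict the weakest to stay at size k
                heap.add(new Neighbor(ids[n], weights[n]));
            }
        }
        while (!heap.isEmpty()) {
            result.add(heap.poll().id);          // ascending by weight
        }
        Collections.reverse(result);             // heaviest first
        return result;
    }
}
```

With neighbor ids {10, 11, 12, 13}, weights {0.2, 0.9, 0.5, 0.7} and k = 2, `topK` returns [11, 13]. The heap size is capped at k per entity, in contrast to WEP and WNP, whose number of retained comparisons depends on the data.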

     
