Using DiskBased TokenBlocking

Gerard
2014-10-14
2014-10-15
  • Gerard

    Gerard - 2014-10-14

    Hi,

    I tried InMemoryExperiments.java today with the Dirty ER datasets. I had no issues with 10,000 profiles, but my laptop ran out of memory with 50,000 profiles.

    My laptop only has 6GB of memory, so I would like to continue testing the framework using the disk-based TokenBlocking as an alternative.

    Initially I thought I could simply change "import EffectivenessLayer.MemoryBased.TokenBlocking;" to "import EffectivenessLayer.DiskBased.TokenBlocking;", but the memory-based and disk-based TokenBlocking classes are very different.

    Could you let me know how to adapt InMemoryExperiments.java into a DiskBasedExperiments.java? Specifically, how should I replace the following snippet of code:

    TokenBlocking imtb = new TokenBlocking(entityProfiles, null);
    List<AbstractBlock> blocks = imtb.buildBlocks();
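    For context, the core idea behind Token Blocking is simple: every distinct token appearing in the profiles' attribute values defines a block containing all profiles that mention it. The following is a minimal, self-contained sketch of that idea only; the class and method names (TokenBlockingSketch, buildBlocks) are illustrative and are not the framework's actual classes, which operate on EntityProfile objects rather than plain strings.

```java
import java.util.*;

public class TokenBlockingSketch {

    // Build an inverted index from token to the ids of profiles containing it.
    // Each map entry is one block: the profiles that share that token.
    static Map<String, Set<Integer>> buildBlocks(List<String> profiles) {
        Map<String, Set<Integer>> blocks = new HashMap<>();
        for (int id = 0; id < profiles.size(); id++) {
            for (String token : profiles.get(id).toLowerCase().split("\\W+")) {
                if (token.isEmpty()) continue;
                blocks.computeIfAbsent(token, t -> new HashSet<>()).add(id);
            }
        }
        // Blocks containing a single profile yield no comparisons; drop them.
        blocks.values().removeIf(ids -> ids.size() < 2);
        return blocks;
    }

    public static void main(String[] args) {
        List<String> profiles = Arrays.asList(
                "John Smith London",
                "J. Smith UK London",
                "Mary Jones Paris");
        Map<String, Set<Integer>> blocks = buildBlocks(profiles);
        // Profiles 0 and 1 end up together in the "smith" and "london" blocks.
        System.out.println(blocks);
    }
}
```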
    

    Thank you.

    Gerard

     
  • gpapadis

    gpapadis - 2014-10-15

    Hi Gerard,

    Thanks for using my project.

    I just uploaded two new classes that demonstrate the differences between the memory- and disk-based blocking methods. You can find them in the package Experiments, under the names SyntheticDatasetsInMemory and SyntheticDatasetsOnDisk, respectively. You should be able to run the experiments on your laptop simply by changing the variable mainDirectory so that it points to the directory where the entity profiles and the ground truth are stored.

    In general, the disk-based methods require as input an index path for every collection of input entity profiles. Hence, for Dirty ER you only need to call DiskBased.TokenBlocking once, while for Clean-Clean ER you should call it twice. Then you use the class ExportBlocks to retrieve the blocks that are stored in the Lucene index. For Dirty ER, the second argument of the constructor should be null, while for Clean-Clean ER you should give both index paths as input.
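    Based on this description, the Dirty ER flow would look roughly like the pseudocode sketch below. Beyond the TokenBlocking and ExportBlocks class names mentioned above, the constructor signatures, the index path, and the getBlocks-style accessor are assumptions; check SyntheticDatasetsOnDisk for the real usage.

```java
// Pseudocode sketch of the disk-based Dirty ER flow described above.
// Signatures and names marked "hypothetical" are assumptions, not the real API.
String indexPath = mainDirectory + "indexDirtyER"; // hypothetical index location

// Dirty ER has a single entity collection, so TokenBlocking is called once,
// writing its blocks to a Lucene index at indexPath.
TokenBlocking dbtb = new TokenBlocking(entityProfiles, indexPath);

// Retrieve the blocks from the Lucene index; for Dirty ER the second
// index path is null (for Clean-Clean ER, pass both index paths).
ExportBlocks exporter = new ExportBlocks(indexPath, null);
List<AbstractBlock> blocks = exporter.getBlocks(); // hypothetical accessor name
```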

    Note, though, that 4GB of RAM (specified through the -Xmx parameter) suffices for applying Token Blocking to all synthetic datasets except for the 2M one (my laptop runs Windows 8).
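    For reference, the heap ceiling is set when launching the JVM. The classpath and fully qualified class name below are illustrative, not the project's actual build layout:

```
# Allow the JVM a 4 GB heap, as suggested above; adjust -cp to your build output.
java -Xmx4g -cp build/classes Experiments.SyntheticDatasetsOnDisk
```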

    Let me know if you encounter any other issues.

    Best regards,
    George

     
  • Gerard

    Gerard - 2014-10-15

    Hi George,

    Thanks for going out of your way to create the two new experiments. They are very helpful, and I really appreciate your efforts.

    I ran the SyntheticDatasetsOnDisk experiment without any issues. Performance was excellent (my laptop does have an SSD). Attached is the output of the experiment.

    I do have several other questions regarding Meta-Blocking and I will create separate threads for those questions.

    Thanks again for your help.

    Best Regards,

    Gerard

     
