Using DiskBased TokenBlocking

Gerard
2014-10-14
2014-10-15
  • Gerard

    Gerard - 2014-10-14

    Hi,

    I tried InMemoryExperiments.java today with the Dirty ER datasets. I had no issues with 10,000 profiles, but my laptop ran out of memory with 50,000 profiles.

    My laptop only has 6GB of memory, so I would like to continue testing the framework using the disk-based TokenBlocking as an alternative.

    Initially I thought I could simply change "import EffectivenessLayer.MemoryBased.TokenBlocking;" to "import EffectivenessLayer.DiskBased.TokenBlocking;", but the memory-based and disk-based TokenBlocking classes are very different.

    Could you let me know how to adapt InMemoryExperiments.java into a DiskBasedExperiments.java? Specifically, how should I replace the following snippet of code:

    TokenBlocking imtb = new TokenBlocking(entityProfiles, null);
    List<AbstractBlock> blocks = imtb.buildBlocks();
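    For context, the core idea behind Token Blocking is simple: every distinct token appearing in the profiles' attribute values defines a block containing all profiles that mention it. The following is a minimal, self-contained sketch of that idea only; the class and method names (TokenBlockingSketch, buildBlocks) are illustrative and are not the framework's actual classes, which operate on EntityProfile objects rather than plain strings.

```java
import java.util.*;

public class TokenBlockingSketch {

    // Build an inverted index from token to the ids of profiles containing it.
    // Each map entry is one block: the profiles that share that token.
    static Map<String, Set<Integer>> buildBlocks(List<String> profiles) {
        Map<String, Set<Integer>> blocks = new HashMap<>();
        for (int id = 0; id < profiles.size(); id++) {
            for (String token : profiles.get(id).toLowerCase().split("\\W+")) {
                if (token.isEmpty()) continue;
                blocks.computeIfAbsent(token, t -> new HashSet<>()).add(id);
            }
        }
        // Blocks containing a single profile yield no comparisons; drop them.
        blocks.values().removeIf(ids -> ids.size() < 2);
        return blocks;
    }

    public static void main(String[] args) {
        List<String> profiles = Arrays.asList(
                "John Smith London",
                "J. Smith UK London",
                "Mary Jones Paris");
        Map<String, Set<Integer>> blocks = buildBlocks(profiles);
        // Profiles 0 and 1 end up together in the "smith" and "london" blocks.
        System.out.println(blocks);
    }
}
```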
    

    Thank you.

    Gerard

     
  • gpapadis

    gpapadis - 2014-10-15

    Hi Gerard,

    Thanks for using my project.

    I just uploaded two new classes that demonstrate the differences between the memory- and disk-based blocking methods. You can find them in the package Experiments, under the names SyntheticDatasetsInMemory and SyntheticDatasetsOnDisk, respectively. You should be able to run the experiments on your laptop simply by changing the variable mainDirectory so that it points to the directory where the entity profiles and the ground truth are stored.

    In general, the disk-based methods require as input an index path for every collection of input entity profiles. Hence, for Dirty ER you only need to call DiskBased.TokenBlocking once, while for Clean-Clean ER you should call it twice. Then you use the class ExportBlocks to retrieve the blocks that are stored in the Lucene index. For Dirty ER, the second argument of the constructor should be null, while for Clean-Clean ER you should give both index paths as input.
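    Based on this description, the Dirty ER flow would look roughly like the pseudocode sketch below. Beyond the TokenBlocking and ExportBlocks class names mentioned above, the constructor signatures, the index path, and the getBlocks-style accessor are assumptions; check SyntheticDatasetsOnDisk for the real usage.

```java
// Pseudocode sketch of the disk-based Dirty ER flow described above.
// Signatures and names marked "hypothetical" are assumptions, not the real API.
String indexPath = mainDirectory + "indexDirtyER"; // hypothetical index location

// Dirty ER has a single entity collection, so TokenBlocking is called once,
// writing its blocks to a Lucene index at indexPath.
TokenBlocking dbtb = new TokenBlocking(entityProfiles, indexPath);

// Retrieve the blocks from the Lucene index; for Dirty ER the second
// index path is null (for Clean-Clean ER, pass both index paths).
ExportBlocks exporter = new ExportBlocks(indexPath, null);
List<AbstractBlock> blocks = exporter.getBlocks(); // hypothetical accessor name
```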

    Note, though, that 4GB of RAM (specified through the -Xmx parameter) suffices for applying Token Blocking to all synthetic datasets except for the 2M one (my laptop runs Windows 8).
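    For reference, the heap ceiling is set when launching the JVM. The classpath and fully qualified class name below are illustrative, not the project's actual build layout:

```
# Allow the JVM a 4 GB heap, as suggested above; adjust -cp to your build output.
java -Xmx4g -cp build/classes Experiments.SyntheticDatasetsOnDisk
```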

    Let me know if you encounter any other issues.

    Best regards,
    George

     
  • Gerard

    Gerard - 2014-10-15

    Hi George,

    Thanks for going out of your way to create the two new experiments. They are very helpful, and I really appreciate your efforts.

    I ran the SyntheticDatasetsOnDisk experiment without any issues. Performance was excellent (my laptop does have an SSD). Attached is the output of the experiment.

    I do have several other questions regarding Meta-Blocking and I will create separate threads for those questions.

    Thanks again for your help.

    Best Regards,

    Gerard

     
