Hi George,
I modified the SyntheticDatasetsOnDisk.java class to add Meta-Blocking and Block Post-Processing steps, and ran into the following error:
Weighting scheme : CBS
Exception in thread "main" java.lang.ClassCastException: DataStructures.DecomposedBlock cannot be cast to DataStructures.BilateralBlock
    at EfficiencyLayer.BlockRefinement.SizeBasedBlockPurging.getMaxInnerBlockSize(SizeBasedBlockPurging.java:55)
    at EfficiencyLayer.BlockRefinement.SizeBasedBlockPurging.applyProcessing(SizeBasedBlockPurging.java:32)
    at Experiments.MetaBlockingOnDisk.main(MetaBlockingOnDisk.java:70)
Java Result: 1
It appears that I need to post this message and only then can attach the source code and log file in a separate posting. Strange.
Question: Should I select only one WeightingScheme?
When I selected only one WeightingScheme (e.g. ARCS), the next error I got was:
The entity index is incompatible with a set of decomposed blocks!
Its functionalities can be carried out with same efficiency through a linear search of all comparisons!
Exception in thread "main" java.lang.NullPointerException
    at DataStructures.EntityIndex.isRepeated(EntityIndex.java:242)
    at EfficiencyLayer.ComparisonRefinement.ComparisonPropagation.applyProcessing(ComparisonPropagation.java:46)
    at Experiments.MetaBlockingOnDisk4.main(MetaBlockingOnDisk4.java:86)
Java Result: 1
The above error is coming from the lines:
ComparisonPropagation cp = new ComparisonPropagation();
cp.applyProcessing(blocks);
The major issue I see is that the number of blocks does not change even after the SizeBasedBlockPurging, EdgePruning and ComparisonsBasedBlockPurging steps. I expected several blocks to be purged because the dataset has a lot of duplicates.
Could you please review the code and let me know if I am making any obvious mistake? My objective is to do a simple deduplication of a Dirty ER dataset.
Best Regards,
Gerard
Last edit: Anonymous 2014-10-16
Hi George,
Attached are the source file and log file for the issue I posted above. As I reviewed the code again, it appears I am missing the final comparison for deduplication.
Looking at one of your experiments and the attached gif file, is the final step of the ER process to call the method ExecuteBlockComparisons and comparisonExecution? Should I compare the remaining blocks after the 6th step against the original list of EntityProfiles?
It appears ExecuteBlockComparisons is the critical step and the one that causes most of the CPU load. Please confirm. Is there a way to parallelize this step using MapReduce?
Thanks,
Gerard
Hi Gerard,
I am sending you an updated class that should do the job. In the one you sent me, you applied ComparisonsBasedBlockPurging after meta-blocking. That's not correct. There are no high-frequency terms/blocks to prune after meta-blocking. Block Purging should always be applied before meta-blocking. In the image you have uploaded, Block Purging appears in the third step only to simplify the structure of the workflow.
Also, you applied Comparison Propagation after meta-blocking. This is not necessary. Meta-blocking inherently removes all redundant comparisons, as the blocking graph has no parallel edges. It makes sense to apply Comparison Propagation only in the case of node-centric pruning algorithms.
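To illustrate why a graph without parallel edges contains no redundant comparisons, here is a minimal, self-contained sketch. It is not the framework's code: the class BlockingGraphSketch, its methods, and the representation of a block as an int[] of entity ids are all assumptions made for this example. Two entities that co-occur in several blocks end up as a single edge, and the number of shared blocks is kept only as an illustrative edge weight.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BlockingGraphSketch {
    // Encode an unordered pair of entity ids as one long, so (a,b) and (b,a) coincide.
    // Assumes non-negative ids (hypothetical representation).
    static long edgeKey(int a, int b) {
        int lo = Math.min(a, b);
        int hi = Math.max(a, b);
        return ((long) lo << 32) | hi;
    }

    // Comparisons implied by the raw blocks in Dirty ER: all pairs inside each block.
    static long totalComparisons(List<int[]> blocks) {
        long total = 0;
        for (int[] block : blocks) {
            long n = block.length;
            total += n * (n - 1) / 2;
        }
        return total;
    }

    // One edge per distinct pair; the value counts the blocks the pair shares.
    static Map<Long, Integer> buildEdges(List<int[]> blocks) {
        Map<Long, Integer> edges = new HashMap<>();
        for (int[] block : blocks) {
            for (int i = 0; i < block.length; i++) {
                for (int j = i + 1; j < block.length; j++) {
                    edges.merge(edgeKey(block[i], block[j]), 1, Integer::sum);
                }
            }
        }
        return edges;
    }

    public static void main(String[] args) {
        List<int[]> blocks = List.of(new int[] {0, 1, 2}, new int[] {1, 2}, new int[] {0, 1});
        System.out.println("comparisons in blocks: " + totalComparisons(blocks)); // 5
        System.out.println("edges in graph: " + buildEdges(blocks).size());       // 3
    }
}
```

In this toy input the pairs (0,1) and (1,2) each appear in two blocks, so the blocks imply 5 comparisons while the graph holds only 3 edges.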
In addition, the method SizeBasedBlockPurging is applicable only to Clean-Clean ER. The dataset you used is a Dirty ER one. In any case, though, ComparisonsBasedBlockPurging should be preferred over SizeBasedBlockPurging.
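As a rough idea of what purging by number of comparisons means for Dirty ER, consider the simplified sketch below. It is not the ComparisonsBasedBlockPurging implementation: the class PurgeOversizedBlocks, the int[] block representation, and the hand-picked maxComparisons threshold are all assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

public class PurgeOversizedBlocks {
    // Comparisons inside a Dirty ER block of n entities: every unordered pair.
    static long comparisons(int[] block) {
        long n = block.length;
        return n * (n - 1) / 2;
    }

    // Keep only the blocks that entail at most maxComparisons comparisons.
    // maxComparisons is a hypothetical, hand-picked threshold.
    static List<int[]> purge(List<int[]> blocks, long maxComparisons) {
        List<int[]> kept = new ArrayList<>();
        for (int[] block : blocks) {
            if (comparisons(block) <= maxComparisons) {
                kept.add(block);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<int[]> blocks = List.of(
                new int[] {0, 1},                    // 1 comparison
                new int[] {2, 3, 4},                 // 3 comparisons
                new int[] {0, 1, 2, 3, 4, 5, 6, 7}); // 28 comparisons
        System.out.println("blocks kept: " + purge(blocks, 10).size()); // 2
    }
}
```

The oversized block, which would typically come from a very frequent token, is dropped before any graph is built, which is why this step belongs ahead of meta-blocking.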
Finally, the workflow in the image corresponds to Clean-Clean ER and is not directly applicable to the dataset you are using. For Dirty ER, it suffices to apply Token Blocking + Block Purging + Meta-blocking + Comparison Propagation when necessary. The workflow in the image includes some methods that are useful for Clean-Clean ER, but they are optimistic, yielding higher performance than you would get in practice. I therefore suggest using the Dirty ER workflow: Token Blocking + Block Purging + Meta-blocking + Comparison Propagation when necessary.
Hope this helps.
Best regards,
George
I am sending you an updated file, because the previous one throws an exception.
The problem was that it tried to apply the second meta-blocking method to the outcome of the first one. That's why I moved the extraction of the blocks inside the loop.
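The pattern behind this fix can be sketched as follows. This is only an illustration under stated assumptions, not the actual class: FreshBlocksPerScheme, loadBlocks and restructureInPlace are hypothetical stand-ins, and the "meta-blocking" step here merely replaces each block with its pairwise comparisons so that the effect on the next iteration is visible.

```java
import java.util.ArrayList;
import java.util.List;

public class FreshBlocksPerScheme {
    // Hypothetical stand-in for the weighting schemes being looped over.
    enum Scheme { ARCS, CBS }

    // Stand-in for extracting the blocks: returns a fresh, mutable list on every call.
    static List<int[]> loadBlocks() {
        List<int[]> blocks = new ArrayList<>();
        blocks.add(new int[] {0, 1});
        blocks.add(new int[] {1, 2});
        blocks.add(new int[] {0, 1, 2, 3});
        return blocks;
    }

    // Stand-in for a step that restructures its input in place: every block is
    // replaced by its individual pairs. The scheme is ignored in this toy version.
    static void restructureInPlace(List<int[]> blocks, Scheme scheme) {
        List<int[]> pairs = new ArrayList<>();
        for (int[] block : blocks) {
            for (int i = 0; i < block.length; i++) {
                for (int j = i + 1; j < block.length; j++) {
                    pairs.add(new int[] {block[i], block[j]});
                }
            }
        }
        blocks.clear();
        blocks.addAll(pairs);
    }

    // Blocks extracted once, outside the loop: the second scheme sees the first one's output.
    static List<Integer> inputSizesShared() {
        List<Integer> sizes = new ArrayList<>();
        List<int[]> blocks = loadBlocks();
        for (Scheme scheme : Scheme.values()) {
            sizes.add(blocks.size());
            restructureInPlace(blocks, scheme);
        }
        return sizes;
    }

    // Blocks extracted inside the loop: every scheme starts from the original blocks.
    static List<Integer> inputSizesFresh() {
        List<Integer> sizes = new ArrayList<>();
        for (Scheme scheme : Scheme.values()) {
            List<int[]> blocks = loadBlocks();
            sizes.add(blocks.size());
            restructureInPlace(blocks, scheme);
        }
        return sizes;
    }

    public static void main(String[] args) {
        System.out.println("shared: " + inputSizesShared()); // [3, 8]
        System.out.println("fresh:  " + inputSizesFresh());  // [3, 3]
    }
}
```

With the shared list, the second scheme receives 8 already-decomposed pairs instead of the 3 original blocks, which mirrors the situation that led to the exception.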