Hi George,
When processing a dirty ER dataset, we have to assume that a ground-truth dataset will not be available for comparison.
How can you get the block statistics, especially the number of duplicates detected?
Best Regards,
Gerard
Hi Gerard,
If I understand your question correctly, you can get the block statistics in the usual way, using the class Utilities.BlockStatistics. You only need to omit the methods that use the ground-truth in the variable abstractDP (which is set to null): getDuplicatesOfDecomposedBlocks and getDuplicatesWithEntityIndex.
I am not sure, though, that we can talk about detected duplicates. Blocking only returns similar entities for further processing. It is the entity matching that decides whether two entities are duplicates or not. In the best case, the number of detected duplicates is equal to the number of distinct comparisons in the resulting block collection.
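To make that upper bound concrete, here is a minimal sketch of counting the distinct comparisons in a Dirty ER block collection. It is not code from the framework: the class DistinctComparisons, its methods, and the representation of a block as an int[] of non-negative entity ids are all assumptions made for the illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Utilities.BlockStatistics.
final class DistinctComparisons {

    // Encode an unordered pair of non-negative entity ids as one long,
    // so that (a, b) and (b, a) map to the same key.
    static long pairKey(int a, int b) {
        return ((long) Math.min(a, b) << 32) | Math.max(a, b);
    }

    // Dirty ER: every pair of distinct entities inside a block is a comparison.
    // A pair that co-occurs in several blocks is counted only once.
    static long count(List<int[]> blocks) {
        Set<Long> seen = new HashSet<>();
        for (int[] block : blocks) {
            for (int i = 0; i < block.length; i++) {
                for (int j = i + 1; j < block.length; j++) {
                    if (block[i] != block[j]) {
                        seen.add(pairKey(block[i], block[j]));
                    }
                }
            }
        }
        return seen.size();
    }

    public static void main(String[] args) {
        // Two overlapping blocks; the pair (2, 3) appears in both.
        List<int[]> blocks = Arrays.asList(new int[] {1, 2, 3}, new int[] {2, 3, 4});
        System.out.println(count(blocks)); // prints 5
    }
}
```

Without a ground-truth, this count is only a ceiling on the number of duplicates that a later matching step could confirm.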
Hope this helps.
Best regards,
George
What I mean is that the ground-truth is equivalent to a perfect entity matching method. It is an oracle that decides with 100% accuracy whether two entities co-occurring in a block are matching or not. That's why we can estimate the number of detected duplicates when there is a ground-truth available. In its absence, we can only talk about highly similar entities.
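A small sketch of the oracle idea, under the assumption that the ground-truth is available as a set of duplicate pairs: a duplicate counts as detected only if its two entities co-occur in at least one block. The class OracleDuplicateCount and the int[] block representation are hypothetical and do not reflect the framework's own classes.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper illustrating the ground-truth as a perfect matcher.
final class OracleDuplicateCount {

    // Order-independent key for a pair of non-negative entity ids.
    static long pairKey(int a, int b) {
        return ((long) Math.min(a, b) << 32) | Math.max(a, b);
    }

    // Counts the distinct co-occurring pairs that the ground-truth marks as duplicates.
    static int detectedDuplicates(List<int[]> blocks, Set<Long> groundTruth) {
        Set<Long> detected = new HashSet<>();
        for (int[] block : blocks) {
            for (int i = 0; i < block.length; i++) {
                for (int j = i + 1; j < block.length; j++) {
                    long key = pairKey(block[i], block[j]);
                    if (groundTruth.contains(key)) {
                        detected.add(key);
                    }
                }
            }
        }
        return detected.size();
    }

    public static void main(String[] args) {
        List<int[]> blocks = Arrays.asList(new int[] {1, 2, 3}, new int[] {2, 3, 4});
        // (2, 3) shares a block and is detected; (1, 4) never co-occurs and is missed.
        Set<Long> groundTruth = new HashSet<>(Arrays.asList(pairKey(2, 3), pairKey(1, 4)));
        System.out.println(detectedDuplicates(blocks, groundTruth)); // prints 1
    }
}
```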
Kind regards,
George
Hi George,
Thanks for the explanation. Based on your suggestion, I made a few changes to my dedupe class, but I am still missing the PC/PQ measures in the absence of ground-truth. I guess this is where we need the measures to be based on highly similar entities.
The following lines are added to the CustProfilesDedupe class:
    final AbstractDuplicatePropagation adp = null;
    BlockStatistics blStats = new BlockStatistics(blocks, adp);
    blStats.applyProcessingDirtyER();
The following new method is added to the BlockStatistics class:
Next, I will work on a method to compute the PC/PQ measures for Dirty ER when there is no ground-truth.
Please let me know if I am on the right track.
Best Regards,
Gerard
Last edit: Gerard 2014-10-18
Hi Gerard,
I am a bit puzzled. You cannot talk about PC and PQ without knowing the ground-truth. The only surrogate for a ground-truth is to apply an entity matching method and, assuming that its output (MATCH or NON-MATCH) is always correct, to define as true positives those pairs that are marked as MATCH.
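A rough sketch of that surrogate approach, using the usual definitions PC = detected duplicates / existing duplicates and PQ = detected duplicates / executed comparisons. Everything here is an assumption for illustration: the class SurrogatePcPq, entity profiles reduced to plain strings indexed by id, and a case-insensitive equality test standing in for a real entity matching method. Note that the PC denominator requires running the matcher on every pair of entities, which is quadratic and only feasible for small datasets.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.BiPredicate;

// Hypothetical helper: treats a matcher's MATCH decisions as if they were the ground-truth.
final class SurrogatePcPq {

    // Order-independent key for a pair of non-negative entity ids.
    static long pairKey(int a, int b) {
        return ((long) Math.min(a, b) << 32) | Math.max(a, b);
    }

    // Returns {PC, PQ}. entities[i] is the profile of the entity with id i.
    static double[] estimate(String[] entities, List<int[]> blocks,
                             BiPredicate<String, String> matcher) {
        // Surrogate ground-truth: pairs marked MATCH over all entity pairs.
        long existing = 0;
        for (int i = 0; i < entities.length; i++) {
            for (int j = i + 1; j < entities.length; j++) {
                if (matcher.test(entities[i], entities[j])) existing++;
            }
        }
        // Distinct comparisons in the blocks, and those among them marked MATCH.
        Set<Long> compared = new HashSet<>();
        long detected = 0;
        for (int[] block : blocks) {
            for (int i = 0; i < block.length; i++) {
                for (int j = i + 1; j < block.length; j++) {
                    if (block[i] == block[j]) continue;
                    boolean firstTime = compared.add(pairKey(block[i], block[j]));
                    if (firstTime && matcher.test(entities[block[i]], entities[block[j]])) {
                        detected++;
                    }
                }
            }
        }
        double pc = existing == 0 ? 0.0 : (double) detected / existing;
        double pq = compared.isEmpty() ? 0.0 : (double) detected / compared.size();
        return new double[] {pc, pq};
    }

    public static void main(String[] args) {
        String[] entities = {"John Smith", "john smith", "Jane Doe", "JANE DOE", "Bob Lee"};
        List<int[]> blocks = Arrays.asList(new int[] {0, 1, 4}, new int[] {1, 2, 4});
        double[] r = estimate(entities, blocks, String::equalsIgnoreCase);
        System.out.println("PC=" + r[0] + " PQ=" + r[1]); // prints PC=0.5 PQ=0.2
    }
}
```

The resulting figures are only as trustworthy as the matcher: any pair it labels wrongly shifts both measures.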
Best regards,
George