
BlockStatistics when there is no Groundtruth dataset

Gerard
2014-10-17
2014-10-21
  • Gerard

    Gerard - 2014-10-17

    Hi George,

    When processing a dirty ER dataset, we have to assume that a groundtruth dataset will not be available for comparison.

    How can we get the block statistics in that case, especially the number of duplicates detected?

    Best Regards,

    Gerard

     
  • gpapadis

    gpapadis - 2014-10-17

    Hi Gerard,

    if I understand your question correctly, you can get the block statistics in the usual way, using the class Utilities.BlockStatistics. You only need to omit the methods that use the groundtruth in the variable abstractDP (which is set to null): getDuplicatesOfDecomposedBlocks and getDuplicatesWithEntityIndex.

    I am not sure, though, that we can talk about detected duplicates. Blocking only returns similar entities for further processing. It is the entity matching that decides whether two entities are duplicates or not. In the best case, the number of detected duplicates is equal to the number of distinct comparisons in the resulting block collection.

    Hope this helps.

    Best regards,
    George
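
To make the upper bound concrete, here is a minimal sketch of counting the distinct comparisons in a Dirty ER block collection. It does not use the framework's classes: representing a block as a plain `int[]` of entity ids, and the class name `DistinctComparisons`, are assumptions made only for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of the blocking framework.
class DistinctComparisons {
    // Counts the distinct unordered entity pairs that co-occur in at least one block.
    // Assumes entity ids are non-negative so that a pair fits into one long key.
    static long count(List<int[]> blocks) {
        Set<Long> pairs = new HashSet<>();
        for (int[] block : blocks) {
            for (int i = 0; i < block.length; i++) {
                for (int j = i + 1; j < block.length; j++) {
                    int lo = Math.min(block[i], block[j]);
                    int hi = Math.max(block[i], block[j]);
                    if (lo != hi) { // ignore an entity listed twice in the same block
                        pairs.add(((long) lo << 32) | hi);
                    }
                }
            }
        }
        return pairs.size();
    }

    public static void main(String[] args) {
        // Two overlapping blocks: the pair (2, 3) appears in both but counts once.
        List<int[]> blocks = Arrays.asList(new int[] {1, 2, 3}, new int[] {2, 3, 4});
        System.out.println(count(blocks)); // prints 5
    }
}
```

Without a ground truth, this count only bounds the number of duplicates from above. It says nothing about how many of those pairs really match.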

     
  • gpapadis

    gpapadis - 2014-10-17

    What I mean is that the ground-truth is equivalent to a perfect entity matching method. It is an oracle that decides with 100% accuracy whether two entities co-occurring in a block are matching or not. That's why we can estimate the number of detected duplicates when there is a ground-truth available. In its absence, we can only talk about highly similar entities.

    Kind regards,
    George
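
As a sketch of the oracle idea, the snippet below computes PC and PQ when a ground truth is available, using the usual definitions: PC is the share of ground-truth duplicates that co-occur in some block, and PQ is the share of comparisons that are true duplicates. The class `GroundTruthMeasures` and the pair encoding are hypothetical and independent of `BlockStatistics`.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of the blocking framework.
class GroundTruthMeasures {
    // Encodes an unordered pair of non-negative entity ids as a single long key.
    static long key(int a, int b) {
        return ((long) Math.min(a, b) << 32) | Math.max(a, b);
    }

    // Returns {PC, PQ} for a set of distinct comparisons and a set of true duplicate pairs.
    static double[] pcPq(Set<Long> comparisons, Set<Long> groundTruth) {
        long detected = 0;
        for (long pair : comparisons) {
            if (groundTruth.contains(pair)) { // the "oracle" lookup
                detected++;
            }
        }
        double pc = groundTruth.isEmpty() ? 0.0 : (double) detected / groundTruth.size();
        double pq = comparisons.isEmpty() ? 0.0 : (double) detected / comparisons.size();
        return new double[] {pc, pq};
    }

    public static void main(String[] args) {
        Set<Long> comparisons = new HashSet<>(
                Arrays.asList(key(1, 2), key(1, 3), key(2, 3), key(4, 5)));
        Set<Long> groundTruth = new HashSet<>(
                Arrays.asList(key(1, 2), key(4, 5), key(6, 7)));
        double[] m = pcPq(comparisons, groundTruth);
        System.out.println("PC = " + m[0] + ", PQ = " + m[1]);
    }
}
```

Both measures depend on the `groundTruth` set. With `abstractDP` set to null there is nothing to look pairs up in, which is why only similarity can be reported.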

     
  • Gerard

    Gerard - 2014-10-18

    Hi George,

    Thanks for the explanation. Based on your suggestion, I made a few changes to my dedupe class, but I am still missing the PC/PQ measures in the absence of a ground-truth. I guess this is where we need the measures to be based on highly similar entities.

    The following lines are added to the CustProfilesDedupe class:

                final AbstractDuplicatePropagation adp = null;            
                BlockStatistics blStats = new BlockStatistics(blocks, adp);
                blStats.applyProcessingDirtyER();
    

    The following new method is added to BlockStatistics class:

        public double[] applyProcessingDirtyER() {
            System.out.println("No of blocks\t:\t" + blocks.size());
    
            double[] values = new double[2];
            if (blocks.isEmpty()) {
                values[0] = 0;
                values[1] = 0;
            } else {
                double totalComparisons = getComparisonsCardinality();
                if (blocks.get(0) instanceof DecomposedBlock) {
                    System.out.println("At getDecomposedBlocksEntities...");                
                    getDecomposedBlocksEntities(totalComparisons);
                } else {
                    System.out.println("At entityIndex...");   
                    entityIndex = new EntityIndex(blocks);
                    getEntities();
                }
                getBlockingCardinality();
    //            if (blocks.get(0) instanceof DecomposedBlock) {
    //                getDuplicatesOfDecomposedBlocks(totalComparisons);
    //            } else {
    //                getDuplicatesWithEntityIndex(totalComparisons);
    //            }
    
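                // The duplicate-counting calls above are commented out, so pc is not updated here.
                values[0] = pc;
                values[1] = pc; // probably meant to hold PQ rather than PC a second time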
            }
            return values;
        }
    

    Next, I will work on a method to compute the PC/PQ measures for Dirty ER when there is no ground-truth.

    Please let me know if I am on the right track.

    Best Regards,

    Gerard

     

    Last edit: Gerard 2014-10-18
  • Anonymous

    Anonymous - 2014-10-21

    Hi Gerard,

    I am a bit puzzled. You cannot talk about PC and PQ without knowing the ground-truth. The only surrogate for a ground-truth is to apply an entity matching method and, assuming that its output (Match, Non-match) is always correct, to define as true positives those pairs that are marked as MATCH.

    Best regards,
    George
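
The surrogate approach could be sketched roughly as follows. The matcher here is an assumption chosen for brevity: token-level Jaccard similarity with a fixed threshold, applied to entities flattened into strings. The class `SurrogateMatcher` and its methods are hypothetical, and any real entity matching method could take its place.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for an entity matching method, not part of the blocking framework.
class SurrogateMatcher {
    // Splits an entity description into a set of lower-cased whitespace-separated tokens.
    static Set<String> tokens(String s) {
        return new HashSet<>(Arrays.asList(s.toLowerCase().trim().split("\\s+")));
    }

    // Jaccard similarity: |intersection| / |union| of the two token sets.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0;
        }
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        int union = a.size() + b.size() - inter.size();
        return (double) inter.size() / union;
    }

    // Counts the comparisons that the threshold matcher labels as MATCH.
    // Each comparison is a two-element array holding the two entity descriptions.
    static int countMatches(List<String[]> comparisons, double threshold) {
        int matches = 0;
        for (String[] pair : comparisons) {
            if (jaccard(tokens(pair[0]), tokens(pair[1])) >= threshold) {
                matches++;
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String[]> comparisons = Arrays.asList(
                new String[] {"john smith athens", "john smith athens greece"},
                new String[] {"john smith", "mary jones"},
                new String[] {"acme corp", "acme corp"});
        int matches = countMatches(comparisons, 0.7);
        // Surrogate PQ: pairs marked MATCH divided by all comparisons in the blocks.
        System.out.println("matches = " + matches
                + ", surrogate PQ = " + (double) matches / comparisons.size());
    }
}
```

The resulting figures are only as reliable as the matcher. They describe agreement with the matcher's decisions, not with the true duplicates.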

     
