duplicates detection (again)

DrSAR
2008-09-10
2013-05-28
  • DrSAR
    2008-09-10

    Hi,
    I thought I'd contribute a little this time around. The problem of duplicate detection has bugged me as much as it has some other people (see previous poster). The solutions discussed at http://wiki.refbase.net/index.php/Handling_of_duplicates didn't appeal to me all that much, which is why I'm hoping I can kickstart some development in that direction.

    I thought a reasonably sound method for detection would be to create a hash from a few well-defined fields in the refs table, using the MySQL MD5() function. If one groups by those hashes and counts the groups that contain more than one record, one can find the potential duplicates:

    SELECT MD5(CONCAT(year, volume, pages)) AS hash,
           GROUP_CONCAT(serial),
           COUNT(serial) AS repetition
    FROM refs
    GROUP BY hash
    HAVING repetition > 1;

    The above calculates a hash based on year, volume and pages, but the set of fields could be made customizable.
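
    If some of those fields can be NULL, a variant along these lines (just a sketch) might be preferable: CONCAT() returns NULL as soon as any argument is NULL, whereas CONCAT_WS() skips NULL values, and the separator avoids accidental collisions (e.g. volume "1"/pages "234" vs. volume "12"/pages "34"):

    SELECT MD5(CONCAT_WS('|', year, volume, pages)) AS hash,
           GROUP_CONCAT(serial),
           COUNT(serial) AS repetition
    FROM refs
    GROUP BY hash
    HAVING repetition > 1;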

    At this point the user should be presented with the potential duplicates and allowed to delete the redundant ones. I know there are plans (aren't there?) for some really clever merging, but at this time most would be happy, I imagine, with simple duplicate removal.

    The same method should work on import, assuming one checks the hashes of the incoming data against the existing ones.
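
    For instance (just a sketch; the literal values stand in for an incoming record's fields), the import code could run something like the following and skip or flag the record if any serial comes back:

    SELECT serial
    FROM refs
    WHERE MD5(CONCAT_WS('|', year, volume, pages)) =
          MD5(CONCAT_WS('|', '2007', '21', '1-10'));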

    Comments?

    Stefan.

     
    • Marco
      2008-09-11

      Hi Stefan and refBase developers!

      I am new to refBase: I am going to install it quite soon. A feature that is essential for me is avoiding duplicates that come from importing the same Endnote or BibTeX file twice.

      If I understood correctly, with your approach you also want to detect entries that are not necessarily exactly the same. Is this correct?

      In my case it would be enough to check whether entries are *exactly* the same at import time. This is maybe easier to implement, and just as important as detecting *similar* entries.

      Kind regards,
      Marco

       
    • DrSAR
      2008-09-11

      Marco,
      Yes, as it stands, it will flag records that are equal in some of the fields that are less likely to be 'noisy' (author fields, for example, might contain full names vs. abbreviated names, etc.). However, what I propose could be extended to check for exact matches by simply including all the fields of interest in the CONCAT() call, as in the sketch below.
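
      For example (the column names assume a typical refbase refs table), an exact-match check could simply widen the hash:

      SELECT MD5(CONCAT_WS('|', author, title, year, publication, volume, pages)) AS hash,
             GROUP_CONCAT(serial),
             COUNT(serial) AS repetition
      FROM refs
      GROUP BY hash
      HAVING repetition > 1;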

      I have now had a closer look at the bleeding-edge code. I suspect a lot is already there and I was just about to reinvent the wheel. The import-time check should probably go into import_modify.php at the point where // VALIDATE DATA FIELDS: happens.
      Do others agree?

      Stefan

       
    • Matthias

      Hi Stefan & Marco,

      > I have now had a closer look at the bleeding edge code. I suspect
      > a lot is already there and I was just about to reinvent the wheel.

      Yes and no. The latest refbase version in the SVN trunk (as well as in the "bleeding-edge" branch) includes a function 'findDuplicates()' in file 'search.php'. When a user runs the 'duplicate_search.php' script, this function attempts to find all duplicate records for the given query. Author initials as well as whitespace, punctuation, character case and non-ASCII characters can be ignored before comparison. For each record, the function generates a unique string that is then used for comparison between records.
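
      Just to illustrate the idea in SQL (this is not the actual findDuplicates() code, which lives in PHP, and it only strips a handful of punctuation characters): a normalized comparison string could be built roughly like this before hashing:

      SELECT serial,
             MD5(LOWER(REPLACE(REPLACE(REPLACE(REPLACE(
                 CONCAT_WS('', author, title, year), ' ', ''), '.', ''), ',', ''), '-', ''))) AS norm_string
      FROM refs;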

      While (IMHO) the results are already quite useful, everything is computed at run time, i.e. for large data sets there may be speed (and other) issues.

      To improve things, I propose calculating some statistics for each record upon add/edit and storing them together with the record (or in a related MySQL table). This would allow for smarter duplicate detection as well as faster query processing.

      Useful text metrics could be:
      - the number of non-stop words, characters or vowels in a particular field
      - the number of authors/editors
      - the number of occurrences of a particular letter (e.g. an "e"), etc.

      These record metrics could be combined flexibly to build a unique string or hash for each record, similar to what Stefan proposed above. They would also allow some fuzzy searching, e.g. finding all records whose field contents differ only slightly (say by +/-5% or less). This could be done by comparing all numeric text metrics for a given record, then fetching all existing records whose numeric metrics are close (or identical) to the respective metrics of the test record.
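
      A rough sketch of what that could look like (table and column names here are hypothetical, not existing refbase structures):

      -- hypothetical companion table, populated whenever a record is added or edited
      CREATE TABLE ref_metrics (
        serial        MEDIUMINT UNSIGNED NOT NULL PRIMARY KEY,  -- refs.serial
        title_chars   SMALLINT UNSIGNED,  -- number of characters in the title
        title_words   SMALLINT UNSIGNED,  -- number of non-stop words in the title
        author_count  SMALLINT UNSIGNED   -- number of authors
      );

      -- fuzzy lookup: candidates whose title length lies within +/-5% of a
      -- test record that has 3 authors and a 62-character title
      SELECT serial
      FROM ref_metrics
      WHERE author_count = 3
        AND title_chars BETWEEN FLOOR(62 * 0.95) AND CEIL(62 * 1.05);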

      This would make it possible to handle cases where two records differ only in a few words or characters. A related case is where the author names are identical but differ only in order.

      The proposed setup should also work for import. In that case, text metrics would need to be computed for each of the imported records and compared with the metrics of the existing records. As a first step, the importer could simply skip records in the import file whose text metrics match those of existing records.

      However, ideally, some smart skip/update/merge capability would be offered on duplicate detection.

      Also, once records have been identified as "original" and "duplicate", future versions of refbase could offer the user the option to hide duplicate entries from the search results, or to merge data from duplicate entries with that of their original entry.

      We did discuss better duplicate detection methods on the dev list some time ago. Here's an excerpt from that discussion:

      Upon batch import, duplicate identification would require some additional logic (see below) and maybe also an additional user interface to present possible duplicates *before* the actual import. Possible duplicate records could, for example, be copied to a temporary table first and linked to the matching records that already exist in the database. Then there should be a diff/merge interface where a user could decide whether a new record should be imported or not, OR whether it should somehow be merged with the existing one.
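
      Purely as an illustration (none of this exists in refbase yet), such a staging table could look roughly like this, with one row per suspected duplicate linked to the existing record it matches:

      -- hypothetical staging table (names are illustrative) holding imported
      -- records that look like duplicates of existing ones
      CREATE TABLE import_duplicates (
        import_id       INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        matched_serial  MEDIUMINT UNSIGNED,  -- serial of the existing refs record it may duplicate
        author          TEXT,
        title           TEXT,
        year            SMALLINT,
        decision        ENUM('skip', 'import', 'merge') DEFAULT 'skip'  -- user's choice in the diff/merge interface
      );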

      Speaking of this diff/merge interface, one possible solution would be to present all differing fields of the two records next to (or beneath) each other, with two radio buttons in front of each field to let the user quickly decide, field by field, how the two records should be merged. If the fields were presented as editable text, a user could also decide to copy just some bits over to the fields of the existing record, but otherwise keep the existing record as is.

      Here's an ASCII mockup of such a diff/merge interface:

      +--------------------------------------------------------------+
      | Author ø Karnesky, R; Steffens, M; Jones, A; Stepputtis, ..  |
      |        o Karnesky, R; Steffens, MT                           |
      |                                                              |
      | Title  o refbase - a bibliographic web app         Type ø .. |
      |        ø refbase - an online reference manager          o .. |
      |                                                              |
      | Year   o 2008    Publication  ø Journal of Net Technologies  |
      |        ø 2007                 o J. Net Technol.              |
      |                                                              |
      | ...                                                          |
      +--------------------------------------------------------------+

      Matthias

       
  • Florian
    2010-07-01

    To follow up on this discussion:

    Would it make sense to automatically run duplicate_search.php right after a new article is imported via import.php, for example?

    Or would the time-out be a problem for larger datasets? I dunno…