From 5.5.R3, the following heuristics will be used to determine whether two records are duplicates and, if so, which one should be retained.
Two records are considered to be the same if they match as follows:
The first author's last name
The first 10 letters of the title
The first 10 letters of the source
The year of publication
The DOI, if any
The first page number
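The matching rule above can be sketched as a comparison key. This is a minimal illustration in Python, not the tool's actual implementation: the `Record` type, its field names, `match_key`, and `is_duplicate` are hypothetical, and the normalization choices (lowercasing, trimming whitespace, treating a missing DOI or first page as equal to another missing value) are assumptions.

```python
from typing import NamedTuple, Optional


class Record(NamedTuple):
    """Hypothetical record layout; field names are illustrative only."""
    first_author_last_name: str
    title: str
    source: str
    year: int
    doi: Optional[str] = None
    first_page: Optional[str] = None


def _prefix(text: str, length: int = 10) -> str:
    # Assumption: "first 10 letters" means the first 10 characters
    # after trimming whitespace and lowercasing.
    return text.strip().lower()[:length]


def match_key(rec: Record) -> tuple:
    """Build a tuple of the six fields compared during deduplication."""
    return (
        rec.first_author_last_name.strip().lower(),
        _prefix(rec.title),
        _prefix(rec.source),
        rec.year,
        # Assumption: a missing DOI or first page only matches another missing one.
        rec.doi.strip().lower() if rec.doi else None,
        rec.first_page.strip().lower() if rec.first_page else None,
    )


def is_duplicate(a: Record, b: Record) -> bool:
    # Two records are treated as the same when all six key fields agree.
    return match_key(a) == match_key(b)
```

Comparing tuples keeps the rule easy to read and also lets the key serve as a dictionary key when grouping many records at once.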
Heuristics of retention
If one of them is from the Web of Science, keep that version and remove the one from the other source.
If neither record is from the Web of Science, keep the one with more information, i.e. the longer one.
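The two retention rules can be sketched as a small function. This is only an illustration under stated assumptions: the dictionary keys `database` and `text`, the `WEB_OF_SCIENCE` label, and the function name are hypothetical; "longer" is taken to mean the character count of the raw record text; and the case where both records come from the Web of Science, which the rules above do not cover, falls back to the length comparison.

```python
WEB_OF_SCIENCE = "Web of Science"  # hypothetical label for the source database


def choose_record_to_keep(a: dict, b: dict) -> dict:
    """Return the record to retain from a pair already judged to be duplicates."""
    a_is_wos = a["database"] == WEB_OF_SCIENCE
    b_is_wos = b["database"] == WEB_OF_SCIENCE

    # Rule 1: if exactly one record is from the Web of Science, keep it.
    if a_is_wos != b_is_wos:
        return a if a_is_wos else b

    # Rule 2: otherwise keep the longer record.
    # Assumption: length is the number of characters in the raw text,
    # and a tie keeps the first record.
    return a if len(a["text"]) >= len(b["text"]) else b
```

For instance, given two Dimensions records where one carries an abstract and a reference list and the other does not, the function returns the fuller one regardless of argument order.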
Here is an example. Two records from Dimensions are found to be duplicates despite having different pub IDs:
Dimensions:pub.1042802800
Dimensions:pub.1059931958
They agree in terms of the author, title, journal, DOI, and year, but the first page number is missing in both of them.
The former has 10 references with 757 citations and a PubMed ID, whereas the latter shows no references and fewer citations (425). See below for a copy from Dimensions. The latter is a much shorter record; as a result, the former will be retained and the latter discarded.
Dimensions:pub.1042802800
The Unified Medical Language System (UMLS): integrating biomedical terminology
Nucleic Acids Research, 32(suppl_1), d267-d270, 2004
https://doi.org/10.1093/nar/gkh061
Authors
Olivier Bodenreider - United States National Library of Medicine
Abstract
The Unified Medical Language System (http://umlsks.nlm.nih.gov) is a repository of biomedical vocabularies developed by the US National Library of Medicine. The UMLS integrates over 2 million names for some 900,000 concepts from more than 60 families of biomedical vocabularies, as well as 12 million relations among these concepts. Vocabularies integrated in the UMLS Metathesaurus include the NCBI taxonomy, Gene Ontology, the Medical Subject Headings (MeSH), OMIM and the Digital Anatomist Symbolic Knowledge Base. UMLS concepts are not only inter-related, but may also be linked to external resources such as GenBank. In addition to data, the UMLS includes tools for customizing the Metathesaurus (MetamorphoSys), for generating lexical variants of concept names (lvg) and for extracting UMLS concepts from text (MetaMap). The UMLS knowledge sources are updated quarterly. All vocabularies are available at no fee for research purposes within an institution, but UMLS users are required to sign a license agreement. The UMLS knowledge sources are distributed on CD-ROM and by FTP.
Publication references - 10
Dimensions:pub.1059931958
The Unified Medical Language System (UMLS): integrating biomedical terminology
Nucleic Acids Research, 32(90001), 267d-270, 2004
https://doi.org/10.1093/nar/gkh061
Authors
O. Bodenreider
Abstract
Not available.
Publication citations - 427
So do the Publication citations (a field that seems to be present only in the second record) get lost after deduplication?
Last edit: Stephan De Spiegeleire 2022-08-28