We looked for more information on how exactly the de-duplication functionality for WoS records works, but haven't been able to find it. What exactly is checked to decide whether a reference is a duplicate? Which fields are compared? Is any fuzzy matching involved? Is anything done with the first and middle names of authors, or with different spellings of journal titles? Is a log created in the process, and if so, where can we find it? (errors.log.txt always ends up empty and would presumably not give a full log anyway.)
Especially now that we can convert different formats into WoS plain-text files in CiteSpace and merge them, deduping obviously becomes more of an issue. Some regex functionality, and/or a comparison option whereby users can manually confirm or disconfirm whether automatically suggested author or journal names that are spelled or abbreviated differently should be merged, would certainly be extremely useful. And incidentally, maybe you could then even allow users to upload these manually validated concordance tables, so that we as a community can, over time, solve this issue once and for all? Or are these proprietary abbreviations of journals and names somehow also copyrighted?
[BTW, I should also add that we often get strange results suggesting there is no overlap, although we know there is overlap.]
Currently, the following corresponding WoS fields are compared: AU, PY, VL, PG. A function for more in-depth comparison and standardization of references is in the Database Interface: Resolve Discrepant References. The two functions are currently in different places. I will review them with reference to other data sources such as Scopus, Dimensions, and Lens. It may be a good idea to also check the DOI, PubMed ID, and title for records with valid data in those fields. Perhaps I will add an advanced version that allows users to select more checkpoints.
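For readers wondering what a field-based check along these lines might look like in practice, here is a minimal Python sketch. This is not CiteSpace's actual code; the dict-of-field-tags record format and the normalization (lowercasing, whitespace stripping) are illustrative assumptions. It treats two records as duplicates when their AU, PY, VL, and PG fields match:

```python
def dedup_key(record):
    """Build a comparison key from the WoS fields AU, PY, VL, PG.

    `record` is assumed to be a dict mapping WoS field tags to strings,
    e.g. {"AU": "Smith, J", "PY": "2019", "VL": "12", "PG": "100-110"}.
    """
    return tuple(record.get(tag, "").strip().lower()
                 for tag in ("AU", "PY", "VL", "PG"))

def deduplicate(records):
    """Keep the first record seen for each key; collect the rest as duplicates."""
    seen, unique, duplicates = set(), [], []
    for rec in records:
        key = dedup_key(rec)
        if key in seen:
            duplicates.append(rec)
        else:
            seen.add(key)
            unique.append(rec)
    return unique, duplicates
```

Note that an exact-match key like this would miss the spelling and abbreviation variants discussed above, which is exactly why fuzzy matching would be a welcome addition.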
Thanks for getting back to us so quickly. A few points:
- it's clear now which fields are checked, but what matching algorithm is used, and how does it decide which single version to keep and process?
- where can we find more information on Resolve Discrepant References, and which database interface are you referring to exactly?
- if you need any help figuring out (and/or documenting) the various fields etc., please don't hesitate to call on us;
- that (even more) advanced version sounds fantastic!
Thanks much again!
Just wondering whether it is already possible for us to use more 'checkpoints' to find duplicates, and to use various 'tricks', like fuzzy search or the ability to trim first and middle names to just initials. Another great option would be the ability to decide what to do with those dupes (i.e. which one to keep, e.g. the most complete one, or the one that has an abstract).
Also, some of my other questions remain unanswered...
Dear Professor, could you please clarify what exactly "duplicates" and "discarded" mean when merging and deduplicating the datasets from WoS and Scopus? The number next to "discarded" is larger, so does that mean that the duplicates are removed along with some other items?
Thanks a lot!
Total Records Found: 4604
4213 Article
9 Book
75 Book Review
4 Chronology
61 Editorial Material
7 Letter
1 Meeting
12 Meeting Abstract
4 News Item
6 Note
6 Other
170 Proceedings Paper
1 Report
2 Reprint
33 Review
Total Unique Records (Article): 3092
Duplicates: 1121
Discarded: 1512
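One reading that the counts above support (an interpretation, not an official explanation): the analysis keeps Articles only, so "Discarded" would cover both the duplicates and the non-Article record types. The arithmetic checks out:

```python
total = 4604            # Total Records Found
articles = 4213         # records of type Article
unique_articles = 3092  # Total Unique Records (Article)
duplicates = 1121       # reported Duplicates

non_articles = total - articles            # 391 records of other types
# Unique articles = articles minus duplicates:
assert articles - duplicates == unique_articles
# Discarded = duplicates + non-Article records:
assert duplicates + non_articles == 1512   # matches "Discarded: 1512"
```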
Any progress on a possibly better 4-way (WoS/Scopus/Dimensions/Lens) deduplication option with more user input?
Not much movement on this thread :). I just read through the (IMO quite impressive) CWTS paper in which they did a large-scale comparison of Scopus, Web of Science, Dimensions, Crossref, and Microsoft Academic. In that paper, they also describe the algorithm they used to match documents across all of these sources. Couldn't such an (IMO both sensible and implementable) algorithm also be included in CiteSpace? Please note that I have also asked Ludo et al. about this here.
Last edit: Stephan De Spiegeleire 2020-06-03
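A common pattern in cross-database matching of this kind is to match on DOI when one is present and fall back to normalized bibliographic fields otherwise. The sketch below shows that general idea only; it is not the CWTS algorithm, and the WoS-style field tags (DI for DOI, TI for title, PY for year) are assumptions about the record format:

```python
def match_key(record):
    """Two-stage matching key: prefer the DOI; otherwise fall back to
    a normalized title plus publication year. Field tags DI/TI/PY are
    assumed WoS-style; adapt for Scopus, Dimensions, or Lens exports.
    """
    doi = (record.get("DI") or "").strip().lower()
    if doi:
        return ("doi", doi)
    # Strip everything but letters and digits so punctuation and
    # spacing differences between databases don't block a match.
    title = "".join(ch for ch in record.get("TI", "").lower() if ch.isalnum())
    return ("title-year", title, record.get("PY", ""))
```

Records from different sources would then be grouped by `match_key`; anything sharing a key is treated as the same document.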
I am sorry to keep coming back to this, but we would still really like some details on the matching algorithm in CiteSpace and how users might be able to make changes to it. Thanks!