Re: [Refdb-users] making reality and name normalization coexist
From: Marc H. <Mar...@fr...> - 2004-01-14 15:47:57
Of course this problem has been tackled before. Even better, a
bibliographic database manager wrote a very interesting article about it:

"The Identification of Authors in the Mathematical Reviews Database"
<http://www.library.ucsb.edu/istl/01-summer/databases.html>

  "There was a time when Mathematical Reviews even attempted to
  "correct" the published form of a name, perhaps believing that some
  editors and publishers just didn't try hard enough. As a survivor of
  those days, an internal Mathematical Reviews concept is that of the
  "preferred name,"...

On Fri, 9 Jan 2004, Marc Herbert wrote:

> The database, being unable to tell which is the "right" spelling, or
> worse, in some cases not even being able to tell whether all these
> spellings designate the same person, should carefully preserve every
> character from every typist. So the database has no choice but to
> store the input "as is". Preferably pre-parsed, but without any
> character lost or added. All these inputs become (unfortunately, but
> what can you do?) different authors.
>
> Meanwhile, a "clever" algorithm that is aware of the most common
> name-typing mistakes in our culture computes a "normalized" (or
> "reduced", or "projected") representation of the given name for each
> record.
> Such a simple algorithm could be for instance:
> - throw away a set of characters (period, hyphen, apostrophe,
>   space,...)
> - lowercase all characters
> - dump all diacritics
> - ...
>
> This "projected name" is stored in the author record, alongside the
> typist input. It is _indexed_ and used to perform queries. It can be
> used to detect false duplicates easily and efficiently, including at
> input time! (i.e. "Don't you think you should rather write this name
> this way?")
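
To make the quoted proposal concrete, here is a minimal sketch of the "projected name" idea in Python. It is only an illustration, not RefDB code: the names `project_name`, `AuthorIndex` and `DROPPED` are made up for this example, the set of discarded characters contains just the four listed above (the "..." in the list is left open), and an in-memory dictionary stands in for the indexed database column.

```python
import unicodedata
from collections import defaultdict

# Characters thrown away before comparison. Assumption: only the four
# named in the post (period, hyphen, apostrophe, space).
DROPPED = set(".-' ")


def project_name(raw: str) -> str:
    """Return the normalized ("projected") form of a name as typed."""
    # NFKD splits an accented letter into base letter + combining mark,
    # so the marks can be dropped on their own.
    decomposed = unicodedata.normalize("NFKD", raw)
    kept = [
        ch for ch in decomposed
        if not unicodedata.combining(ch) and ch not in DROPPED
    ]
    return "".join(kept).lower()


class AuthorIndex:
    """Hypothetical stand-in for an author table indexed by projection."""

    def __init__(self) -> None:
        # projected name -> every spelling stored exactly as typed
        self._by_projection = defaultdict(list)

    def add(self, raw: str) -> list:
        """Store the input unchanged; return earlier look-alike spellings."""
        key = project_name(raw)
        lookalikes = list(self._by_projection[key])
        if raw not in lookalikes:
            self._by_projection[key].append(raw)
        return lookalikes
```

With this sketch, adding "Marc Herbert" and then "marc hérbert" keeps both spellings untouched, and the second call returns the first spelling so the front end can ask the "don't you think you should rather write this name this way?" question. "M. Herbert" projects to a different key and is not flagged, which shows how limited such a simple projection is.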