Re: [Refdb-users] making reality and name normalization coexist

Of course, this problem has been tackled before. Even better, a
bibliographic database manager has written a very interesting article
about it:

"The Identification of Authors in the Mathematical Reviews Database"
<http://www.library.ucsb.edu/istl/01-summer/databases.html>

  "There was a time when Mathematical Reviews even attempted to
  "correct"  the published form of a name, perhaps believing that some
  editors and publishers just didn't try hard enough. As a survivor of
  those days, an internal Mathematical Reviews concept is that of the
  "preferred name,"...

On Fri, 9 Jan 2004, Marc Herbert wrote:

> The database cannot tell which spelling is the "right" one or, worse,
> in some cases cannot even tell whether all these spellings designate
> the same person. It should therefore carefully preserve every
> character from every typist. So the database has no choice but to
> store the input "as is": preferably pre-parsed, but without any
> character lost or added. All these inputs become (unfortunately, but
> what can you do?) different authors.
>
> Meanwhile, a "clever" algorithm that is aware of the most common
> name-typing mistakes in our culture computes a "normalized" (or
> "reduced", or "projected") representation of the entered name for
> each record.

> Such a simple algorithm could be, for instance:
> - throw away a set of characters (period, hyphen, apostrophe,
>   space, ...)
> - lowercase all characters
> - strip all diacritics
> - ...
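
The three steps listed above could be sketched in Python roughly as
follows. This is only an illustration, not RefDB code: the function name
project_name and the exact set of dropped characters are assumptions, and
letters that have no Unicode decomposition (such as "ø" or "ł") would
pass through unchanged.

```python
import unicodedata

# Characters thrown away by the projection. The exact set is an
# assumption; the original list ends with "...".
DROPPED = set(".-' ")


def project_name(name: str) -> str:
    """Return a "projected" form of a name, for matching only.

    The typist's input is never modified; this value is derived from it.
    """
    # Split accented letters into base letter + combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    kept = [
        ch
        for ch in decomposed
        # Skip combining marks (the diacritics) and the dropped characters.
        if not unicodedata.combining(ch) and ch not in DROPPED
    ]
    return "".join(kept).lower()
```

With this sketch, "J.-P. Serre" and "JP Serre" project to the same string,
while the two original spellings remain available as entered.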
>
> This "projected name" is stored in the author record, alongside the
> typist's input. It is _indexed_ and used to perform queries. It can be
> used to detect false duplicates easily and efficiently, including at
> input time! (e.g. "Don't you think you should rather write this name
> this way?")
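
To make the input-time check concrete, here is a small in-memory sketch in
Python. It is hypothetical throughout: AuthorIndex, add and similar are
names invented for the example, and in a real database the projected name
would be an indexed column rather than a dictionary key. The projection
used here is just one possible choice.

```python
import unicodedata
from collections import defaultdict


def project_name(name):
    """One possible projection: strip diacritics and . - ' space, lowercase."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(
        ch
        for ch in decomposed
        if not unicodedata.combining(ch) and ch not in ".-' "
    ).lower()


class AuthorIndex:
    """Keeps every spelling as typed, grouped by its projected name."""

    def __init__(self):
        # projected name -> list of spellings exactly as entered
        self._by_projection = defaultdict(list)

    def similar(self, raw_name):
        """Stored spellings that share the projection but differ as typed."""
        bucket = self._by_projection.get(project_name(raw_name), [])
        return [stored for stored in bucket if stored != raw_name]

    def add(self, raw_name):
        """Store the input unchanged and return spellings to suggest."""
        suggestions = self.similar(raw_name)
        bucket = self._by_projection[project_name(raw_name)]
        if raw_name not in bucket:
            bucket.append(raw_name)
        return suggestions
```

A front end could call add() when a name is entered and, if the returned
list is not empty, ask the typist whether one of the existing spellings
was meant. Both spellings stay in the database either way.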