[Refdb-users] making reality and name normalization coexist
From: Marc H. <mar...@fr...> - 2004-01-09 17:04:56
On Wed, 7 Jan 2004, Marc Herbert wrote:

> > The database contains the name parts, plus a normalized
> > representation for speeding up queries that happens to look like some
> > formatted representation. When creating a bibliography, RefDB then has
> > to assemble the name parts in a fashion that matches the requirements
> > of the publisher.
> >
> > It is irrelevant how the cited author or the author
> > writing the paper would like to represent that name.

It is irrelevant to authoritarian stylesheets.

> if some information is lost in this great process, whatever its
> noble purpose is, some author may _never_ see his name printed as he
> would like to, even when some stylesheets allow it.

Please understand the arguments below as a very general discussion about
"what should be stored in a bibliographic database" and NOT anymore as a
discussion about "what should RefDB do" or "what do you think about the
RIS format". Otherwise, please start a new thread. Thanks in advance.

Let me try to sum up the issues:

- there is a strong need for a "normalized" representation of names, to
  avoid false duplicates and to improve the results of queries.
- some formatting tools/stylesheets "normalize" your names, deciding if
  and where you should put periods or dashes, whether to reduce given
  names to initials, etc.
- some less authoritarian publishers/formatting conventions leave more
  freedom about this, in order to please authors and grant them the
  right to write their (possibly "weird") name as they want.

I think it's technically possible to please everyone, by isolating the
issues. Let's take the example of this problematic name
(<http://citeseer.nj.nec.com/context/153368/0>):

  Chu, H.K. Jerry

(that's the precise way he writes it himself). Depending on the typist
(errare humanum est), the given name becomes:

- HK Jerry
- H.-K. Jerry
- Hsiao Keng Jerry
- Hsiaokeng Jerry
- etc.

[Of course, he could be much more severely mistyped, and then the
reasoning below will be less efficient/interesting. But then nothing
will work for severe cases except firing the typist.]

The database, being unable to tell which is the "right" spelling, or
worse, in some cases not even being able to tell whether all these
spellings designate the same person, should carefully preserve every
character from every typist. So the database has no choice but to store
the input "as is", preferably pre-parsed, but without any character lost
or added. All these inputs become (unfortunately, but what can you do?)
different authors.

Meanwhile, a "clever" algorithm that is aware of the most common
name-typing mistakes in our culture computes a "normalized" (or
"reduced", or "projected") representation of the given name for each
record. So there are only two different ones here:

- hkjerry
- hsiaokengjerry

Such a simple algorithm could for instance:

- throw away a set of characters (period, hyphen, apostrophe, space, ...)
- lowercase all characters
- drop all diacritics
- ...

This "projected name" is stored in the author record, alongside the
typist's input. It is _indexed_ and used to perform queries. It can be
used to detect false duplicates easily and efficiently, including at
input time! (e.g. "Don't you think you should rather write this name
this way?")

The sample algorithm above is just... an example. Obviously, the
"cleverness" of the algorithm deserves more discussion (and another
thread). This algorithm could easily be made configurable, for instance
depending on cultural specificities. Even better, there could be several
projections used by the database, covering different scenarios.
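To make this concrete, here is a minimal sketch of such a projection in
Python. It is purely illustrative, not existing RefDB code; the function
name project_name and the exact set of discarded characters are my own
assumptions:

  import unicodedata

  # Characters thrown away by the projection (period, hyphen,
  # apostrophe, space, ...)
  DISCARD = set(".-' ")

  def project_name(given_name):
      """Collapse a typist's input into a normalized key for queries."""
      # Decompose accented characters, then drop the combining marks
      # (i.e. the diacritics).
      decomposed = unicodedata.normalize("NFKD", given_name)
      stripped = "".join(c for c in decomposed
                         if not unicodedata.combining(c))
      # Throw away the discarded characters and lowercase the rest.
      return "".join(c for c in stripped if c not in DISCARD).lower()

  # Every typist's input is kept verbatim in the records; only the
  # projection is shared, so the variants above collapse to two keys:
  for v in ["H.K. Jerry", "HK Jerry", "H.-K. Jerry",
            "Hsiao Keng Jerry", "Hsiaokeng Jerry"]:
      print(v, "->", project_name(v))
  # H.K. Jerry       -> hkjerry
  # HK Jerry         -> hkjerry
  # H.-K. Jerry      -> hkjerry
  # Hsiao Keng Jerry -> hsiaokengjerry
  # Hsiaokeng Jerry  -> hsiaokengjerry

The projected column is what would get indexed, and the same lookup
could drive the input-time "don't you mean..." warning mentioned above.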
For instance, a concurrent "abbreviating" algorithm that keeps only the
capitals could run in parallel and give:

- HKJ

thus collapsing many more different inputs, and offering the client a
very efficient additional "search using initials" feature (see the small
sketch at the end of this message).

Stylesheets can pick up all the information they need (preferably
pre-parsed), and are free to normalize names as they want to at
publishing time.

Comments?
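For completeness, the initials-only projection mentioned above could be
as simple as the following (again purely illustrative; the helper name
is made up):

  def initials_only(name):
      """Keep only the capitals, e.g. for a "search using initials" feature."""
      return "".join(c for c in name if c.isupper())

  print(initials_only("H.K. Jerry"))        # HKJ
  print(initials_only("Hsiao Keng Jerry"))  # HKJ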