[Refdb-users] The case against <middlename>

Marc Herbert writes:
 > Whereas the distinction between <firstname> and <lastname>, is quite
 > shared across different cultures, since it can easily and formally
 > defined as "given" name and "family" name, the notion of <middlename>
 > seems very culture-specific, and its inclusion in RISX brings more
 > issues than benefits. I suggest its suppression from the RISX DTD and
 > the refdb databases (just like in other similar formats)
 >

Up-front: This is not going to happen.

 > In english, the middle name is a "second firstname", mainly used as
 > disambiguator.  It is more an extension of the <firstname> than a
 > first order part of the whole <author>. It may be only a nickname.
 >

In American English, where the middle initial is probably used most
widely, the initial can also derive from the mother's maiden name or
any other family name. You can't treat this as part of the first name.

 > * RIS (!) does not have it
 > <http://www.refman.com/support/risformat_tags_02.asp>
 >

They show a couple of middle initials and middle names in their examples.

 > * Formatting/sorting/... issues for subsequent operations
 >
 > This is the apparent drawback. Suppressing an element means providing
 > less information to subsequent tools. However, I think lack of
 > information is better than incomplete/imprecise information. IMHO,

I beg to differ. No information can't be better than partial
information. If that were true, we should stop doing research and
settle for the fact that we'll never know everything precisely.

 > <middlename> carries a refinement that belongs only to a very detailed
 > level of name representation (at least as detailed as the TEI model).
 > Using <middlename> together with <firstname> and <lastname> is only a
 > halfhearted (and thus imprecise) attempt to more deeply parse the
 > name. And as shown above, the RIS input syntax is not ready for that,
 > (I mean: AU - Lastname[,(F.|First)[(M.|Middle)[,Suffix]]] is not
 > "clean"), and the RISX input is buggy.

Then let's fix it.

 > - About formatting
 >
 > LaTeX/BibTeX for instance performs a second stage parsing
 > (part -> tokens) that relies on spaces, capitals and dots. It allows
 > automated abbreviations among others.  The user can use a "hack"
 > (escape braces {} inlined in the data) to prevent any "too clever"
 > formatting. The need for this hack proves that the automated
 > formatting may fail to address specific cases. But at least the data
 > model is simple and thus can't be wrong: all tokens of the complete
 > given name are stored together in the same string; if one stylesheet
 > does the formatting wrong, another one may do it right.
 > <http://nwalsh.com/tex/texhelp/bibtx-23.html>
 >

At least in the life sciences we're not free to choose a stylesheet of our
liking. If I want to publish in J.Biol.Chem., I'll have to follow the
citation and bibliography rules of that journal. And if these rules
tell me to format author names like "Last FM" (last name in full,
first name and middle name, if available, as initials), then I must be
able to pull a last, a first and a middle name from the stored data.

 >
 > - About sorting
 >
 > The question is here: what do we do with:
 > "Donald Knuth", "Donald E. Knuth", "Don Knuth" (without dot!), "D. E. Knuth",...
 >
 > 1) I think the best answer is: nothing. The tradition in the BibTeX
 > world is:
 >   But an author's complete name may be "Donald E. Knuth" or even
 >   "J. P. Morgan"; you should type it the way the author would like it to
 >   appear, if that's known.
 >
 > I think it is the responsibility of the author to "standardize" the
 > way his name is written across articles, and not the role of databases
 > to try to make "clever" but very error-prone merges. Again, lack of
 > information is better than wrong information. Is it such a big
 > deal that the names above are seen as different? After all, they
 > will be sorted just one after the other and match together
 > fuzzy queries. And automated merges are still possible, but as an
 > _ultimate_ step, not corrupting the data and losing information in the
 > first place.
 >

This entirely ignores that bibliography styles *require* us to rearrange
and reformat the name parts. Sticking with your example, journals
might request:

D Knuth
D.Knuth
D. Knuth
DE Knuth
D.E. Knuth
Knuth D
Knuth, D
Knuth, D.
Knuth DE
Knuth D.E.
Knuth, DE
Knuth, D.E.

and maybe another couple of permutations that I forgot. How Mr. Knuth
would like to read his name is unfortunately irrelevant for the
purposes of citing and creating bibliographies.
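
To make this concrete, here is a rough Python sketch (not refdb code; the
Author class and the format_author helper are made up for this mail) of
how a style driver could produce several of the variants above once the
last, first and middle names are stored separately. It assumes initials
are simply the first letter of each stored name part.

```python
from dataclasses import dataclass


@dataclass
class Author:
    """Hypothetical record holding the name parts separately."""
    first: str
    last: str
    middle: str = ""


def initials(author, dot=False):
    """First letter of the first and (if present) middle name."""
    parts = [p for p in (author.first, author.middle) if p]
    suffix = "." if dot else ""
    return "".join(p[0].upper() + suffix for p in parts)


def format_author(author, last_first=True, comma=False, dot=False,
                  use_middle=True):
    """Render one author according to a few style switches."""
    a = author if use_middle else Author(author.first, author.last)
    ini = initials(a, dot=dot)
    if last_first:
        return f"{a.last}{',' if comma else ''} {ini}"
    return f"{ini} {a.last}"


knuth = Author(first="Donald", middle="E.", last="Knuth")
print(format_author(knuth))                              # Knuth DE
print(format_author(knuth, comma=True, dot=True))        # Knuth, D.E.
print(format_author(knuth, last_first=False, dot=True))  # D.E. Knuth
print(format_author(knuth, use_middle=False))            # Knuth D
```

None of these switches can be honoured if all we have is one opaque
string such as "Don Knuth".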

 > Still want to hold on <middlename>s and make as little changes as
 > possible? Then twist the original user input as least as possible, and
 > do only perfectly reversible transformations: name parsing/splitting
 > it based _only_ on spaces (I know no language where the size of space
 > is meaningful), the output always gives those spaces back, and there
 > is no "clever" parsing using dots, dashes or any other sign (can
 > someone affirm that the dot "." is the universal abbreviation sign, in any
 > language?)
 >

Spaces do not help to distinguish between family and other names. Think
of authors with double family names. Does "Luis Lopez Penabad" turn
into "Penabad, Luis L" or "Lopez Penabad, Luis"? (It is the
latter.) Providing the name as "Lopez Penabad, Luis" removes these
ambiguities (just as using the correct markup in XML does).
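
A minimal Python sketch of the point, assuming the "Family, Given Middle"
input form (split_name is a hypothetical helper for illustration, not the
refdb parser): the comma, not the spaces, marks where the family name
ends, and input without a comma is rejected rather than guessed at.

```python
def split_name(raw):
    """Split 'Family, Given [Middle ...]' at the first comma.

    Returns (last, first, middle). Spaces alone cannot tell us where
    the family name ends, so comma-less input is refused.
    """
    if "," not in raw:
        raise ValueError(f"cannot tell family name from given names: {raw!r}")
    last, _, given = raw.partition(",")
    tokens = given.split()
    first = tokens[0] if tokens else ""
    middle = " ".join(tokens[1:])
    return last.strip(), first, middle


print(split_name("Lopez Penabad, Luis"))  # ('Lopez Penabad', 'Luis', '')
print(split_name("Penabad, Luis L"))      # ('Penabad', 'Luis', 'L')
```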

 > Users are generally not upset by a software that does NOT add a dot
 > that they forgot, but they get angry when they do not understand
 > at all how and why the software modifies their data, and then they
 > write long emails :-) Moreover, complexity brings bugs; simplicity
 > brings reliability.
 >

The process is called normalization. If you provide one entry as
"Miller,AM" and the next one as "Miller,A.M.", these will show up as
two different authors in the database. Normalization will result in
"Miller,A.M." in both cases and will map the entries correctly to the
same author. That is, searching for :AU:="Miller,A.M." will not drop
half of the available entries.
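
As an illustration only (this is a Python sketch, not the code refdb
actually runs; normalize_author is a made-up name), normalization of the
initials could look like the following. It assumes that an all-uppercase
run of letters in the given-name part is a string of initials, which
would misfire on a given name typed entirely in capitals.

```python
import re


def normalize_author(raw):
    """Map 'Miller,AM', 'Miller, A M' and 'Miller,A.M.' to 'Miller,A.M.'."""
    last, _, given = raw.partition(",")
    pieces = []
    # Runs of letters; dots, spaces and other punctuation are dropped here.
    for tok in re.findall(r"[^\W\d_]+", given):
        if tok.isupper():
            # Assumption: an all-caps token is a run of initials.
            pieces.extend(c + "." for c in tok)
        else:
            # Spelled-out names are kept as typed.
            pieces.append(tok + " ")
    return f"{last.strip()},{''.join(pieces).strip()}"


print(normalize_author("Miller,AM"))    # Miller,A.M.
print(normalize_author("Miller,A.M."))  # Miller,A.M.
```

Both spellings end up as the same key, so both entries map to one author.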

regards,
Markus

-- 
Markus Hoenicka
mar...@ca...
(Spam-protected email: replace the quadrupeds with "mhoenicka")
http://www.mhoenicka.de