[Refdb-users] Re: The case against <middlename>

On Tue, 9 Dec 2003, Markus wrote:

> In American English, where the middle initial is probably used most
> widely, the initial can also derive from the mother's maiden name or
> any other family name. You can't treat this as part of the first name.

The native-born Americans in the office next door to mine do treat it
as part of the first name. They just do not care where this middle name
comes from: it is still a "given" name. I think this formal definition:

 <familyname(s)> : universally defined by law
 <givenname(s)>  : parents (or similar) freely choose them, possibly
                   according to one local tradition or the other.

suits your example above.

>  > This is the apparent drawback. Suppressing an element means providing
>  > less information to subsequent tools. However, I think lack of
>  > information is better than incomplete/imprecise information. IMHO,
>
> I beg to differ. No information can't be better than partial
> information. If that were true, we should stop doing research and
> settle with the fact that we'll never know everything precisely.

Let me reformulate: "lack of detail is better than wrong details". No
information is lost by storing all "given" names in <firstname> and
not parsing them.

> At least in life sciences we're not free to choose a stylesheet of our
> liking. If I want to publish in J.Biol.Chem., I'll have to follow the
> citation and bibliography rules of that journal. And if these rules
> tell me to format author names like "Last FM" (last name in full,
> first name and middle name, if available, as initials), then I must be
> able to pull a last, a first and a middle name from the stored data.

A style sheet that mandates the use of "middlename" is, to put it
mildly, "culture-specific". If it insists on this, then it should be
able to extract this information _by itself_, and not spoil the
global data model because of this peculiarity. It seems this is
exactly how BibTeX's stylesheets work. References given in a previous
message seem to show that other formats do it the same way.

> This entirely ignores that bibliography styles *require* rearranging
> and reformatting the name parts. Sticking with your example, journals
> might request:
>
> D Knuth
> D.Knuth
> D. Knuth
> DE Knuth
> D.E. Knuth
> Knuth D
> Knuth, D
> Knuth, D.
> Knuth DE
> Knuth D.E.
> Knuth, DE
> Knuth, D.E.
>
> and maybe another couple of permutations that I forgot. How Mr. Knuth
> would like to read his name is unfortunately irrelevant for the
> purposes of citing and creating bibliographies.

I think this *requirement* is more or less flawed. The more
reformatting it requires, the more flawed it is, since it makes more
(wrong) assumptions about "name standardization" (i.e., that everybody
should have an American-English-looking name). The worst assumption is
of course the requirement of a <middlename>. Assumptions about dots are
also flawed, see for instance: <http://www.delorie.com/users/dj/>

However, simple transformations like: Donald -> D. seem sensible (I
mean: not so flawed), and would allow most of your examples above.

In any case, these dirty issues should not spoil the data model, they
should be (and can be!) postponed and solved by the stylesheets
_themselves_. So mistakes appear only in some printings, and there
are no irreversible mistakes in the data source.
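
To make this concrete, here is a rough Python sketch of a formatter that
derives initials at print time only. The function names and the three
style labels are hypothetical, and the "first letter plus optional dot"
rule is exactly the kind of assumption that belongs in a stylesheet
rather than in the stored data:

```python
def initials(given: str, dot: bool = True, sep: str = "") -> str:
    """Abbreviate each space-separated given name to its first letter."""
    suffix = "." if dot else ""
    return sep.join(token[0] + suffix for token in given.split())


def format_name(family: str, given: str, style: str) -> str:
    """Render one stored name in a (hypothetical) journal style."""
    if style == "DE Knuth":
        return f"{initials(given, dot=False)} {family}"
    if style == "Knuth, D.E.":
        return f"{family}, {initials(given)}"
    if style == "Knuth DE":
        return f"{family} {initials(given, dot=False)}"
    raise ValueError(f"unknown style: {style}")


# The stored record is never modified; only the printed form changes.
stored = ("Knuth", "Donald Ervin")
print(format_name(*stored, style="Knuth, D.E."))
```

If a style gets a name wrong, the error stays in that one printing and
the stored "Donald Ervin" is untouched.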

>  > Still want to hold on to <middlename>s and make as few changes as
>  > possible? Then twist the original user input as little as possible,
>  > and do only perfectly reversible transformations: name parsing/splitting
>  > is based _only_ on spaces (I know no language where the size of space
>  > is meaningful), the output always gives those spaces back, and there
>  > is no "clever" parsing using dots, dashes or any other sign (can
>  > someone affirm that the dot "." is the universal abbreviation sign,
>  > in any language?)

> Spaces do not help to distinguish between family and other names.

Agreed! (even if BibTeX has a complex algorithm to do this, but let's
forget it...)

I was thinking of a two-step parsing:

1) separate given and family names using _comma_, just like it is
   today
2) then further parse each one using _only spaces_

The rationale is this: if middle names must be kept in the data model
(sigh), at least allow only simple, perfectly reversible data
transformations in database operations. No dots that magically
appear or disappear, no variable number of tokens, etc. There is always
time to do this at the formatting step.
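
A minimal Python sketch of such a two-step parse follows. The function
names are hypothetical and this is not RefDB's actual code; it assumes
the input always contains the comma that separates family and given
names. Splitting on a single space character (rather than on runs of
whitespace) is what makes the round trip exact:

```python
def parse_name(raw: str) -> tuple[list[str], list[str]]:
    """Step 1: split at the first comma. Step 2: split each part on spaces."""
    family, comma, given = raw.partition(",")
    if not comma:
        raise ValueError("expected 'Family,Given' with a comma")
    # split(" ") keeps empty tokens, so leading or doubled spaces survive.
    return family.split(" "), given.split(" ")


def unparse_name(family_tokens: list[str], given_tokens: list[str]) -> str:
    """Exact inverse of parse_name: nothing is added, nothing is dropped."""
    return " ".join(family_tokens) + "," + " ".join(given_tokens)


for raw in ["Knuth,Donald Ervin", "Miller, A.M.", "Miller,AM"]:
    family_tokens, given_tokens = parse_name(raw)
    # The user's input always comes back byte for byte.
    assert unparse_name(family_tokens, given_tokens) == raw
```

No dot is ever inspected here, so "A.M." and "AM" each stay a single
token exactly as typed.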

>  > Users are generally not upset by software that does NOT add a dot
>  > that they forgot, but they get angry when they do not understand
>  > at all how and why the software modifies their data, and then they
>  > write long emails :-) Moreover, complexity brings bugs; simplicity
>  > brings reliability.

> The process is called normalization. If you provide one entry as
> "Miller,AM" and the next one as "Miller,A.M.", these will show up as
> two different authors in the database. Normalization will result in
> "Miller,A.M." in both cases and will map the entries correctly to the
> same author.

... and this normalization is too complex to be automated, since
no program can correctly handle all particular cases, so it should
be carried out manually by operators.
I guess this is already the way it goes in most real cases today?

> That is, searching for :AU:="Miller,A.M." will not drop
> half of the available entries.

But searching for :AU:="Miller,A.*M.*" will give a pretty good result,
and reveal to the operator the manual normalization work that must be
completed.
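
As an illustration only, here is the same idea in Python, assuming the
search string is interpreted as a regular expression that must match the
whole author field. The author list is made up for the example:

```python
import re

# Made-up author strings, stored exactly as the users typed them.
authors = [
    "Miller,AM",
    "Miller,A.M.",
    "Miller,A. M.",
    "Miller,Alan Mark",
    "Miller,B.M.",
    "Millerson,A.M.",
]

# "." matches any character, so "A.*M.*" tolerates dots and spaces.
pattern = re.compile(r"Miller,A.*M.*")
matches = [author for author in authors if pattern.fullmatch(author)]

for author in matches:
    print(author)
```

This lists the first four entries: all spelling variants of the same
author, plus one candidate ("Miller,Alan Mark") that the operator has
to confirm or reject by hand.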

Cheers,

Marc.