[Refdb-users] The case against <middlename>

               The (long) case against <middlename>

In brief
--------

The distinction between <firstname> and <lastname> is widely shared
across cultures, since it can be defined easily and formally as
"given" name versus "family" name. The notion of <middlename>, by
contrast, seems very culture-specific, and its inclusion in RISX brings
more issues than benefits. I suggest removing it from the RISX DTD and
from the refdb databases (other similar formats do not have it either).

If it is not removed completely, the parsing of RIS authors should at
least be greatly simplified so that it becomes predictable, and the
"clever" tricks with dots should be disabled.

Definition issue: what is a "middlename" actually?
--------------------------------------------------

In English, the middle name is a "second first name", mainly used as a
disambiguator.  It is more an extension of the <firstname> than a
first-order part of the whole <author>. It may be only a nickname.

In French and Spanish, a first name can be a compound of several
"tokens", up to three or more:
<http://klamath.stanford.edu/~molinero/html/surname.html>
Whereas in English a middle name is generally of low importance, the
parts of a compound first name may be of equal importance and
inseparable.

Please have a look at some Arabic, Persian and Indian names here:
  <http://www-cs-faculty.stanford.edu/~knuth/help.html#exotic>
and try to tell what their <middlename>s are :-)

I think the definition of a "middlename" differs widely from one
culture to another, and is simply undefined for some.

Current parsing bugs in refdb 0.9.4-pre2
----------------------------------------

Here is a borderline but real-world example.

H.K. Jerry Chu <http://citeseer.nj.nec.com/chu96zerocopy.html>

H.K. stands for Hsiao-Keng; it is an abbreviated compound name.
Anecdote: "Jerry" is here because very few people in the Western world
are able to pronounce "Hsiao-Keng" correctly. That is probably also the
reason why Hsiao-Keng became "H.K.": to avoid hearing painful sounds.

1)   RIS input ->              -> sqlite encoding

AU - Chu,H.K. Jerry        'Chu,H.K.Jerry','Chu','H','K Jerry '

     -> RISX output !

  <lastname>Chu</lastname>
  <firstname>H</firstname>
  <middlename>K</middlename>
  <middlename>Jerry</middlename>

2)  RISX input ->                  ->  sqlite encoding

   <lastname>Chu</lastname>          'Chu,H.Jerry','Chu','H','Jerry'
   <firstname>H.K.</firstname>
   <middlename>Jerry</middlename>

    RISX output  ->                  ->  RIS output

   <lastname>Chu</lastname>           AU  - Chu,H.Jerry
   <firstname>H</firstname>
   <middlename>Jerry</middlename>

The same bug (the K. is lost) occurs with H.-K.

Other formats/tools
-------------------

Alternatives to RISX do not know the concept of <middlename>:

* BibTeX does not have it
<http://nwalsh.com/tex/texhelp/bibtx-23.html>

* RIS (!) does not have it
<http://www.refman.com/support/risformat_tags_02.asp>

* If I understand correctly (please confirm, Bruce?), MODS also only knows
"family" and "given" as name parts.
  <http://www.loc.gov/standards/mods/mods-outline.html#name>
I did not understand the meaning of the "date" attribute (thanks in
advance for explaining), but I guess it is not equivalent to a
middlename :-) BTW, I like the choice of "family" and "given" as
attributes: they look very universal, and emphasize the meaning as
opposed to a somewhat controversial "position".

For all the above formats, the middle name is just a part of the first name.
Similarly...
* ...for TEI, the richest format, <firstname> and <middlename> are just two
<foreName> elements.
<http://www.tei-c.org/P4X/ref-PERSNAME.html>
<http://www.tei-c.org/P4X/ND.html#NDPER>

Interesting note from TEI:

  The "type" attribute may be used with both <foreName> and <surname> elements
  to provide further culture- or project- specific detail about the name
  component, for example:

             <foreName type="first">Franklin</foreName>
             <foreName type="middle">Delano</foreName>
                            ^^^^^^^^

The fact that none of these formats has the concept of <middlename>, or
at best relegates it to a non-standardized, culture-specific value of a
sub-sub-attribute, teaches us two things:

- their designers did not find this concept very useful
- data conversion from/to them will be easier if RISX does not have it

Issues brought by suppressing <middlename>
------------------------------------------

* Migration issue
Suggested migration path: no change to the database format yet, but:
- on output: systematically concatenate firstname and middlenames for
  legacy records (separated by a space).
- on input: systematically set <middlename> to NULL for new records,
  and store everything in <firstname>.
I think these simplifications are easy to code, and I volunteer to do them.
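
To make the two rules concrete, here is a minimal sketch in Python. It
is only an illustration, not refdb code: the function names
merge_given_names and store_given_name are hypothetical, and I assume
that a legacy record simply exposes a firstname string plus a list of
middlename strings.

```python
from typing import Optional, Sequence, Tuple


def merge_given_names(firstname: Optional[str],
                      middlenames: Sequence[str] = ()) -> str:
    """Output rule: join firstname and legacy middlenames with one space."""
    # Skip NULL/empty parts so that legacy records without a middlename
    # do not produce stray spaces.
    parts = [p.strip() for p in (firstname, *middlenames) if p and p.strip()]
    return " ".join(parts)


def store_given_name(given: str) -> Tuple[str, Optional[str]]:
    """Input rule: whole given name goes to firstname, middlename is NULL."""
    return given.strip(), None
```

With these rules "H.K." plus the legacy middlename "Jerry" is written
out as "H.K. Jerry", and a new record entered as "H.K. Jerry" is stored
untouched in <firstname>.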

* Formatting/sorting/... issues for subsequent operations

This is the apparent drawback. Removing an element means providing
less information to subsequent tools. However, I think missing
information is better than incomplete or imprecise information. IMHO,
<middlename> carries a refinement that belongs only to a very detailed
level of name representation (at least as detailed as the TEI model).
Using <middlename> together with <firstname> and <lastname> is only a
halfhearted (and thus imprecise) attempt to parse the name more
deeply. And as shown above, the RIS input syntax is not ready for that
(I mean: AU - Lastname[,(F.|First)[(M.|Middle)[,Suffix]]] is not
"clean"), and the RISX input is buggy.

- About formatting

LaTeX/BibTeX, for instance, performs a second-stage parse
(part -> tokens) that relies on spaces, capitals and dots. Among other
things, this allows automated abbreviation.  The user can use a "hack"
(escape braces {} inlined in the data) to prevent any "too clever"
formatting. The need for this hack proves that automated
formatting may fail to address specific cases. But at least the data
model is simple and thus can't be wrong: all tokens of the complete
given name are stored together in the same string; if one stylesheet
does the formatting wrong, another one may do it right.
<http://nwalsh.com/tex/texhelp/bibtx-23.html>

- About sorting

The question here is: what do we do with:
"Donald Knuth", "Donald E. Knuth", "Don Knuth" (without dot!), "D. E. Knuth",...

1) I think the best answer is: nothing. The tradition in the BibTeX
world is:
  But an author's complete name may be "Donald E. Knuth" or even
  "J. P. Morgan"; you should type it the way the author would like it to
  appear, if that's known.

I think it is the responsibility of the author to "standardize" the
way his name is written across articles, and not the role of databases
to attempt "clever" but very error-prone merges. Again, missing
information is better than wrong information. Is it such a big
deal that the names above are seen as different? After all, they
will sort right next to each other and will all match
fuzzy queries. And automated merges are still possible, but as a
_final_ step, without corrupting the data and losing information in the
first place.

2) Unless we decide something odd such as "author names are equal
iff their <firstname>s are equal (and we do not care about
<middlename>s)", <middlename> does not help in solving the
(difficult) problem above.
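
As a rough illustration of point 1 (names stored as typed still end up
side by side and can be matched loosely at query time), here is a small
Python sketch. The (lastname, given) tuples, the sort_key function and
the loosely_same rule (same last name, same first initial) are my own
assumptions for the example, not something refdb does.

```python
def sort_key(author):
    """Sort by last name, then by the given name exactly as typed."""
    last, given = author
    return (last.casefold(), given.casefold())


def loosely_same(a, b):
    """Hypothetical fuzzy rule: same last name and same first initial."""
    return (a[0].casefold() == b[0].casefold()
            and a[1][:1].casefold() == b[1][:1].casefold())


authors = [
    ("Knuth", "Donald E."),
    ("Lamport", "Leslie"),
    ("Knuth", "Don"),
    ("Knuth", "D. E."),
    ("Knuth", "Donald"),
]

# The stored strings are never modified; only the ordering is computed.
ordered = sorted(authors, key=sort_key)
```

All four Knuth variants come out next to each other in ordered, and
loosely_same treats them as candidates for the same person without
touching the stored data.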

Mild alternative
----------------

Still want to hold on to <middlename>s and make as few changes as
possible? Then alter the original user input as little as possible, and
apply only perfectly reversible transformations: name
parsing/splitting is based _only_ on spaces (I know of no language where
the width of a space is meaningful), the output always gives those spaces
back, and there is no "clever" parsing using dots, dashes or any other
sign (can anyone confirm that the dot "." is the universal
abbreviation sign in every language?)
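
A minimal sketch of such a space-only split, in Python for brevity. The
names split_given and join_given are hypothetical, and I assume the
first token maps to <firstname> and the remaining tokens to
<middlename>s. The round trip is exact up to the amount of whitespace
between tokens.

```python
from typing import List, Tuple


def split_given(given: str) -> Tuple[str, List[str]]:
    """Split on whitespace only: first token is firstname, rest middlenames."""
    tokens = given.split()  # dots, dashes and capitals get no special treatment
    if not tokens:
        return "", []
    return tokens[0], tokens[1:]


def join_given(firstname: str, middlenames: List[str]) -> str:
    """Inverse of split_given: give the spaces back, add nothing else."""
    return " ".join([firstname, *middlenames]).strip()
```

For example, "H.K. Jerry" splits into "H.K." and ["Jerry"] and joins
back to "H.K. Jerry"; "H.-K. Jerry" survives in the same way, because
neither the dot nor the dash is interpreted.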

Users are generally not upset by software that does NOT add a dot
that they forgot, but they get angry when they do not understand
at all how and why the software modifies their data, and then they
write long emails :-) Moreover, complexity brings bugs; simplicity
brings reliability.

Comments?