[Refdb-users] The case against <middlename>
From: Marc H. <Mar...@en...> - 2003-12-09 17:23:57
The (long) case against <middlename>

In brief
--------

Whereas the distinction between <firstname> and <lastname> is widely shared across cultures, since it can be easily and formally defined as "given" name and "family" name, the notion of <middlename> seems very culture-specific, and its inclusion in RISX brings more issues than benefits.

I suggest removing it from the RISX DTD and the refdb databases (as in other similar formats).

If it is not removed completely, at least the parsing of RIS authors should be simplified a lot in order to become predictable, and the "clever" tricks with dots should be disabled.

Definition issue: what is a "middlename" actually?
--------------------------------------------------

In English, the middle name is a "second first name", mainly used as a disambiguator. It is more an extension of the <firstname> than a first-order part of the whole <author>. It may be only a nickname.

In French and Spanish, a first name can be a compound of several "tokens", up to 3 or more.
<http://klamath.stanford.edu/~molinero/html/surname.html>
Whereas in English a middle name is generally of low importance, the parts of a compound first name may be of equal importance and inseparable.

Please have a look at some Arabic, Persian and Indian names here:
<http://www-cs-faculty.stanford.edu/~knuth/help.html#exotic>
and try to tell what their <middlename>s are :-)

I think the definition of a "middlename" differs greatly from one culture to another, and is even undefined for some.

Current parsing bugs in refdb 0.9.4-pre2
----------------------------------------

I take a borderline but real-world example.

H.K. Jerry Chu
<http://citeseer.nj.nec.com/chu96zerocopy.html>

H.K. stands for Hsiao-Keng; it is an abbreviated compound name.

Anecdote: "Jerry" is here because very few people in the Western world are able to pronounce "Hsiao-Keng" correctly. That is probably also the reason why Hsiao-Keng became "H.K.": to avoid hearing painful sounds.

1) RIS input -> sqlite encoding -> RISX output

   RIS input:
       AU - Chu,H.K. Jerry

   sqlite encoding:
       'Chu,H.K.Jerry','Chu','H','K Jerry '

   RISX output !
       <lastname>Chu</lastname>
       <firstname>H</firstname>
       <middlename>K</middlename>
       <middlename>Jerry</middlename>

2) RISX input -> sqlite encoding -> RISX output -> RIS output

   RISX input:
       <lastname>Chu</lastname>
       <firstname>H.K.</firstname>
       <middlename>Jerry</middlename>

   sqlite encoding:
       'Chu,H.Jerry','Chu','H','Jerry'

   RISX output:
       <lastname>Chu</lastname>
       <firstname>H</firstname>
       <middlename>Jerry</middlename>

   RIS output:
       AU - Chu,H.Jerry

Same bug (the K. is lost) with H.-K.

Other formats/tools
-------------------

Alternatives to RISX do not know the concept of <middlename>:

* BibTeX does not have it
  <http://nwalsh.com/tex/texhelp/bibtx-23.html>

* RIS (!) does not have it
  <http://www.refman.com/support/risformat_tags_02.asp>

* If I understand correctly (please confirm, Bruce?), MODS also only knows "family" and "given" as name parts.
  <http://www.loc.gov/standards/mods/mods-outline.html#name>
  I did not understand the meaning of the "date" attribute (thanks in advance for explaining), but I guess it is not equivalent to a middlename :-)
  BTW, I like the choice of "family" and "given" as attributes: they look very universal, and they emphasize the meaning as opposed to a somewhat controversial "position".

For all the above formats, the middlename is just a part of the firstname. Similarly,...

* ...for TEI, the richest format, <firstname> and <middlename> are just two <foreName> elements
  <http://www.tei-c.org/P4X/ref-PERSNAME.html>
  <http://www.tei-c.org/P4X/ND.html#NDPER>

  Interesting note from TEI:

      The "type" attribute may be used with both <foreName> and
      <surname> elements to provide further culture- or project-
      specific detail about the name component, for example:

      <foreName type="first">Franklin</foreName>
      <foreName type="middle">Delano</foreName>
                     ^^^^^^^^

The fact that none of these formats has the concept of <middlename>, or at best relegates it to an unstandardized, culture-specific value of a sub-sub-attribute, teaches us two things:

- their designers did not find this concept very useful
- data conversion from/to them will be easier if RISX does not have it

Issues brought by suppressing <middlename>
------------------------------------------

* Migration issue

Suggested migration path: no change to the database format yet, but when:

- outputting: systematically concatenate firstname and middlenames for legacy records (separated by a space).
- inputting: systematically set <middlename> to NULL for new records, and store everything in <firstname>.

I think these simplifications are easy to code, and I volunteer to do them.

* Formatting/sorting/... issues for subsequent operations

This is the apparent drawback. Suppressing an element means providing less information to subsequent tools. However, I think a lack of information is better than incomplete or imprecise information.

IMHO, <middlename> carries a refinement that belongs only to a very detailed level of name representation (at least as detailed as the TEI model). Using <middlename> together with <firstname> and <lastname> is only a halfhearted (and thus imprecise) attempt to parse the name more deeply. And as shown above, the RIS input syntax is not ready for that (I mean: AU - Lastname[,(F.|First)[(M.|Middle)[,Suffix]]] is not "clean"), and the RISX input is buggy.

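Going back to the migration path suggested above, here is a minimal sketch of what the two simplifications could look like. It is written in Python only for brevity, and it is not refdb code: the function names and the assumption that a record exposes one firstname string and one middlename string (possibly NULL) are hypothetical.

```python
from typing import Optional, Tuple


def given_name_for_output(firstname: Optional[str],
                          middlenames: Optional[str]) -> str:
    """Output side: join legacy firstname and middlenames with one space.

    Hypothetical helper; it assumes both values arrive as plain strings
    (or None for NULL), as in the sqlite rows quoted earlier.
    """
    parts = [p.strip() for p in (firstname, middlenames) if p and p.strip()]
    return " ".join(parts)


def columns_for_input(given_name: str) -> Tuple[str, None]:
    """Input side: keep the whole given name in firstname, middlename NULL."""
    return given_name, None


if __name__ == "__main__":
    # Legacy row from the H.K. Jerry Chu example: 'H' and 'K Jerry '
    print(given_name_for_output("H", "K Jerry "))
    # New record: nothing is split, nothing is dropped
    print(columns_for_input("H.K. Jerry"))
```

Note that the output side cannot restore what was already lost at input time: the dots of "H.K." are gone from the legacy row, so the join only guarantees that no further information disappears.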
- About formatting

LaTeX/BibTeX, for instance, performs a second-stage parsing (part -> tokens) that relies on spaces, capitals and dots. Among other things, it allows automated abbreviations. The user can use a "hack" (escape braces {} inlined in the data) to prevent any "too clever" formatting. The need for this hack proves that the automated formatting may fail to address specific cases. But at least the data model is simple and thus can't be wrong: all tokens of the complete given name are stored together in the same string; if one stylesheet does the formatting wrong, another one may do it right.
<http://nwalsh.com/tex/texhelp/bibtx-23.html>

- About sorting

The question here is: what do we do with "Donald Knuth", "Donald E. Knuth", "Don Knuth" (without dot!), "D. E. Knuth",...

1) I think the best answer is: nothing. The tradition in the BibTeX world is:

    But an author's complete name may be "Donald E. Knuth" or even
    "J. P. Morgan"; you should type it the way the author would like
    it to appear, if that's known.

I think it is the responsibility of the author to "standardize" the way his name is written across articles, and not the role of databases to attempt "clever" but very error-prone merges. Again, a lack of information is better than wrong information. Is it such a big deal that the names above are seen as different? After all, they will be sorted one right after the other, and fuzzy queries will match them together. Automated merges are still possible, but as an _ultimate_ step, not by corrupting the data and losing information in the first place.

2) Unless something odd is decided, such as "author names are equal iff their <firstname>s are equal (and we do not care about <middlename>s)", <middlename> does not help in solving the (difficult) problem above.

Mild alternative
----------------

Still want to hold on to <middlename>s and make as few changes as possible?

Then twist the original user input as little as possible, and do only perfectly reversible transformations: name parsing/splitting is based _only_ on spaces (I know of no language where the width of a space is meaningful), the output always gives those spaces back, and there is no "clever" parsing using dots, dashes or any other sign (can anyone affirm that the dot "." is the universal abbreviation sign, in every language?)

Users are generally not upset by software that does NOT add a dot they forgot, but they get angry when they do not understand at all how and why the software modifies their data, and then they write long emails :-)

Moreover, complexity brings bugs; simplicity brings reliability.

Comments?
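
P.S. To make the "perfectly reversible" requirement of the mild alternative concrete, here is a minimal sketch of space-only splitting. It is Python for brevity, not refdb code, and the function names are hypothetical. The only property it tries to show is that joining the tokens always gives back exactly the string the user typed.

```python
from typing import List


def split_given_name(given: str) -> List[str]:
    """Split on single spaces only; dots and dashes are left untouched."""
    return given.split(" ")


def join_given_name(tokens: List[str]) -> str:
    """Inverse of split_given_name: put the spaces back."""
    return " ".join(tokens)


if __name__ == "__main__":
    for name in ("H.K. Jerry", "H.-K.", "Donald E.", "Don"):
        tokens = split_given_name(name)
        # The round trip must return exactly what the user typed.
        assert join_given_name(tokens) == name
        print(tokens)
```

Splitting on the single space character (rather than on arbitrary whitespace runs) is what keeps the round trip exact, even for doubled or trailing spaces.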