[Refdb-users] Re: The case against <middlename>

Marc Herbert writes:
 > Let me reformulate: "lack of detail is better than wrong details". No
 > information is lost by storing all "given" names in <firstname> and
 > not parsing them.
 > 

You lose the information that a human brain can put into parsing the
name string, using cultural background information that is hard if not
impossible to teach to a machine.

 > A style sheet that mandates the use of "middlename" is, to put it
 > mildly, "culture-specific". If it insists on this, then it should be
 > able to extract this information _by itself_, and not spoil the
 > global data model because of this peculiarity. It seems this is
 > exactly how BibTeX's stylesheets work. References given in a previous
 > message seem to show that other formats do it the same way.
 > 

Once again, go complain to the publishers of roughly 5000 journals in
the life sciences. I also believe your argument is moot that a style
which requires the concept of middle names should be able to retrieve
the middle name by itself. By the same argument you could dump entirely
unparsed strings in any order onto bibliography software and expect it
to figure out how to parse them, since it needs to distinguish between
given and family names, titles, and suffixes. This simply expresses
your dislike of middle names.

 > I think this *requirement* is more or less flawed. The more
 > reformatting it requires, the more flawed it is, since the more
 > (wrong) assumptions it will make concerning "name standardization"
 > (i.e., that everybody should have a name that is american-english
 > looking).  The worst assumption is of course the requirement of a
 > <middlename>.  Assumptions about dots are also flawed, see for
 > instance: <http://www.delorie.com/users/dj/>
 > 

Once again, I didn't invent these requirements. I have to support them
if I want to support the 5000+ journals in the life sciences.

 > In any case, these dirty issues should not spoil the data model, they
 > should be (and can be!) postponed and solved by the stylesheets
 > _themselves_. So mistakes appear only in some printings, and there
 > are no irreversible mistakes in the data source.
 > 

I don't think it is a brilliant idea to have each of 700+ stylesheets
(if we consider only the life sciences for a moment) parse and munge
the names by themselves. Code duplication and bloat would be
inevitable. I'd rather have stupidly simple stylesheets that use the
preparsed names from the application.
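
To illustrate the point, here is a minimal sketch in Python. The
record layout (ParsedName) and the two style functions are hypothetical
names made up for this example, not RefDB's actual data model or
stylesheet format. It only shows that once the name parts are stored
separately, each style reduces to a one-line formatter.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ParsedName:
    """Hypothetical preparsed author record (not RefDB's real schema)."""
    last: str
    first: str
    middle: List[str] = field(default_factory=list)


def initials(name: ParsedName, dot: str = "", sep: str = "") -> str:
    """Build initials from the first and middle names."""
    parts = [name.first] + name.middle
    return sep.join(p[0].upper() + dot for p in parts if p)


def style_compact(name: ParsedName) -> str:
    # Journal style without dots or spaces, e.g. "Miller AM"
    return f"{name.last} {initials(name)}"


def style_dotted(name: ParsedName) -> str:
    # Journal style with dotted, spaced initials, e.g. "Miller, A. M."
    return f"{name.last}, {initials(name, dot='.', sep=' ')}"


author = ParsedName(last="Miller", first="Arthur", middle=["M."])
print(style_compact(author))
print(style_dotted(author))
```

Neither formatter has to guess where the given name ends and the
middle name begins; that decision was made once, when the data was
entered.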

 > The rationale is here: if middlenames should be kept in the data model
 > (sigh), have at least only simple, perfectly reversible data
 > transformations in database operations. No dots that magically
 > appear or disappear, no variable number of tokens, etc. It's always
 > time to do this at the formatting step.
 > 

That's too late as I pointed out elsewhere. You need the normalization
when you enter the data into the database to have a consistent and
reliable way to search names.

 > ... and this normalization is too complex to be automated, since
 > no program can correctly handle all particular cases, thus it should
 > be manually carried out by operators.
 > I guess this is already the way it goes in most real cases today?
 > 

So if you want to import 100 references that a nice colleague just
sent you, you start adding/removing spaces and dots from somewhere
between 100 and 1000 author names? Problematic as it may be in edge
cases, this is a job that *asks* to be automated. If it fails in too
many cases, we have to improve the code.
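
A rough sketch of such an automated pass, in Python. This is not
RefDB's actual import code: the function names, the chosen canonical
form ("Last, F. M."), and the rule that a purely alphabetic all-caps
token of up to three letters is a run of initials are all assumptions
made for illustration. Strings the function does not understand are
set aside for manual review instead of being guessed at.

```python
import re
from typing import List, Optional, Tuple


def normalize_author(raw: str) -> Optional[str]:
    """Return 'Last, F. M.' or None if the string is not understood."""
    if "," not in raw:
        return None
    last, given = raw.split(",", 1)
    last = last.strip()
    # Split the given part on dots and whitespace, dropping empty pieces.
    tokens = [t for t in re.split(r"[.\s]+", given) if t]
    if not last or not tokens:
        return None
    parts = []
    for tok in tokens:
        if tok.isalpha() and tok.isupper() and len(tok) <= 3:
            # Assumption: short all-caps token is a run of initials ("AM").
            parts.extend(ch + "." for ch in tok)
        else:
            # Anything else (e.g. a full given name) is kept unchanged.
            parts.append(tok)
    return f"{last}, {' '.join(parts)}"


def import_authors(raw_names: List[str]) -> Tuple[List[str], List[str]]:
    """Normalize a batch; return (normalized, needs_manual_review)."""
    ok, failed = [], []
    for raw in raw_names:
        norm = normalize_author(raw)
        if norm is None:
            failed.append(raw)
        else:
            ok.append(norm)
    return ok, failed


variants = ["Miller,A.M.", "Miller, A. M.", "Miller,AM", "Miller, A M", "Madonna"]
normalized, review = import_authors(variants)
print(normalized)
print(review)
```

With the four spellings collapsed into one stored form, a search for
that author is a plain equality test, and only the leftover list needs
a human.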

 > But searching for :AU:="Miller,A.*M.*" will give a pretty good result,
 > and reveal to the operator the manual normalization work that must be
 > completed.
 > 

This is what a reference manager should avoid at all costs. Why on
earth should a user be forced to use regular expressions just to find
references by author names? If this is necessary the data model is
flawed.

regards,
Markus

-- 
Markus Hoenicka
mar...@ca...
(Spam-protected email: replace the quadrupeds with "mhoenicka")
http://www.mhoenicka.de