[Refdb-users] Re: The case against <middlename>

On Wed, 10 Dec 2003, Markus wrote:

>  > I am NOT asking to remove the concept of <middlename> down to every
>  > refdb line code: I am just suggesting to postpone this concept to the
>  > rendering stage, so it does not spoil the data model.
>  >
>
> It does not spoil the data model to use a human brain for the parsing
> of names.

Right, except that using the RIS syntax implies using a buggy
_automated_ algorithm to do this parsing.

> An unparsed name string is spoilt data.

"Unparsed" data obviously cannot be "spoilt", only "rough".

> Marc Herbert writes:
>  > Let me reformulate: "lack of detail is better than wrong details". No
>  > information is lost by storing all "given" names in <firstname> and
>  > not parsing them.

> You lose the information that a human brain can put into parsing the
> name string, using cultural background information that is hard if not
> impossible to teach to a machine.

I am glad to hear this! Then fix the _automated_ RIS parsing/syntax by
adding a comma to it?

>  > A style sheet that mandates the use of "middlename" is, to put it
>  > mildly, "culture-specific". If it insists on this, then it should be
>  > able to extract this information _by itself_, and not spoil the
>  > global data model because of this peculiarity. It seems this is
>  > exactly how BibTeX's stylesheets work. References given in a previous
>  > message seem to show that other formats do it the same way.

> Once again, go complain to the publishers of roughly 5000 journals in
> the life sciences.

I did not know the publishers of 5000 life sciences journals were so
English-centric and ignorant of foreign cultures.  This bug is quite
amazing.

> I also believe that your argument is moot that if a
> style requires the concept of middle names it should be able to
> retrieve the middle name by itself. With the same argument you could
> dump entirely unparsed strings in any order onto a bib software and
> expect it to figure out how to parse it, as it requires to distinguish
> between given and family names, titles and suffixes.

If I remember correctly, this discussion is about the right level of
detail to adopt and where. So I find "With the same argument you could
dump entirely unparsed strings" not very constructive.

> This simply expresses your dislike of middle names.
                        ^^^^ !

Please go complain to the publishers of most of the truly
international journals (except life sciences), and to the designers of
all bibliographic formats I've seen (except risx).

I started to "dislike" middlenames only after doing research and
understanding that almost no one uses them.

>  > I think this *requirement* is more or less flawed. The more
>  > reformatting it requires, the more flawed it is, since the more
>  > (wrong) assumptions it will make concerning "name standardization"
>  > (i.e., that everybody should have a name that is american-english
>  > looking).  The worst assumption is of course the requirement of a
>  > <middlename>.  Assumptions about dots are also flawed, see for
>  > instance: <http://www.delorie.com/users/dj/>

> Once again, I didn't invent these requirements. I have to support them
> if I want to support the 5000+ journals in the life sciences.

>  > In any case, these dirty issues should not spoil the data model, they
>  > should be (and can be!) postponed and solved by the stylesheets
>  > _themselves_. So mistakes appear only in some printings, and there
>  > are no irreversible mistakes in the data source.

> I don't think it is a brilliant idea to have each of 700+ stylesheets
> (if we consider only the life sciences for a moment) parse and munge
> the names by themselves. Code duplication and bloating would be
> inevitable. I'd rather have stupid simple stylesheets that use the
> preparsed names from the application.

This life-science-specific "middlename parsing" could be factored out
without being pushed down into the database. Then refdb could be used
internationally without bugs and hassles: just by working around it.
Why not add a "-[no]middlename" option for output?

Same thing for the "clever" abbreviating code.
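
A minimal sketch of what postponing this to the output stage could look
like, in Python. The function name `format_author` and its `middlename`
flag are hypothetical and only stand in for the suggested option; they
are not existing refdb code. The whitespace split is a deliberately
naive, culture-specific guess, which is exactly why it belongs in the
formatter and not in the stored data.

```python
def format_author(family, given, middlename=False):
    """Render one author at output time from untouched stored data.

    `family` and `given` are used exactly as the user entered them.
    Only styles that insist on middle names pay for the guesswork.
    """
    if not middlename or not given.strip():
        # Default: no middle-name concept, emit the given names as stored.
        return f"{family}, {given}" if given.strip() else family

    # Naive assumption: the first token is the "first name" and every
    # following token is a "middle name" to be reduced to an initial.
    first, *middle = given.split()
    initials = [name[0] + "." for name in middle]
    return f"{family}, {' '.join([first, *initials])}"


print(format_author("Miller", "Arthur Michael"))
print(format_author("Miller", "Arthur Michael", middlename=True))
```

If the guess is wrong for some name, only that printing is wrong; the
record itself can still be rendered correctly by another style.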

>  > The rationale is here: if middlenames should be kept in the data model
>  > (sigh), have at least only simple, perfectly reversible data
>  > transformations in database operations. No dots that magically
>  > appear or disappear, no variable number of tokens, etc. It's always
>  > time to do this at the formatting step.

> That's too late as I pointed out elsewhere. You need the normalization
> when you enter the data into the database to have a consistent and
> reliable way to search names.

No, it's not too late: you can play the same game with dots and
spaces later, at search/formatting time, without subtly and silently
modifying the data that the user deliberately entered; that is what
really loses information.

It's just about where this "clever" code sits.
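
To illustrate, here is a rough Python sketch of normalizing at search
time only. The names `search_key` and `find_authors` are made up for
this example, and collapsing dots, whitespace and case is just one
assumed normalization among many; the point is that the stored strings
are never rewritten.

```python
import re


def search_key(name):
    """Build a comparison key; used for matching only, never stored."""
    # Drop dots and whitespace, ignore case.
    return re.sub(r"[.\s]+", "", name).casefold()


def find_authors(stored_names, query):
    """Return the stored names, unmodified, whose key matches the query."""
    wanted = search_key(query)
    return [name for name in stored_names if search_key(name) == wanted]


stored = ["Miller,A.M.", "Miller, A M", "Miller,Arthur M."]
print(find_authors(stored, "miller, a. m."))
```

In a real database the key could be precomputed in a separate indexed
column, so consistent searching does not require touching what the
user typed.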

>  > ... and this normalization is too complex to be automated, since
>  > no program can correctly handle all particular cases, thus it should
>  > be manually carried out by operators.
>  > I guess this is already the way it goes in most real cases today?
>  >
>
> So if you want to import 100 references that a nice colleague just
> sent you, you start adding/removing spaces and dots from somewhere
> between 100 and 1000 author names? Problematic as it may be in border
> cases, this is a job that *asks* to be automated.

Yes, but as you said above:

> You lose the information that a human brain can put into parsing the
> name string, using cultural background information that is hard if not
> impossible to teach to a machine.

so maybe the conclusion is that it should be "computer-assisted"
instead of "fully automated"?

Please never silently and subtly modify user data. At least ask for
confirmation! The real world is too complex for any "clever" name
standardization algorithm.
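
A small Python sketch of what "computer-assisted" could mean, under the
assumption that the import step receives some `normalize` function and
some `confirm` callback. Both are hypothetical placeholders here, not
refdb interfaces.

```python
def import_name(raw, normalize, confirm):
    """Return the name to store; never change it without consent."""
    proposed = normalize(raw)
    if proposed == raw:
        # Nothing to change, nothing to ask.
        return raw
    # Show both forms to the operator and keep the original on refusal.
    return proposed if confirm(raw, proposed) else raw


def strip_inner_spaces(name):
    """Toy normalizer for the example: remove spaces after dots."""
    return name.replace(". ", ".")


# An operator who refuses every change keeps the data as entered.
print(import_name("Miller,A. M.", strip_inner_spaces, lambda old, new: False))
```

For a batch of 100 references the callback could collect all proposed
changes and present them once, so the work stays mostly automated
while the operator keeps the last word.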

> If it fails in too many cases, we have to improve the code.

OK: I suggest one *extremely* simple improvement to this code: the
ability to disable it, at least at configure time (I will code this
for myself in any case).

>  > But searching for :AU:="Miller,A.*M.*" will give a pretty good result,
>  > and reveal to the operator the manual normalization work that must be
>  > completed.
>  >
>
> This is what a reference manager should avoid at all costs. Why on
> earth should a user be forced to use regular expressions just to find
> references by author names? If this is necessary the data model is
> flawed.

I made a discovery: the real-world data model for international names
is flawed, at least beyond the "family" and "given" name distinction.
Some people even make this more fuzzy by not signing with precisely
the same character strings each time. And worst of all: different
databases try to "standardize" this naming mess... in different ways!

Should we also "normalize" reality for the pleasure of
bibliographers? I prefer not to wait that long, to live with it, and
to learn to use wildcards while doing name searches; I guess that's what
everyone is already doing today.
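
For the record, a quick Python illustration of the kind of pattern
search meant above, using the "Miller,A.*M.*" expression from earlier
in this thread. Python's `re` module is only a stand-in here; I make no
claim about how refdb or its database engine evaluates the same
expression, and the sample names are invented.

```python
import re

# Pattern quoted earlier in this thread; note that an unescaped "."
# matches any character in regular expression syntax.
pattern = re.compile(r"Miller,A.*M.*")

names = ["Miller,A.M.", "Miller,Arthur M.", "Miller,B.M.", "Miller,A."]

# Keep every stored variant that the pattern matches from the start.
hits = [name for name in names if pattern.match(name)]
print(hits)
```

The hit list shows the operator, side by side, the variants that still
need manual attention.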

Cheers,

-- 
Marc A.Yves Herbert :-)