[Refdb-users] "reversibility" patch


Marc Herbert writes:
 >
 > > > I'll be happy to add a section to the docs in all caps and a red box
 > > > around it stating that author names will be normalized for the sake of
 > > > consistency.
 >
 > I could not find this yet in
 > <http://refdb.sourceforge.net/manual-0.9.4/book1.html>
 >
 > Explaining "how" they are normalized also seems rather vital to me.
 >

Sorry. Didn't get round to it yet.

 > BTW, while testing and comparing, I found some quirks that do not seem
 > to fit _any_ logic (as opposed to: not fit my taste).
 >

In this case you should document them and file a bug report instead of
spreading FUD.

 >   ----------------------------------------------------------
 >
 >               The "reversible" refdb patch
 >
 > Marc Herbert
 > $Date: 2004/01/05 21:30:50 $
 > $Revision: 1.2 $
 >
 >
 >                   ---- The issue ----
 >
 > Currently, refdb tries to "normalize" authors' name inputed in the
 > database, in order to avoid false duplicates and maybe to cope with
 > weird requirements of some bibliographic stylesheets. This means
 > fiddling with full stops and so-called "middlenames".
 >
 > I think refdb should either reliably perform this normalization
 > according to a documented, reviewed and formal specification
 > -- or not at all. Today it does it in an undocumented way,
 > silently modifying some user data with potential information loss in
 > corner cases.
 >

The purpose of the name mangling is to reduce all names consistently
to the RIS input format. This is currently the common denominator of
both RIS and RISX input until a richer data format like MODS is
implemented. If the name mangling is not consistent, then it is a bug
that needs to be fixed, not a feature that needs to be removed.

The bottom line is: if you supply your RIS data according to the RIS
input format, they won't be fiddled with at all. If you use a
different format, e.g. by leaving out periods or by adding random
spaces, RefDB attempts to mangle the data until they fit the RIS input
format. This works in many cases, but may fail in corner cases. The
important thing to understand is that the dots and spaces used in the
RIS input format do not have anything to do with the final
representation of a name in a formatted bibliography. The sole purpose
of the dots and spaces is to separate the name parts in order to tell
the parser where to chop. You could use slashes or question marks just
as well. As it is the job of bibliography software to output the
author names in all possible formatting variations, it is essential
not to store pre-formatted data in the database. However, it may be
useful (see below) to store pre-parsed data.

The same principle basically applies to the RISX input
format. However, the RISX format provides separate elements for the
name parts, so there is no need for textual separators at all. There
is no point in entering a middle initial as
<middlename>B.</middlename>. The middle initial is "B", not "B.". "B."
is a representation of a middle name which is used in some
bibliography styles (others don't use the dot or leave out the middle
name altogether) and can be trivially generated from "B". Therefore, a
<middlename>B</middlename> is all you need. If RefDB detects the
superfluous dot, it will remove it.
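
A minimal Python sketch of this idea, not RefDB's actual code. The
helper names name_parts and render_initial are made up for
illustration; only the element names come from RISX. The name part is
stored without the dot, and the dot is added back only when a style
asks for it.

```python
import xml.etree.ElementTree as ET


def name_parts(author_xml: str) -> dict:
    """Extract the name parts of one <author> element, dropping trailing dots."""
    author = ET.fromstring(author_xml)

    def clean(text):
        # A trailing period is formatting, not data: "S." and "S" are the same part.
        return (text or "").rstrip(".")

    return {
        "lastname": clean(author.findtext("lastname")),
        "firstname": clean(author.findtext("firstname")),
        "middlenames": [clean(m.text) for m in author.findall("middlename")],
    }


def render_initial(part: str) -> str:
    """One possible output style: abbreviate a name part to an initial plus dot."""
    return part[0] + "." if part else ""


parts = name_parts(
    "<author><lastname>Truman</lastname><firstname>Harry</firstname>"
    "<middlename>S.</middlename></author>"
)
```

Here parts["middlenames"] holds the bare "S", and a style that wants
the dot can regenerate it with render_initial.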

 > This (short and simple) refdb patch disables all modifications of
 > user-data, and lets the user decide by himself how names should be
 > "normalized" (assuming it's both desirable and possible).
 > Thanks to it, what gets _in_ refdb, gets _out_ untouched.
 > For instance, if you enter "Harry S Truman" in refdb, you would get back:
 > - without this patch:      "Harry S. Truman"
 > - with this patch:         "Harry S Truman" (amazing! and "reversible"...)
 >

Now we get to the purpose of normalization. As stated above, the data
in the AU field of a RIS dataset or an <author> element are not
strings that are supposed to be inserted into a bibliography as they
are. They are input formats that supply data (the name parts) for one
object in the database (an author). If an author has several reference
entries in the database, these entries must link to the same object
(the author), not to a specific representation of the author's
name. Assume the following cases:

Truman,Harry S.
Truman,Harry S
Truman, Harry S
Truman, Harry S.

The first one is what the RIS input format asks for. The others aren't
that different except for a space or a dot here and there. If these
belong to four references among 100, you probably wouldn't even notice
that the author names are written differently, although it is clear
that they mean the same author. If you add these four datasets to
RefDB, the first entry won't be mangled at all (as it sticks to the
rules). The other entries are normalized, and as a consequence, all
four references link to the same author. The normalized internal
representation of the author name is "Truman,Harry S." (amazing! and
"reversible"...).

If you go ahead and prevent this normalization, the four references
will point to four different author objects, one with the
representation "Truman,Harry S.", another one with the representation
"Truman,Harry S", and so forth. If you now run a query for references
by some "Truman,Harry S.", you'll miss 75% of the possible hits. This
is not good.
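
To make the effect of the normalization concrete, here is a minimal
Python sketch. It is not RefDB's C implementation: the function name
normalize_ris_author and the splitting rules are assumptions chosen
only to reproduce the examples in this thread. It ignores name
suffixes and does not handle abbreviated double names.

```python
import re


def normalize_ris_author(raw: str) -> str:
    """Reduce a personal name to a 'Last,First M.' style key (illustrative only)."""
    last, _, given = raw.partition(",")
    last = last.strip()
    # Periods and whitespace only separate the name parts; they carry no data.
    parts = [p for p in re.split(r"[.\s]+", given) if p]
    out = ""
    for part in parts:
        # A space is needed only after a spelled-out name, not after an initial.
        if out and not out.endswith("."):
            out += " "
        # Single letters are treated as initials and get a period.
        out += part + "." if len(part) == 1 else part
    return f"{last},{out}" if out else last


variants = [
    "Truman,Harry S.",
    "Truman,Harry S",
    "Truman, Harry S",
    "Truman, Harry S.",
]
keys = {normalize_ris_author(v) for v in variants}
```

With these rules all four variants collapse to a single key, so all
four references would link to one author.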

You can obviously work around this weakness of the patch by running
all queries against regular expressions, but this is not an option if
you design a simplified interface that allows users to pick names from
a list (something Mike is currently working on).
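
As a rough Python illustration of that workaround (the stored strings
are the four variants from above; the pattern is just one way to write
it, not something RefDB provides):

```python
import re

# What the database would hold if no normalization took place.
stored = [
    "Truman,Harry S.",
    "Truman,Harry S",
    "Truman, Harry S",
    "Truman, Harry S.",
]

# An exact-match query finds only the variant that was typed the same way.
exact_hits = [s for s in stored if s == "Truman,Harry S."]

# A regular expression has to anticipate every spacing and dot variation.
pattern = re.compile(r"Truman,\s*Harry\s+S\.?")
regex_hits = [s for s in stored if pattern.fullmatch(s)]
```

The exact match returns one of the four entries, the regular
expression all four, but only because whoever writes the query already
knows which variations to expect.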

 > Warning: this patch may or may not break further formatting by some
 > bibliographic stylesheets, depending if they expect "normalized" names
 > from the database. I do not care much about breaking stylesheets that
 > want you to change the way you write your name (probably in a more
 > "english" way).  I do not mind if they munge names when formatting for
 > publication, but pushing this "normalization" up to the database is
 > not acceptable to me. After all, respectful and less rigid formatting
 > tools also (co-)exist.  The answer to this question is likely to be in
 > the following function: backend-dbiba.c:format_firstmiddlename()

This is the key reason why we have to argue at all. You do not
understand that the database does not contain a formatted string that
shows how you would like to see your name printed on a piece of
paper. The database contains the name parts, plus a normalized
representation for speeding up queries that happens to look like some
formatted representation. When creating a bibliography, RefDB then has
to assemble the name parts in a fashion that matches the requirements
of the publisher. It is irrelevant how the cited author or the author
writing the paper would like to represent that name.

 >
 > By the way, be aware that you should NOT use spaces at the beginning
 > or at the end of RISX <name>(s), since this will lead to false
 > duplicates in the database _independently from this patch_. On the
 > other hand, RIS input (AU - field) is more or less space-insensitive.
 >

The RIS input is insensitive to leading and trailing spaces as the
latter are basically invisible in this input format. I did not
anticipate that anyone would add stray spaces to XML elements as they
are easily detected, but if this is a common problem it could be
handled just as well.

 >
 >
 > The SQL database uses 4 (redundant) fields to store author names:
 >        fullname, lastname, firstname, middlenameS
 >

The columns are not redundant. Redundancy implies that they hold the
same information but this is not the case. author_lastname,
author_firstname, and author_middlename hold the pre-parsed name parts
which are different by definition. The author_name field holds the
normalized representation of the full name or a corporate name. The
latter doesn't have name parts but it can't go into
e.g. author_lastname either as we have to distinguish between authors
that have only one name and corporate names.

The only redundancy in this setup is that a non-corporate name could
be assembled from the name parts. However, author names are usually
added once and then queried each time someone requests a reference or
a bibliography containing that name. For the sake of speed it makes
sense to parse the name once (when you add it) instead of each time it
is retrieved.
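
The following Python/SQLite sketch illustrates this layout. It is an
assumption-laden toy, not RefDB's schema: only the four author_*
column names come from the description above, while the table names,
the link table, the helper link_author and the corporate name are
hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE author (
        author_id         INTEGER PRIMARY KEY,
        author_name       TEXT UNIQUE,  -- normalized full name or corporate name
        author_lastname   TEXT,         -- pre-parsed parts, NULL for corporate names
        author_firstname  TEXT,
        author_middlename TEXT
    );
    CREATE TABLE reference_author (ref_id INTEGER, author_id INTEGER);
""")


def link_author(ref_id, name, last=None, first=None, middle=None):
    """Reuse the author row with this normalized name, or create it once."""
    row = con.execute(
        "SELECT author_id FROM author WHERE author_name = ?", (name,)
    ).fetchone()
    if row is None:
        cur = con.execute(
            "INSERT INTO author (author_name, author_lastname,"
            " author_firstname, author_middlename) VALUES (?, ?, ?, ?)",
            (name, last, first, middle),
        )
        author_id = cur.lastrowid
    else:
        author_id = row[0]
    con.execute("INSERT INTO reference_author VALUES (?, ?)", (ref_id, author_id))
    return author_id


# Four references by the same (already normalized) author, one corporate author.
for ref_id in (1, 2, 3, 4):
    link_author(ref_id, "Truman,Harry S.", "Truman", "Harry", "S")
link_author(5, "Some Committee")  # hypothetical corporate name, no name parts

hits = con.execute(
    "SELECT COUNT(*) FROM reference_author JOIN author USING (author_id)"
    " WHERE author_name = ?",
    ("Truman,Harry S.",),
).fetchone()[0]
```

The name is parsed once on insert; later queries compare against
author_name directly and find all four references.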

 > __________________________
 > Modifications to RIS input
 > (i.e., "addref -t ris")
 >

[...]

 >                         RIS input examples
 >
 >                                   Smith,   F.M.N.
 >                                   Chu,     H.K. Jerry
 >                                   Truman,  Harry S
 >
 >                     ->    database results
 >
 >  official    : "Smith,F.M.N."    "Smith"  "F"          "M N"
 >  patched     : "Smith,F.M.N."    "Smith"  "F.M.N."
 >
 >  official    : "Chu,H.K.Jerry"   "Chu"    "H"          "K Jerry "
 >  patched     : "Chu,H.K.Jerry"   "Chu"    "H.K.Jerry"
 >
 >  official    : "Truman,Harry S." "Truman" "Harry"      "S "
 >  patched     : "Truman,Harry S"  "Truman" "Harry S"
 >
 > (also notice the spurious space ending some middlenames with the
 > official version).

These spaces are due to a bug introduced after adding support for
multiple middle names. Fixed in CVS.

Please note that the last output of the patched version does not
follow the RIS specs, therefore it is not clear whether RefMan,
EndNote and the like import this properly.

 >
 >
 > ____________________________
 > Modifications to RISX input
 > (i.e., "addref -t risx")
 >
 > - full stops "tricks" are disabled
 >

As stated above, you should not use periods anyway as they are not
required. Following this simple rule will make most of your complaints
obsolete.

 >                           RISX input examples
 >
 >                                 "Smith"   "F."      "M."    "N."
 >                                 "Truman"  "Harry"   "S"
 >                                 "Chu"     "H.K."    "Jerry"
 >
 >                     ->    database results
 >
 >  official :  "Smith,F.M.N."     "Smith"   "F"      "M N"
 >  patched  :  "Smith,F. M. N."   "Smith"   "F."     "M. N."

Whether or not to use spaces after initials is a formatting issue that
is handled by the bibliography style. A period is enough as a
separator for the internal representation. The spaces are redundant
and bloat the data without a reason.

 >
 >  official :  "Truman,Harry S."  "Truman"  "Harry"  "S"
 >  patched  :  "Truman,Harry S"   "Truman"  "Harry"  "S"
 >

Again, the patched output may not be readable by other tools using RIS.

 >  official :  "Chu,H.Jerry"      "Chu"     "H"      "Jerry"    (information loss!)
 >  patched  :  "Chu,H.K. Jerry"   "Chu"     "H.K."   "Jerry"
 >

Please provide the RISX input that you used for this example. The
following input works just fine for me without any loss of data:

      <author>
        <lastname>Chu</lastname>
        <firstname>H</firstname>
        <middlename>K</middlename>
        <middlename>Jerry</middlename>
      </author>

(the markup is odd but RISX currently does not support something like
a "prime" given name which is not in the first position, as in
"M. Steven Miller". RIS does not support this either, so this will be
handled properly only by the forthcoming MODS-like data model)

Please note that in the official examples given above, most of the
output is correct although an improper input format was used. This is
what normalization is all about.

The only problem that I've come across while looking at these examples
is that the current implementation does not handle abbreviated double
names very well. "Schleifer,Karl-Heinz" is ok, but "Schleifer,K.-H."
will cause problems to the best of my knowledge. I'll look into this
and fix it if necessary.

 > However, for some unknown reason, bibtex output pulls the fullname
 > from the database and parses it again, so a small patch was needed
 > here again to prevent the addition of full stops.
 >

The "unknown reason" is negligence. I haven't heard positively of
anyone using the bibtex output, so this gets somewhat less attention
than it should.

 >
 > __________
 > Convertors
 >
 > The "nmed2ris" convertor also fiddles with authors' names in a similar
 > way. I can not yet say more about this, sorry: I do not use the
 > MED format at all and could not have tested modifications.
 >

It is clearly stated in the manual that this program is obsolete and
will eventually be removed from the distribution. If at all, have a
look at the med2ris.pl script.

regards,
Markus

-- 
Markus Hoenicka
mar...@ca...
(Spam-protected email: replace the quadrupeds with "mhoenicka")
http://www.mhoenicka.de