Re: [Refdb-users] "reversibility" patch

On Tue, 6 Jan 2004, Markus Hoenicka wrote:

>
> The purpose of the name mangling is to reduce all names consistently
> to the RIS input format. This is currently the common denominator of
> both RIS and RISX input until a richer data format like MODS is
> implemented.

This is one of the core issues indeed (thanks for starting with it :-)

The idea of trying to avoid false duplicates is great, even if it will
never be 100% reliable and will always depend in part on the human
typist (think for instance about: abbreviated or full input?). But
using the RIS input format to implement it seems so wrong to me that I
prefer to forget about it for the moment. It's a tradeoff, and you
surely can understand that I have a different perspective on it than
you do. Knowing in advance that our opinions differ, I posted this
patch on my web page instead of in some sourceforge tracker. I can't
see any FUD in this, just a different use of your software.

In some "MODS" future, if a name reduction format/scheme that I trust
is available, then I will be happy to give it my data. Meanwhile, I
prefer to keep it intact, since it does not fit into the RIS format.

> If the name mangling is not consistent, then it is a bug
> that needs to be fixed, not a feature that needs to be removed.

Great!

Unfortunately, if the syntax of the target format (RIS) is flawed from
the start, you cannot achieve full consistency, however hard you try.
Moreover, even "half-consistency" becomes harder and prone to
overcomplicated code and bugs, as we see.

> The bottom line is: if you supply your RIS data according to the RIS
> input format, they won't be fiddled with at all. If you use a
> different format, e.g. by leaving out periods or by adding random
> spaces, RefDB attempts to mangle the data until they fit the RIS input
> format. This works in many cases, but may fail in border cases.

This is crystal clear. Now my point: I care about border cases, and I
don't care about false duplicates. So I disable mangling. Simple! This
is a bit exaggerated, but you get the point.

By the way, about border cases:
 <http://catb.org/~esr/writings/taoup/html/ch01s06.html>
Rule of Repair: Repair what you can -- but when you must fail, fail
noisily and as soon as possible.

  Software should be transparent in the way that it fails, as well as
  in normal operation. It's best when software can cope with
  unexpected conditions by adapting to them, but the worst kinds of
  bugs are those in which the repair doesn't succeed and the problem
  quietly causes corruption that doesn't show up until much later.

Sorry, I do not have so many different... references :-)
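
To make the Rule of Repair quoted above concrete, here is a minimal
sketch of a parser that refuses input it cannot handle instead of
quietly repairing it. Everything in it is hypothetical: the names
parse_author_strict and AuthorFormatError, and the simplified
"Last,Given[,Suffix]" shape, are assumptions made for illustration and
are not taken from RefDB or from the RIS specification.

```python
import re

# Assumed, simplified shape: "Last,Given" with an optional ",Suffix".
# This is an illustration only, not the real RIS author grammar.
AUTHOR_SHAPE = re.compile(r"^[^,]+,[^,]+(,[^,]+)?$")


class AuthorFormatError(ValueError):
    """Raised when an author string does not have the assumed shape."""


def parse_author_strict(raw: str) -> dict[str, str]:
    """Split an author string on commas, or fail noisily.

    The given-name part is returned exactly as typed: no periods or
    spaces are added or removed.
    """
    if not AUTHOR_SHAPE.match(raw):
        raise AuthorFormatError(f"cannot parse author {raw!r}")
    parts = [p.strip() for p in raw.split(",")]
    return {
        "last": parts[0],
        "given": parts[1],
        "suffix": parts[2] if len(parts) > 2 else "",
    }
```

With this sketch, "Chu,H.K. Jerry" is accepted and its given part is
kept as "H.K. Jerry", while "Chu H.K. Jerry" (no comma) raises an error
right away instead of being guessed at.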

> The important thing to understand is that the dots and spaces used in the
> RIS input format do not have anything to do with the final
> representation of a name in a formatted bibliography.

> The sole purpose of the dots and spaces is to separate the name
> parts in order to tell the parser where to chop.

The important thing to understand is that dots may be meaningful in
some author names in some languages (including English), so they are
not far from being the worst separator ever.

> You could use slashes or question marks just as well.

If the designer of the RIS format had had some clue, he could have
spared us a lot of discussion and time.

> As it is the job of a bibliography software to output the
> author names in all possible formatting variations, it is essential
> not to store pre-formatted data in the database.
> However, it may be useful (see below) to store pre-parsed data.

Great, something we agree about! :-)

> The same principle basically applies to the RISX input
> format. However, the RISX format provides separate elements for the
> name parts, so there is no need for textual separators at all. There
> is no point to enter a middle initial as
> <middlename>B.</middlename>. The middle initial is "B", not "B.". "B."

The middlename may be "B" without being an initial. More generally,
the presence or absence of the dot may be information that some refdb
user does not want to lose, at least not in the database (even if he
does not care about some formatted output).

> is a representation of a middle name which is used in some
> bibliography styles (others don't use the dot or leave out the middle
> name altogether) and can be trivially generated from "B". Therefore, a
> <middlename>B</middlename> is all you need. If RefDB detects the
> superfluous dot, it will remove it.

I have really lost hope of making you understand how and why I
disagree with "the superfluous dot". Can't you just accept it as a
fact? I also disagree with the "middlename" concept, but that is
another story :-)

> This is the key point why we have to argue at all. You do not
> understand that the database does not contain a formatted string that
> shows how you would like to see your name printed on a piece of
> paper.

> The database contains the name parts, plus a normalized
> representation for speeding up queries that happens to look like some
> formatted representation. When creating a bibliography, RefDB then has
> to assemble the name parts in a fashion that matches the requirements
> of the publisher.

> It is irrelevant how the cited author or the author
> writing the paper would like to represent that name.

You really do not understand that, if some information is lost in this
great process, whatever its noble purpose is, some author may _never_
see his name printed as he would like to, even when some stylesheets
allow it. I will suggest in a later message (this one is already too
long and too chaotic, and I still have to think a bit about it) a
better solution that may please everyone (no more trade-off). That
assumes you accept that I have slightly different refdb needs, so that
we can discuss it.

>  > __________________________
>  > Modifications to RIS input
>  > (i.e., "addref -t ris")
>  >
>
> [...]
>
>  >                         RIS input examples
>  >
>  >                                   Smith,   F.M.N.
>  >                                   Chu,     H.K. Jerry
>  >                                   Truman,  Harry S
>  >
>  >                     ->    database results
>  >
>  >  official    : "Smith,F.M.N."    "Smith"  "F"          "M N"
>  >  patched     : "Smith,F.M.N."    "Smith"  "F.M.N."
>  >
>  >  official    : "Chu,H.K.Jerry"   "Chu"    "H"          "K Jerry "
>  >  patched     : "Chu,H.K.Jerry"   "Chu"    "H.K.Jerry"
>  >
>  >  official    : "Truman,Harry S." "Truman" "Harry"      "S "
>  >  patched     : "Truman,Harry S"  "Truman" "Harry S"
>  >

> Please note that the last output of the patched version does not
> follow the RIS specs, therefore it is not clear whether RefMan,
> EndNote and the like import this properly.

Thanks for this clarification. I personally do not care. I obviously
never claimed to stay compatible with a format while arguing that it is
flawed. Once again, that's the reason I did not even try to put this
patch in some sourceforge tracker.

> As stated above, you should not use periods anyway as they are not
> required. Following this simple rule will make most of your complaints
> obsolete.

Unfortunately, this simple "no-periods" rule is not acceptable to me.
Please do not forget to put it in the documentation, I really think it
is important.

>  >                           RISX input examples
>  >
>  >                                 "Smith"   "F."      "M."    "N."
>  >                                 "Truman"  "Harry"   "S"
>  >                                 "Chu"     "H.K."    "Jerry"
>  >
>  >                     ->    database results
>  >
>  >  official :  "Smith,F.M.N."     "Smith"   "F"      "M N"
>  >  patched  :  "Smith,F. M. N."   "Smith"   "F."     "M. N."

> Whether or not to use spaces after initials is a formatting issue that
> is handled by the bibliography style. A period is enough as a
> separator for the internal representation. The spaces are redundant
> and bloat the data without a reason.

A period is not a decent separator, since it may be part of user data.
Period.

>  >  official :  "Chu,H.Jerry"      "Chu"     "H"      "Jerry"    (information loss!)
>  >  patched  :  "Chu,H.K. Jerry"   "Chu"     "H.K."   "Jerry"
>  >
>
> Please provide the RISX input that you used for this example.

It's just above (below "RISX input"). I used double quotes " instead
of XML <tags>, to avoid clutter.

> The following input works just fine for me without any loss of data:
>
>       <author>
>         <lastname>Chu</lastname>
>         <firstname>H</firstname>
>         <middlename>K</middlename>
>         <middlename>Jerry</middlename>
>       </author>

It does not work, because the firstname is: "H.K.", while the nickname
is "Jerry" (a "nickname" which is by the way a bit far from a
so-called "middlename"... anyway)

It seems you cannot express "H.K." with the RIS syntax, since it uses
the period as a separator. What we see here is the combination of a
culture-specific concept (middlename) with a flawed syntax (period as
separator). Maybe you should inform the author he mistyped his name,
since it does not conform to the RIS syntax. And oh no, please do not
tell me about the ugly and overcomplicated:  "H.-K."...

> Please note that in the official examples given above, most of the
> output is correct although an improper input format was used. This
> is what normalization is all about.

You explained just above that the main and noble purpose of
normalization is to avoid false duplicates. But here you go much
further since you:
- normalize the typist's data completely and irreversibly, and
therefore the output as well.
- even ask him to make his _input_ RIS-compliant.
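
What "irreversibly" means here can be shown with a small sketch. The
function normalize_given is hypothetical and is not RefDB's code; it
only assumes a normalizer that treats periods and spaces alike.

```python
import re


def normalize_given(given: str) -> str:
    """Hypothetical lossy normalizer: periods and whitespace both
    collapse into a single space between name parts."""
    return " ".join(p for p in re.split(r"[.\s]+", given) if p)


# Four different ways a typist might enter the same given name.
variants = ["H.K. Jerry", "H. K. Jerry", "H K Jerry", "H.K.Jerry"]

# All four collapse to one stored form, so the original spelling
# cannot be recovered from the stored form alone.
normalized = {normalize_given(v) for v in variants}
```

This is exactly what helps against false duplicates, and also exactly
why the typist's own spelling is gone once only the normalized form is
stored.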

> The only problem that I've come across while looking at these
> examples is that the current implementation does not handle
> abbreviated double names very well. "Schleifer,Karl-Heinz" is ok,
> but "Schleifer,K.-H." will cause problems to the best of my
> knowledge. I'll look into this and fix it if necessary.

My patch applies a simple fix for this: it drops the period as a
separator and uses only spaces. That's all. I know it's not
RIS-compliant anymore, but I do not care, since I never used RIS and
never will, because it is flawed. And I manage false duplicates by hand,
which admittedly sucks, but hey, this is only a one-page long patch.
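
To illustrate the difference between the two separator choices, here is
a minimal sketch. It is neither the patch itself nor RefDB's parser:
the function names are made up, and the code only assumes that one
scheme splits on periods and spaces while the other splits on spaces
alone.

```python
import re


def split_period_and_space(given: str) -> list[str]:
    """Treat periods and whitespace as separators; periods are dropped."""
    return [p for p in re.split(r"[.\s]+", given) if p]


def split_space_only(given: str) -> list[str]:
    """Treat only whitespace as a separator; periods stay in the data."""
    return given.split()


for given in ["F.M.N.", "H.K. Jerry", "Harry S", "K.-H."]:
    print(repr(given), split_period_and_space(given), split_space_only(given))
```

In this sketch "H.K. Jerry" becomes three parts under the first scheme
and two parts ("H.K." and "Jerry") under the second, and the
abbreviated double name "K.-H." is cut into "K" and "-H" by the first
scheme but left whole by the second.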

By the way, the only specification of the RIS name syntax I could find
is here: <http://www.refman.com/support/risformat_tags_02.asp> and it
says nothing about periods or middlenames. Do you have a better
reference? And... a publicly available one?

Thanks for taking the time to answer, and thanks again for refdb.
Quoting you to conclude:
> Otherwise this is an example of the beauty of free software. If you
> code this for yourself, everyone can have it his way.

Cheers,

Marc.