Re: [Refdb-users] "reversibility" patch

Hi,

Marc Herbert writes:
 > using the RIS input format to implement it is so wrong to me that I
 > prefer to forget about it for the moment. It's a tradeoff, and you

Fine with me. The MODS-based data format will allow more flexibility
in marking up names, but this does not obviate the need to normalize
the names and the need to stick to an input format. The main
difference that I see is full support for "prime given" vs. "given"
names regardless of their position. This will preserve the distinction
between what's currently called first and middle names, but without
implying any sequence.

 > > The bottom line is: if you supply your RIS data according to the RIS
 > > input format, they won't be fiddled with at all. If you use a
 > > different format, e.g. by leaving out periods or by adding random
 > > spaces, RefDB attempts to mangle the data until they fit the RIS input
 > > format. This works in many cases, but may fail in border cases.
 > 
 > This is crystal clear. Now my point: I care about border cases, and I
 > don't care about false duplicates. So I disable mangling. Simple! This
 > is a bit exaggerated, but you get the point.
 > 

No, you don't need to disable mangling. You simply have to supply the
names according to the RIS specs, then they won't be mangled. And if
you start to use the extended notes, you'll probably start to worry
about duplicates in the author table.
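
To illustrate what such mangling amounts to, here is a minimal Python
sketch. It is not RefDB's actual code, and the function name
normalize_author is made up for this example. It assumes the target
form is "Lastname,F.M." with no spaces around the comma and a period
after each initial; suffixes and hyphenated initials are deliberately
not handled.

```python
import re


def normalize_author(raw):
    """Coerce a loosely typed author string toward 'Lastname,F.M.' form.

    Hypothetical helper for illustration only; it ignores suffixes and
    hyphenated initials such as 'K.-H.'.
    """
    last, _, given = raw.partition(",")
    last = last.strip()
    # Periods and whitespace both act as separators between name parts.
    parts = [p for p in re.split(r"[.\s]+", given) if p]
    if not parts:
        return last
    pieces = []
    for i, part in enumerate(parts):
        if len(part) == 1:
            # A single letter is treated as an initial: end it with a period.
            pieces.append(part + ".")
        elif i < len(parts) - 1:
            # A spelled-out name followed by more parts: separate with a space.
            pieces.append(part + " ")
        else:
            pieces.append(part)
    return last + "," + "".join(pieces)


print(normalize_author("Chu,H.K."))                # Chu,H.K.
print(normalize_author("Chu, H K"))                # Chu,H.K.
print(normalize_author("Roosevelt , Franklin D"))  # Roosevelt,Franklin D.
```

Input that already follows the assumed form comes back unchanged, and
sloppy input converges on the same string, which is what keeps
duplicates out of an author table.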

 > > The important thing to understand is that the dots and spaces used in the
 > > RIS input format do not have anything to do with the final
 > > representation of a name in a formatted bibliography.
 > 
 > > The sole purpose of the dots and spaces is to separate the name
 > > parts in order to tell the parser where to chop.
 > 
 > The important thing to understand is that dots may be meaningful to
 > some author name in some language (including English), so they are
 > not far from the worst separator ever.

I'll await your examples. Abbreviated middle names in
Anglo-American culture, so-called middle initials, are not an example
of this. An initial is a capital letter by definition. You may
represent your middle name in formatted output by appending a dot to
the initial, but you don't have to. You can leave out the dot, or
spell out your middle name. The initial is the data, the
initial plus the dot is one of several possible representations of
your middle name, i.e. it contains formatting information that does
not belong in a database.
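
A small Python sketch of this separation, with the name parts stored
without dots and the dots added only at formatting time. The function
format_name and its style names are hypothetical and chosen for this
example; they are not part of RefDB or of any bibliography style
language.

```python
def format_name(last, given, style):
    """Render stored name parts in one of several representations.

    'given' is a list of given names or bare initials, stored without
    dots. The style names are made up for this illustration.
    """
    if style == "full":
        return " ".join(given + [last])
    initials = [g[0] for g in given]
    if style == "initials_dotted":
        return " ".join(i + "." for i in initials) + " " + last
    if style == "initials_plain":
        return "".join(initials) + " " + last
    raise ValueError("unknown style: " + style)


name = ("Roosevelt", ["Franklin", "Delano"])
print(format_name(*name, "full"))             # Franklin Delano Roosevelt
print(format_name(*name, "initials_dotted"))  # F. D. Roosevelt
print(format_name(*name, "initials_plain"))   # FD Roosevelt
```

The same stored data yields all three renderings; the dot appears only
in the output of one style and never in the data itself.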

 > The middlename may be "B" without being an initial. More generally,
 > the existence or the non-existence of the dot may be information
 > that some refdb user does not want to lose, at least not in the
 > database (even if he does not care about some formatted output).

The dot is not information. It is formatting. Please separate data from
formatting. Roosevelt's middle name was not "D.", therefore "D."
cannot appear as a piece of data in the database. Roosevelt's middle
name was "Delano". "D." is one of several possible ways to format his
middle name. The dot does not convey any additional information even
if you know only the initial and not the full name. The border cases
like names that consist of a single letter (any examples?) will be
handled gracefully only in an XML-based input format like MODS - by
providing an appropriate attribute, not by fiddling with dots.

 > Thanks for this precision. I personally do not care. I obviously never
 > pretended to stay compatible with a format while arguing it is flawed.
 > Once again, that's the reason I did not even try to put this patch
 > in some sourceforge tracker.

I have to care as one of the goals of RefDB was to implement a
reference manager that can exchange data with commercial tools.

 > > Whether or not to use spaces after initials is a formatting issue that
 > > is handled by the bibliography style. A period is enough as a
 > > separator for the internal representation. The spaces are redundant
 > > and bloat the data without a reason.
 > 
 > A period is not a decent separator, since it may be part of user data.
 > Period.

A period is either a textual separator (in an input format) or
formatting (in a printed representation of the name), but not user
data.

 > > The following input works just fine for me without any loss of data:
 > >
 > >       <author>
 > >         <lastname>Chu</lastname>
 > >         <firstname>H</firstname>
 > >         <middlename>K</middlename>
 > >         <middlename>Jerry</middlename>
 > >       </author>
 > 
 > It does not work, because the firstname is: "H.K.", while the nickname
 > is "Jerry" (a "nickname" which is by the way a bit far from a
 > so-called "middlename"... anyway)
 > 
 > It seems you cannot express "H.K." with the RIS syntax, since it uses
 > the period as a separator. What we see here, is the combination of a
 > culture-specific concept (middlename), with a flawed syntax (period as
 > separator). Maybe you should inform the author he mistyped his name,
 > since it does not conform to the RIS syntax. And oh no, please do not
 > tell me about the ugly and overcomplicated:  "H.-K."...

This depends on how this name is spelled out. I know he's Chinese, but
would it be "Hans Karl" or "Hans-Karl"? And no, nicknames are not
part of RIS, but I haven't seen a nickname in a citation either. And
again, RefDB will not support names that can't be expressed in RIS
syntax until a MODS-based data format is implemented.

 > > Please note that in the official examples given above, most of the
 > > output is correct although an improper input format was used. This
 > > is what normalization is all about.
 > 
 > You explained just above that the main and noble purpose of
 > normalization is to avoid false duplicates. But here you go much
 > further since you:
 > - normalize the data from the typist completely and
 > irreversibly, and so the output.
 > - even ask him to make his _input_ RIS-compliant.
 > 

No, I've explained this previously and will do it again: If you input
your data according to the specs, they won't be mangled. If you insist
on using a different input format, RefDB will do its best to use
these data anyway but may fail in border cases. And it is of course
mandatory to provide the input in a RIS-compliant format as the data
format is based on RIS. I'm surprised that this seems new to you.

 > 
 > > The only problem that I've come across while looking at these
 > > examples is that the current implementation does not handle
 > > abbreviated double names very well. "Schleifer,Karl-Heinz" is ok,
 > > but "Schleifer,K.-H." will cause problems to the best of my
 > > knowledge. I'll look into this and fix it if necessary.
 > 
 > My patch does a simple fix to this: it drops the period as a
 > separator, using only spaces. That's all. I know it's not
 > RIS-compliant anymore, but I do not care, since I never used it and
 > will never since it is flawed. And I manage false duplicates by hand,
 > which admittedly sucks, but hey, this is only a one-page long patch.

It's fixed in CVS.

regards,
Markus

-- 
Markus Hoenicka
mar...@ca...
(Spam-protected email: replace the quadrupeds with "mhoenicka")
http://www.mhoenicka.de