[Refdb-users] RE: reversibility patch (cumulative reply)


Hi,

It's getting too tedious for me to address every ramification of this
discussion in due detail, so I'll try to pick up a few loose ends from
the previous mails. This will hopefully settle a few issues.

- As a meta-note, I believe much of the confusion in this discussion
  stems from the fact that we have a hard time keeping input
  formats and formatted author names apart. The input format has to
  supply as much information as possible in a parseable format. The
  input format is therefore not a free string. The output formats have
  to stick to the publisher's specifications even if we consider these
  specs stupid at times.

- If the discussion is only about so-called "middle names" (I
  prefer to call them what they are: either middle names or middle
  initials), then we can indeed reach a quick conclusion. At least in
  the life sciences, the following ways to output first and
  middle names are common practice:

  FM, F.M., F. M., First M, First M., First Middle, F, F., First

  This is independent of how the bearer of that name prefers to read
  his name. A reference manager that wants to support all variants
  ideally knows "First" and "Middle" (as separate entities) as all
  other variants can be derived from them. If you can't track down the
  full names (or can't verify whether "M" is a name as such or an
  abbreviation), there's not much you can do but use the abbreviations
  instead.

  It is quite true that Doris J. Delorie and DJ Delorie end up with
  the same formatted name if you use the first output style. However,
  this is not a bug in RefDB, this is a design decision of the
  publisher requiring that format. I can't argue about this, I just
  have to support it.

  The RIS input format is suitable to supply the full names or the
  abbreviations. It is weak in that it can't distinguish abbreviated
  names from one-letter non-abbreviated names. It also doesn't support
  "prime given names" which are not in the first position. These flaws
  will be addressed by switching over to a MODS-based format.

  BTW the Pubmed XML format (the output of the largest literature
  database in the life sciences) uses elements along the lines of
  "first", "middle", "last", "honorific".

- Just like TeX itself, BibTeX was designed by a mathematician
  for publications in mathematical journals. It is widely used in
  mathematics, computer science, and engineering. The BibTeX data
  format is apparently sufficient for publications in these fields. As
  Bruce pointed out though, we should not use BibTeX as a gold
  standard. The input format is flawed compared to what XML
  has to offer. TeX/BibTeX is not accepted by most journals in the
  life sciences anyway, partly because it does not support the
  citation and bibliography requirements of these journals.

- As far as I understand, the ALWD format (a legal citation style
  asking for the name "exactly as it appears on the front cover or
  title page") is probably not as flawed as I first thought.
  All examples shown in the available docs (I don't own the
  actual manual, though) use names in the natural order, that
  is "Franklin D. Roosevelt" or "Luis Lopez Penabad". I think we agree
  that this is entirely unsuitable as an input format as it is not
  parseable in any way. We still have to record this formatted string
  in addition to the parseable data if we want to support
  ALWD. Needless to say, RIS has no means to do this. A MODS-based
  input format will.
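
To illustrate the point above that all common output variants can be
derived from separate "First" and "Middle" entities, here is a minimal
Python sketch. The function name format_given and the style keys are
made up for this mail and are not part of RefDB; the sketch also
assumes a single middle name and simply takes the first letter as the
initial.

```python
def format_given(first, middle=None, style="FM"):
    """Format first/middle names in one of the styles listed above.

    Hypothetical helper for illustration only, not RefDB code.
    """
    f = first[0]                      # first initial
    m = middle[0] if middle else ""   # middle initial, empty if unknown
    styles = {
        "FM": f + m,
        "F.M.": f + "." + (m + "." if m else ""),
        "F. M.": f + "." + (" " + m + "." if m else ""),
        "First M": first + (" " + m if m else ""),
        "First M.": first + (" " + m + "." if m else ""),
        "First Middle": first + (" " + middle if middle else ""),
        "F": f,
        "F.": f + ".",
        "First": first,
    }
    return styles[style]


if __name__ == "__main__":
    # Only the abbreviation "J" is known here, as in the Delorie example.
    for style in ("FM", "F.M.", "F. M.", "First M", "First M.",
                  "First Middle", "F", "F.", "First"):
        print(style, "->", format_given("Doris", "J", style))
```

Note that the reverse direction does not work: "DJ" alone cannot tell
you whether the input was "Doris J" or something else, which is why
the input format has to carry the separate parts.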

My conclusions are:

- RIS is and remains flawed. There is no point in fiddling with it, as
  you don't gain much but break a lot and lose compatibility with
  commercial tools. The best strategy is to accept the limitations and
  treat the current implementation of RefDB as a "compatibility mode".

- XML is the way to go, along with an improved data model. Something
  like the following should be sufficient to handle most names. The
  following examples assume that you don't have the full information
  about all name parts and use some abbreviations instead. If you
  *had* the information, you'd certainly enter "Jessica" instead of
  "J".

  <name>
  <namePart type="primegiven">Doris</namePart>
  <namePart type="given" abbrev="yes">J</namePart>
  <namePart type="family">Delorie</namePart>
  <displayForm>Doris J. Delorie</displayForm>
  </name>

  <name>
  <namePart type="primegiven">DJ</namePart>
  <namePart type="family">Delorie</namePart>
  <displayForm>DJ Delorie</displayForm>
  </name>

  <name>
  <namePart type="given" abbrev="yes">H</namePart>
  <namePart type="given" abbrev="yes">K</namePart>
  <namePart type="primegiven">Jerry</namePart>
  <namePart type="family">Chun</namePart>
  <displayForm>H.K. Jerry Chun</displayForm>
  </name>

  <name>
  <namePart type="primegiven">Harry</namePart>
  <namePart type="given">S</namePart>
  <namePart type="family">Truman</namePart>
  <displayForm>Harry S. Truman</displayForm>
  </name>

  The displayForm element is used nowhere except in the ALWD style and
  any other style that wants the name exactly as printed on the cited
  work (this is not necessarily identical with how the author wants
  his name printed - the actual string may follow the conventions of
  the publisher of the cited work rather than the author's
  preference). This is also why Truman has a dot after his middle
  non-initial: that is how it was spelled on that particular book.

  Please note also that the parseable data make do without any dots.

  We'll have to push the MODS people a little in order to support the
  required attributes. The current MODS implementation is about as
  flawed as RIS in this respect, but as it is an open standard which
  is still evolving we have at least a chance to get this fixed.
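
As a rough sketch of how a formatter could consume name records like
the ones above, the following Python snippet reads one record with the
standard xml.etree.ElementTree module. The function names and the
"family plus undotted initials" style are assumptions chosen for
illustration; nothing here is existing RefDB or MODS code, and the
type/abbrev attributes are the proposed ones, not current MODS.

```python
import xml.etree.ElementTree as ET

# One of the sample records from this mail (proposed attributes).
NAME_XML = """
<name>
  <namePart type="given" abbrev="yes">H</namePart>
  <namePart type="given" abbrev="yes">K</namePart>
  <namePart type="primegiven">Jerry</namePart>
  <namePart type="family">Chun</namePart>
  <displayForm>H.K. Jerry Chun</displayForm>
</name>
"""


def parse_name(xml_text):
    """Split a <name> record into given parts, family name, display form."""
    root = ET.fromstring(xml_text)
    given, family = [], None
    for part in root.findall("namePart"):
        kind = part.get("type")
        if kind in ("given", "primegiven"):
            # Keep document order; remember which part is the prime name.
            given.append({
                "text": part.text,
                "prime": kind == "primegiven",
                "abbrev": part.get("abbrev") == "yes",
            })
        elif kind == "family":
            family = part.text
    return {"given": given, "family": family,
            "display": root.findtext("displayForm")}


def initials_style(name):
    """Hypothetical journal style: family name plus undotted initials."""
    initials = "".join(g["text"][0] for g in name["given"])
    return name["family"] + " " + initials


def as_printed_style(name):
    """ALWD-like style: return the string exactly as recorded."""
    return name["display"]


if __name__ == "__main__":
    chun = parse_name(NAME_XML)
    print(initials_style(chun))
    print(as_printed_style(chun))
```

The parseable parts feed the styles that need to abbreviate or
reorder, whereas displayForm is passed through untouched for styles
that want the name as printed.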

regards,
Markus

-- 
Markus Hoenicka
mar...@ca...
(Spam-protected email: replace the quadrupeds with "mhoenicka")
http://www.mhoenicka.de