[Refdb-devel] [ refdb-Bugs-2935197 ] AU field: Once capitalized, always capitalized

Bugs item #2935197, was opened at 2010-01-19 19:59
Message generated for change (Comment added) made by akusmin
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=385991&aid=2935197&group_id=26091

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: refdbd
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: akusmin (akusmin)
Assigned to: Markus Hoenicka (mhoenicka)
Summary: AU field: Once capitalized, always capitalized

Initial Comment:
I add citations to my RefDB databases using the command "addref foo.ris",
where foo.ris is the file containing the citations. When two citations in one file share the same author,
and the author name is capitalized in the first citation (FOONAME,F.B) but not in the second (Fooname,F.B), then,
after the citations are added, the command getref returns the *capitalized* author name for both citations.

Moreover, if there are three citations sharing the same author, where the first citation contains the AU field Fooname,Frank B., the second FOONAME,F.B,
and the third Fooname,F.B, then, after adding the references from this file, the command getref again returns FOONAME,F.B.

Why I am complaining:

I get citations from ISI Web of Science. As you know, author names and titles there are capitalized for records up to 1995/1996. Thus, some of my citations
have capitalized names and some do not. When I make a bibliography for my LaTeX document, I sometimes
see capitalized author names.

Maybe this is already fixed in the SVN version? Could you please check?
Thank you.

P.S.
Attached is a zip file with three ris files. ex1.ris contains a citation with AU Stuhrmann, H.B; the second file contains one more citation (the very first) where AU is STUHRMANN,H.B; and the third file contains three citations, where the first has STUHRMANN,Heinrich.B.

----------------------------------------------------------------------

Comment By: akusmin (akusmin)
Date: 2010-01-23 18:00

Message:
1)  *** Now assume that in a twist of fate you first add the all-caps
version of
the author and then the properly capitalized one. ***
I thought about it; I just did not mention how to take it into account.
Besides, it  seems I was not clear.
Inside a refdbd database, every citation would have two fields: one is the AU
field, the other is a capitalized AU field; let's call it AU-CAP.

Suppose the first citation with the author Fooname has the AU field
FOONAME. Then both the AU and AU-CAP fields are set to
FOONAME. However, if at some point a citation is added where the AU field for
Fooname is properly capitalized, then the
AU fields of all citations with AU-CAP equal to FOONAME are
updated to the new AU value, Fooname.
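
The AU / AU-CAP idea can be sketched roughly as follows. This is only an
illustration in Python, not refdbd code: the class AuthorIndex and its methods
are hypothetical, and a dictionary stands in for the author table. Because
every citation refers to its author through the AU-CAP key, replacing the
stored AU value once updates all citations of that author.

```python
class AuthorIndex:
    """Hypothetical in-memory stand-in for the author table."""

    def __init__(self):
        # Maps AU-CAP (upper-cased name) to AU (the spelling to display).
        self._display = {}

    def add(self, name):
        """Register an author spelling and return the spelling now stored."""
        key = name.upper()
        stored = self._display.get(key)
        # Keep the first spelling seen, but replace an all-caps spelling
        # as soon as a mixed-case one arrives.
        if stored is None or (stored.isupper() and not name.isupper()):
            self._display[key] = name
        return self._display[key]

    def lookup(self, name):
        """Return the stored spelling for any capitalization of the name."""
        return self._display[name.upper()]
```

Adding FOONAME,F.B. first and Fooname,F.B. afterwards leaves Fooname,F.B. as
the stored spelling for both citations, regardless of the order of insertion.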

The only problem is, of course, what happens if the database contains only one
citation with the AU field FOONAME:
we make a bibliography, and oops, we have to correct the
capitalization by hand.

2) 
**You may have mistaken my suggestion about an input filter. I'm not
asking
every user to develop such scripts.  ***

No, I understood that you meant a script which will be included in RefDB
sources.
And I agree, this is a good idea.

However, then,  there will be one filter for Pubmed, another for ISI, one
for something else etc.
I agree, one can automatically detect the format (ISI, Pubmed etc) and
invoke a suitable script.

If the only problem is capitalization, this can be solved in a way (for
example) similar to what I wrote above.

By the way, you may know that ISI uses a bunch of non standard CY tags,
e.g. "SO" means source, it's like JF but not always;
CY means conference year etc.  In principle, a script that converts ISI
citations to RIS  would be a nice add-on for RefDB users.
I wrote such a script in shell + awk  (pretty naive), but I guess you
could write a better script using Perl.

OK, here is the summary:
I think that in general a script for ISI, a script for Pubmed etc. is probably
the best way, because there are probably database-related differences other
than capitalization. But the capitalization problem could also be solved by
changing the internal mechanisms of refdbd, and this method is
independent of the source of the citations.

----------------------------------------------------------------------

Comment By: Markus Hoenicka (mhoenicka)
Date: 2010-01-22 00:38

Message:
Now assume that in a twist of fate you first add the all-caps version of
the author and then the properly capitalized one. If refdbd was changed in
the way you suggest, you'd end up having all appearances of this author in
all-caps. I'm afraid there's no way around manual intervention.

You may have mistaken my suggestion about an input filter. I'm not asking
every user to develop such scripts. I was rather thinking about including
such a script in the RefDB sources. It is fairly easy to automatically
invoke import filters before adding your data. I personally use Makefiles
to deal with Pubmed data. "make" converts existing Pubmed data to a ris
file. "make edit" allows you to enter the reprint status, path to an offprint
etc in an editor. "make install" finally adds the data to the database.
Once set up properly, you don't even have to know the names of your input
filters.

----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2010-01-21 20:09

Message:
Regarding the two drawbacks:
1) So far I dealt with this drawback by looking for a particular citation,
fixing author names in this citation, and using updateref. 
Of course, in the long run, it is not as efficient as de-capitalizing all
author names by a script prior to using addref <filename>.

2) I thought that one could use a non-capitalized name as the author name.
Example: Suppose there is only one citation, where the author is Fooname,K.
Associated with this citation is another field: the capitalized author name. We
use addref to add another citation, where the name is capitalized,
FOONAME,K. In the process of adding this citation, its capitalized name is
compared with the available capitalized names. After the match is found, the
non-capitalized name of the new citation is set to be the same as in the first
citation.

It seems to me that a similar method is already used in RefDB for some other
field (but I am not sure).
Thus, the argument about internal normalization does not look convincing
to me.

*** Wouldn't it be more straightforward to develop a Perl script which
touches
up older data from ISI Web? ***
Yes, it would. In fact, I have an awk script which does this for AU and JF
fields. Of course, it is easy to write such a script.

I am not arguing for the sake of argument; I am trying to express the
point of view of a somewhat lazy "average user",
who can't write a Perl/Python/awk etc script, who just wants to add
citations from some source and wants 
various fields to be decapitalized.

I agree with the Unix philosophy "One job - one tool". According to this
philosophy, maybe it is better to have some script
which is applied to a file with citations before this file is added by
addref.
However, maybe it is not too bad to use some mechanism like the one described in
2).

Anyway, don't take it too seriously. It is not a big issue.   

regards,
André

----------------------------------------------------------------------

Comment By: Markus Hoenicka (mhoenicka)
Date: 2010-01-20 13:18

Message:
Please don't get me wrong, I'd be happy to change the code if this fixed
your problem once and for all. However, the obvious fix has two drawbacks:
1) some of your author names will still be all uppercase, which is simply
not desirable. If you create bibliographies from such entries, they'll look
odd. 2) it breaks the database normalization, as it would allow the
same author to be represented by two separate entries which differ only in
case.

That is, the fix deals with one problem, leaves another one untouched and
creates a different one. This is why I'm reluctant here.

Wouldn't it be more straightforward to develop a Perl script which touches
up older data from ISI Web? We have similar tools, e.g. for the broken RIS
that EndNote exports. Feel free to post a bunch of sample reference data to
check whether it is worth a try.
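
A touch-up filter of this kind could look roughly like the sketch below. It is
written in Python rather than Perl, it is not part of the RefDB sources, and
the function names fix_author_line and fix_ris are made up for illustration.
It assumes that authors appear on lines of the form "AU  - Surname,Initials"
and only rewrites surnames that are entirely upper case; internal capitals
(as in "McDonald") cannot be recovered this way and would still need a manual
fix.

```python
import re

# RIS lines look like "AU  - Name": two-letter tag, two spaces, hyphen, space.
# Group 2 is the surname (everything before the first comma), group 3 the rest.
AU_LINE = re.compile(r"^(AU  - )([^,]+)(,.*)?$")

def fix_author_line(line):
    """Title-case an all-caps surname in an AU line; leave other lines alone.

    The line must not contain a trailing newline.
    """
    m = AU_LINE.match(line)
    if not m:
        return line
    tag, surname, rest = m.group(1), m.group(2), m.group(3) or ""
    if surname.isupper():
        surname = surname.title()  # "STUHRMANN" becomes "Stuhrmann"
    return tag + surname + rest

def fix_ris(text):
    """Apply fix_author_line to every line of a RIS document."""
    return "\n".join(fix_author_line(line) for line in text.split("\n"))
```

The output of fix_ris could be written to a new file and passed to addref, so
that the all-caps spelling never reaches the database.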

----------------------------------------------------------------------

Comment By: akusmin (akusmin)
Date: 2010-01-20 11:58

Message:

1) Regarding example3.ris : I see your point and I agree.
2) Regarding capitalized name overriding the non-capitalized one.
*** I still think it is wisest to
clean up your input data, the additional effort notwithstanding.***
Again, I see your point, and I was thinking that I could do that with all my
*old* files with citations using awk, sed and grep.
(For *new* files I don't have this problem because when I add a citation
from ISI Web, I decapitalize the author names.)

However, in general,  if this problem can be fixed with a relatively small
effort and without making addref significantly slow,
it should be fixed.
It is not a big issue, sure; maybe we should call it a "feature request".

Anyway: I have used RefDB since 2006 and I like it. Thank you for your work!

----------------------------------------------------------------------

Comment By: Markus Hoenicka (mhoenicka)
Date: 2010-01-19 22:57

Message:
example3.ris is misleading in your case because "Stuhrmann,H.B." and
"Stuhrmann,Heinrich B." must be treated as separate entries by RefDB
because only a human being can decide whether these are the same person,
regardless of capitalization. If you expect the database to treat these
strings as the same author, you'd have to maintain your input files
accordingly and settle on one version, preferably the one with the first
name spelled out.

This leaves the problem of "STUHRMANN,H.B." apparently outsmarting
"Stuhrmann,H.B.". When adding a reference entry, refdbd tries to find
existing authors in the database. If the author already exists, refdbd
simply sets a "link" to that author. A new author entry is created only if
it doesn't find an existing author. Testing for duplicates is done by a
SQL expression using the "=" operator. Apparently some database engines
treat this as a case-insensitive string comparison. I can reproduce your
results with MySQL,  but not with PostgreSQL. I have to admit that I've
never figured this might pose a problem. I still think it is wisest to
clean up your input data, the additional effort notwithstanding.

In order to fix this problem programmatically, I'd have to replace the "="
comparison with something that causes a case-sensitive comparison on all
database engines.
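
The difference between engines can be illustrated with a small sketch. It uses
Python's sqlite3 module as a stand-in, not refdbd's actual schema or code: the
table names author_nocase and author_binary and the helpers add_author and
demo are hypothetical, and the NOCASE collation only imitates the
case-insensitive comparison that MySQL's default collations typically perform.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Two hypothetical author tables: one compares names case-insensitively,
# the other byte for byte.
conn.execute("CREATE TABLE author_nocase (name TEXT COLLATE NOCASE)")
conn.execute("CREATE TABLE author_binary (name TEXT COLLATE BINARY)")

def add_author(table, name):
    """Insert name unless '=' finds an existing row; return the stored name."""
    row = conn.execute(
        f"SELECT name FROM {table} WHERE name = ?", (name,)
    ).fetchone()
    if row is None:
        conn.execute(f"INSERT INTO {table} (name) VALUES (?)", (name,))
        return name
    return row[0]  # "link" to the existing author entry

def demo(table):
    """Add the all-caps spelling first, then the mixed-case one."""
    conn.execute(f"DELETE FROM {table}")
    return [add_author(table, n) for n in ("STUHRMANN,H.B.", "Stuhrmann,H.B.")]
```

With the NOCASE table the second spelling is linked to the existing all-caps
entry, which matches the behaviour reported here; with the BINARY table the
two spellings end up as separate authors.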

regards,
Markus
