[Refdb-devel] [ refdb-Bugs-2935197 ] AU field: Once capitalized, always capitalized
From: SourceForge.net <no...@so...> - 2010-01-23 18:00:21
Bugs item #2935197, was opened at 2010-01-19 19:59
Message generated for change (Comment added) made by akusmin
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=385991&aid=2935197&group_id=26091

Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update.

Category: refdbd
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: akusmin (akusmin)
Assigned to: Markus Hoenicka (mhoenicka)
Summary: AU field: Once capitalized, always capitalized

Initial Comment:
I add citations to my RefDB databases using the command "addref foo.ris", where foo.ris is the file containing the citations. When one file contains two citations sharing the same author, and the author name is capitalized in the first such citation (FOONAME,F.B) while in the second it is not (Fooname,F.B), then, after the citations are added, the command getref returns the *capitalized* author name for both citations.

Moreover, if there are three citations sharing the same author, where the first citation contains the AU field Fooname,Frank B., the second FOONAME,F.B and the third Fooname,F.B, then, after adding the references from this file, the command getref again returns FOONAME,F.B.

Why I am complaining: I get citations from ISI Web of Science where, as you know, author names and titles are capitalized in records until 1995/1996. Thus, some of my citations have capitalized names and some do not. When I make a bibliography for my LaTeX document, I sometimes see capitalized author names.

Maybe this is already fixed in the SVN version? Could you please check?

Thank you.

P.S. Attached is the zip file with three ris files. ex1.ris contains a citation with AU Stuhrmann, H.B; the second file contains one more citation (the very first) where AU has STUHRMANN,H.B; and the third file contains three citations, where the first has STUHRMANN,Heinrich.B.
----------------------------------------------------------------------

Comment By: akusmin (akusmin)
Date: 2010-01-23 18:00

Message:
1) *** Now assume that in a twist of fate you first add the all-caps version of the author and then the properly capitalized one. ***

I thought about it; I just did not mention how to take it into account. Besides, it seems I was not clear. Inside a refdbd database, every citation would have two fields: one is the AU field, the other is a capitalized AU field, let's call it AU-CAP. Suppose the first citation with the author Fooname had the AU field FOONAME. Then both the AU and AU-CAP fields are set to FOONAME. However, if at some point a citation is added where the AU field for Fooname is properly capitalized, then the AU fields of all citations whose AU-CAP field equals FOONAME are updated with the new AU field, Fooname. The only problem is, of course, what happens if the database contains only one citation with the AU field FOONAME: we make a bibliography and, oops, we have to correct the capitalization by hand.

2) *** You may have mistaken my suggestion about an input filter. I'm not asking every user to develop such scripts. ***

No, I understood that you meant a script which would be included in the RefDB sources. And I agree, this is a good idea. However, there would then be one filter for Pubmed, another for ISI, one for something else etc. I agree that one can automatically detect the format (ISI, Pubmed etc.) and invoke a suitable script. If the only problem is capitalization, this can be solved in a way similar (for example) to what I wrote above.

By the way, you may know that ISI uses a bunch of non-standard CY tags, e.g. "SO" means source, it's like JF but not always; CY means conference year etc. In principle, a script that converts ISI citations to RIS would be a nice add-on for RefDB users. I wrote such a script in shell + awk (pretty naive), but I guess you could write a better script using Perl.
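A capitalization filter of the kind discussed here could start from something like the following Python sketch (the thread mentions shell, awk, and Perl; Python is used only for illustration). It title-cases all-caps surnames on AU lines of a RIS file and leaves every other line alone. The names `decap_surname` and `filter_ris`, and the rule "only touch surnames written entirely in capitals", are assumptions made for this sketch and are not part of RefDB. Names such as McDONALD or VAN DER WAALS would still need manual review.

```python
import re

# RIS lines look like "AU  - FOONAME,F.B"; only AU lines are touched here.
AU_LINE = re.compile(r"^(AU\s+-\s*)(.*)$")


def decap_surname(name):
    """Title-case the surname part if it is written entirely in capitals."""
    surname, sep, rest = name.partition(",")
    letters = [c for c in surname if c.isalpha()]
    # Single-letter and mixed-case surnames are left unchanged.
    if len(letters) > 1 and all(c.isupper() for c in letters):
        surname = surname.title()
    return surname + sep + rest


def filter_ris(lines):
    """Return the lines with AU fields touched up; line endings are preserved."""
    out = []
    for line in lines:
        body = line.rstrip("\r\n")
        ending = line[len(body):]
        match = AU_LINE.match(body)
        if match:
            body = match.group(1) + decap_surname(match.group(2))
        out.append(body + ending)
    return out
```

The output of `filter_ris` would be written to a new file, and that file passed to addref in place of the original.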
OK, here is the summary: I think that in general a script for ISI, a script for Pubmed etc. is probably the best way, because there are probably database-related differences other than capitalization. But the capitalization problem could also be solved by changing the internal mechanisms of refdbd, and this method is independent of the source of the citations.

----------------------------------------------------------------------

Comment By: Markus Hoenicka (mhoenicka)
Date: 2010-01-22 00:38

Message:
Now assume that in a twist of fate you first add the all-caps version of the author and then the properly capitalized one. If refdbd was changed in the way you suggest, you'd end up having all appearances of this author in all-caps. I'm afraid there's no way around manual intervention.

You may have mistaken my suggestion about an input filter. I'm not asking every user to develop such scripts. I was rather thinking about including such a script in the RefDB sources. It is fairly easy to automatically invoke import filters before adding your data. I personally use Makefiles to deal with Pubmed data. "make" converts existing Pubmed data to a ris file. "make edit" allows you to enter the reprint status, path to an offprint etc. in an editor. "make install" finally adds the data to the database. Once set up properly, you don't even have to know the names of your input filters.

----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2010-01-21 20:09

Message:
Regarding the two drawbacks:

1) So far I have dealt with this drawback by looking for a particular citation, fixing the author names in this citation, and using updateref. Of course, in the long run, this is not as efficient as de-capitalizing all author names by a script prior to using addref <filename>.

2) I thought that one could use a non-capitalized name as the author name. Example: Suppose there is only one citation where the author is Fooname,K.
Associated with this citation is another field: the capitalized author name. We use addref to add another citation, where the name is capitalized, FOONAME,K. In the process of adding this citation, its capitalized name is compared with the available capitalized names; after a match is found, the non-capitalized name of the new citation is set to be the same as in the first citation. It seems to me that a similar method is already used in RefDB for some other field (but I am not sure). Thus, the argument about internal normalization does not look convincing to me.

*** Wouldn't it be more straightforward to develop a Perl script which touches up older data from ISI Web? ***

Yes, it would. In fact, I have an awk script which does this for the AU and JF fields. Of course, it is easy to write such a script. I am not arguing for the sake of argument; I am trying to express the point of view of a somewhat lazy "average user", who can't write a Perl/Python/awk etc. script, who just wants to add citations from some source and wants various fields to be decapitalized. I agree with the Unix philosophy "One job - one tool"; according to this philosophy, maybe it is better to have some script which is applied to a file with citations before this file is added by addref. However, maybe it is not too bad to use some mechanism like the one described in 2).

Anyway, don't take it too seriously. It is not a big issue.

regards,
André

----------------------------------------------------------------------

Comment By: Markus Hoenicka (mhoenicka)
Date: 2010-01-20 13:18

Message:
Please don't take me wrong, I'd be happy to change the code if this fixed your problem once and for all. However, the obvious fix has two drawbacks:

1) some of your author names will still be all uppercase, which is simply not desirable. If you create bibliographies from such entries, they'll look odd.
2) it breaks the database normalization, as it will allow the same author to be represented by two separate entries which differ only in case.

That is, the fix deals with one problem, leaves another one untouched and creates a different one. This is why I'm reluctant here. Wouldn't it be more straightforward to develop a Perl script which touches up older data from ISI Web? We have similar tools, e.g. for the broken RIS that EndNote exports. Feel free to post a bunch of sample reference data to check whether it is worth a try.

----------------------------------------------------------------------

Comment By: akusmin (akusmin)
Date: 2010-01-20 11:58

Message:
1) Regarding example3.ris: I see your point and I agree.

2) Regarding the capitalized name overriding the non-capitalized one.

*** I still think it is wisest to clean up your input data, the additional effort notwithstanding. ***

Again, I see your point, and I was thinking that I could do that with all my *old* files with citations using awk/sed and grep. (For *new* files I don't have this problem, because when I add a citation from ISI Web, I decapitalize the author names.) However, in general, if this problem can be fixed with a relatively small effort and without making addref significantly slower, it should be fixed. It is not a big issue, sure; maybe we should call it a "feature request".

Anyway: I have used RefDB since 2006 and I like it. Thank you for your work!

----------------------------------------------------------------------

Comment By: Markus Hoenicka (mhoenicka)
Date: 2010-01-19 22:57

Message:
example3.ris is misleading in your case because "Stuhrmann,H.B." and "Stuhrmann,Heinrich B." must be treated as separate entries by RefDB because only a human being can decide whether these are the same persons, regardless of capitalization.
If you expect the database to treat these strings as the same author, you'd have to maintain your input files accordingly and settle on one version, preferably the one with the first name spelled out.

This leaves the problem of "STUHRMANN,H.B." apparently outsmarting "Stuhrmann,H.B.". When adding a reference entry, refdbd tries to find existing authors in the database. If the author already exists, refdbd simply sets a "link" to that author. Only if it doesn't find an existing author is a new author entry created. Testing for duplicates is done by a SQL expression using the "=" operator. Apparently some database engines treat this as a case-insensitive string comparison. I can reproduce your results with MySQL, but not with PostgreSQL.

I have to admit that I've never figured this might pose a problem. I still think it is wisest to clean up your input data, the additional effort notwithstanding. In order to fix this problem programmatically, I'd have to replace the "=" comparison with something that causes a case-sensitive comparison on all database engines.

regards,
Markus

----------------------------------------------------------------------
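The find-or-link behaviour described in the message above can be reproduced in miniature. The sketch below uses Python's built-in sqlite3 module, with `COLLATE NOCASE` standing in for a database engine whose "=" compares strings case-insensitively. The table `t_author`, its columns, and the functions `make_db`, `add_author` and `stored_names` are hypothetical names chosen for this illustration; they are not RefDB's actual schema or code.

```python
import sqlite3


def make_db(case_sensitive):
    """Create an in-memory author table (hypothetical schema, not RefDB's)."""
    collation = "BINARY" if case_sensitive else "NOCASE"
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE t_author ("
        "author_id INTEGER PRIMARY KEY, "
        f"author_name TEXT COLLATE {collation})"
    )
    return con


def add_author(con, name):
    """Return the id of an existing author equal to `name`, else insert one."""
    row = con.execute(
        "SELECT author_id FROM t_author WHERE author_name = ?", (name,)
    ).fetchone()
    if row:
        return row[0]  # link to the existing entry, whatever its spelling
    cur = con.execute("INSERT INTO t_author (author_name) VALUES (?)", (name,))
    return cur.lastrowid


def stored_names(con, names):
    """Add each name in order and report the spelling each one is linked to."""
    ids = [add_author(con, n) for n in names]
    query = "SELECT author_name FROM t_author WHERE author_id = ?"
    return [con.execute(query, (i,)).fetchone()[0] for i in ids]


names = ["STUHRMANN,H.B.", "Stuhrmann,H.B."]
print(stored_names(make_db(case_sensitive=False), names))
print(stored_names(make_db(case_sensitive=True), names))
```

With the case-insensitive collation both references end up linked to the spelling that was stored first. With the case-sensitive collation two author rows are created, which is the normalization concern raised in the 2010-01-20 comment.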