[Refdb-users] "reversibility" patch
From: Marc H. <mar...@fr...> - 2004-01-05 22:44:30
> > I'll be happy to add a section to the docs in all caps and a red box
> > around it stating that author names will be normalized for the sake of
> > consistency.

I could not find this yet in
<http://refdb.sourceforge.net/manual-0.9.4/book1.html>

Explaining "how" they are normalized also seems rather vital to me.

> > > OK: I suggest one *extremely* simple improvement to this code: the
> > > ability to disable it, at least at configure time (I will code this
> > > for myself in any case).

> > Otherwise this is an example of the beauty of free software. If you
> > code this for yourself, everyone can have it his way.

It's done. See: <http://marc.herbert.free.fr/refdb/reversible/> or
below/attached. Comments welcome (including from you, Markus :-)

BTW, while testing and comparing, I found some quirks that do not seem
to fit _any_ logic (as opposed to: not fit my taste).

Cheers,

Marc.

----------------------------------------------------------

The "reversible" refdb patch

Marc Herbert
$Date: 2004/01/05 21:30:50 $
$Revision: 1.2 $

---- The issue ----

Currently, refdb tries to "normalize" the authors' names entered in the
database, in order to avoid false duplicates and maybe to cope with
weird requirements of some bibliographic stylesheets. This means
fiddling with full stops and so-called "middlenames".

I think refdb should either reliably perform this normalization
according to a documented, reviewed and formal specification -- or not
at all. Today it does it in an undocumented way, silently modifying
some user data with potential information loss in corner cases.

This (short and simple) refdb patch disables all modifications of
user data, and lets the user decide by himself how names should be
"normalized" (assuming it's both desirable and possible). Thanks to
it, what gets _in_ refdb, gets _out_ untouched.

For instance, if you enter "Harry S Truman" in refdb, you would get
back:
 - without this patch: "Harry S. Truman"
 - with this patch:    "Harry S Truman"  (amazing!
and "reversible"...)

Warning: this patch may or may not break further formatting by some
bibliographic stylesheets, depending on whether they expect
"normalized" names from the database. I do not care much about
breaking stylesheets that want you to change the way you write your
name (probably in a more "English" way). I do not mind if they munge
names when formatting for publication, but pushing this
"normalization" up to the database is not acceptable to me. After all,
respectful and less rigid formatting tools also (co-)exist. The answer
to this question is likely to be in the following function:
  backend-dbiba.c:format_firstmiddlename()

By the way, be aware that you should NOT use spaces at the beginning
or at the end of RISX <name>(s), since this will lead to false
duplicates in the database _independently from this patch_. On the
other hand, RIS input (AU - field) is more or less space-insensitive.

This patch is compatible with version 0.9.4-pre3, and _not_ with
version 0.9.3.

Users who are (still...) satisfied with the current refdb behaviour,
and thus not directly interested in this patch, may still want to
understand how their data is modified; just having a look at this
patch will provide detailed answers. The summary of changes just below
also explains it (in English instead of C).

This patch also disables middlename(s) input in the RIS format, due to
a flawed RIS input syntax, and due to their controversial nature (see
http://sourceforge.net/mailarchive/forum.php?forum_id=1798&viewmonth=200312);
all RIS "given names" go together untouched into the "firstname"
database field.

On the other hand, RISX <middlename>s are not disabled by this patch.
To disable middlenames in RISX, just... don't use the tag <middlename>.
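
The RIS rule summarized above (whole given name into "firstname", no
fiddling with full stops, no middlenames) can be sketched in a few
lines of Python. This is only an illustration of the described
behaviour, not the patch itself (which is C code inside refdb):
split_ris_author() and AuthorRow are hypothetical names, and any other
space handling refdb may apply inside the given name is not modelled.

```python
# Hypothetical sketch, NOT refdb code: the names below are made up
# purely to illustrate the patched RIS behaviour described in the text.
from typing import NamedTuple


class AuthorRow(NamedTuple):
    """The four author fields the text says the SQL database stores."""
    fullname: str
    lastname: str
    firstname: str
    middlenames: str


def split_ris_author(au_value: str) -> AuthorRow:
    """Split the value of a RIS author field without altering the name.

    The text before the first comma is the last name. The text between
    the first and second comma is the given name, stored whole in
    'firstname'. Full stops are left alone and 'middlenames' stays empty.
    Anything after a second comma is ignored in this sketch.
    """
    parts = au_value.split(",")
    lastname = parts[0].strip()
    given = parts[1].strip() if len(parts) > 1 else ""
    fullname = f"{lastname},{given}" if given else lastname
    return AuthorRow(fullname, lastname, given, "")


if __name__ == "__main__":
    print(split_ris_author("Truman, Harry S"))
```

For "Truman, Harry S" and "Smith, F.M.N." this yields the same
fullname, lastname and firstname values as the "patched" lines of the
RIS examples in the detailed section.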

---- Detailed issues and modifications ----

The SQL database uses 4 (redundant) fields to store author names:
  fullname, lastname, firstname, middlenameS

__________________________
Modifications to RIS input

(i.e., "addref -t ris")

firstname/middlenames parsing is disabled.
 - the patch disables fiddling with full stops.
 - middlenames are disabled: inside the AU field, the whole "given
   name", as delimited by commas, goes into the "firstname" database
   field.

RIS input examples
  Smith, F.M.N.
  Chu, H.K. Jerry
  Truman, Harry S

-> database results
              fullname           lastname  firstname    middlenames
  official :  "Smith,F.M.N."     "Smith"   "F"          "M N"
  patched  :  "Smith,F.M.N."     "Smith"   "F.M.N."

  official :  "Chu,H.K.Jerry"    "Chu"     "H"          "K Jerry "
  patched  :  "Chu,H.K.Jerry"    "Chu"     "H.K.Jerry"

  official :  "Truman,Harry S."  "Truman"  "Harry"      "S "
  patched  :  "Truman,Harry S"   "Truman"  "Harry S"

(also notice the spurious space ending some middlenames with the
official version).

___________________________
Modifications to RISX input

(i.e., "addref -t risx")

 - full stops "tricks" are disabled

RISX input examples
  "Smith" "F." "M." "N."
  "Truman" "Harry" "S"
  "Chu" "H.K." "Jerry"

-> database results
              fullname           lastname  firstname  middlenames
  official :  "Smith,F.M.N."     "Smith"   "F"        "M N"
  patched  :  "Smith,F. M. N."   "Smith"   "F."       "M. N."

  official :  "Truman,Harry S."  "Truman"  "Harry"    "S"
  patched  :  "Truman,Harry S"   "Truman"  "Harry"    "S"

  official :  "Chu,H.Jerry"      "Chu"     "H"        "Jerry"   (information loss!)
  patched  :  "Chu,H.K. Jerry"   "Chu"     "H.K."     "Jerry"

_______
Outputs

No output except bibtex's is modified.

RIS output dumps "as is" the first field of the SQL database
(fullname).

RISX output uses the 3 other fields (last, first, middles). It dumps
lastname and firstname untouched, then parses the "middlenames" field
according to spaces before dumping the <middlename> elements.

The patch modifies neither RIS nor RISX output.

Most other outputs also work one way or the other, and are not
modified by the patch.
However, for some unknown reason, bibtex output pulls the fullname
from the database and parses it again, so a small patch was needed
here too, to prevent the addition of full stops.

__________
Convertors

The "nmed2ris" convertor also fiddles with authors' names in a similar
way. I cannot say more about this yet, sorry: I do not use the MED
format at all and could not have tested any modification.

________
Feedback

Since all this is unfortunately complicated, the probability that I
missed something despite all my efforts is non-zero. I thank you in
advance for any feedback.

___________________________
The art of Unix Programming

Some food for thought from:
<http://catb.org/~esr/writings/taoup/html/ch01s06.html>

Rule of Transparency: design for visibility to make inspection and
debugging easier.

  For a program to demonstrate its own correctness, it needs to be
  using input and output formats sufficiently simple so that the
  proper relationship between valid input and correct output is easy
  to check.

Rule of Least Surprise: In interface design, always do the least
surprising thing.