[Refdb-users] "reversibility" patch
From: Markus H. <mar...@mh...> - 2004-01-06 15:40:19
Marc Herbert writes:
>
> > > I'll be happy to add a section to the docs in all caps and a red box
> > > around it stating that author names will be normalized for the sake of
> > > consistency.
>
> I could not find this yet in
> <http://refdb.sourceforge.net/manual-0.9.4/book1.html>
>
> Explaining "how" they are normalized also seems rather vital to me.
>

Sorry. Didn't get round to it yet.

> BTW, while testing and comparing, I found some quirks that do not seem
> to fit _any_ logic (as opposed to: not fit my taste).
>

In this case you should document them and file a bug report instead of
spreading FUD.

> ----------------------------------------------------------
>
> The "reversible" refdb patch
>
> Marc Herbert
> $Date: 2004/01/05 21:30:50 $
> $Revision: 1.2 $
>
>
> ---- The issue ----
>
> Currently, refdb tries to "normalize" authors' names input in the
> database, in order to avoid false duplicates and maybe to cope with
> weird requirements of some bibliographic stylesheets. This means
> fiddling with full stops and so-called "middlenames".
>
> I think refdb should either reliably perform this normalization
> according to a documented, reviewed and formal specification
> -- or not at all. Today it does it in an undocumented way,
> silently modifying some user data with potential information loss in
> corner cases.
>

The purpose of the name mangling is to reduce all names consistently
to the RIS input format. This is currently the common denominator of
both RIS and RISX input until a richer data format like MODS is
implemented. If the name mangling is not consistent, then it is a bug
that needs to be fixed, not a feature that needs to be removed.

The bottom line is: if you supply your RIS data according to the RIS
input format, they won't be fiddled with at all. If you use a
different format, e.g. by leaving out periods or by adding random
spaces, RefDB attempts to mangle the data until they fit the RIS input
format.
This works in many cases, but may fail in borderline cases.

The important thing to understand is that the dots and spaces used in
the RIS input format do not have anything to do with the final
representation of a name in a formatted bibliography. The sole purpose
of the dots and spaces is to separate the name parts in order to tell
the parser where to chop. You could use slashes or question marks just
as well. As it is the job of bibliography software to output the
author names in all possible formatting variations, it is essential
not to store pre-formatted data in the database. However, it may be
useful (see below) to store pre-parsed data.

The same principle basically applies to the RISX input format.
However, the RISX format provides separate elements for the name
parts, so there is no need for textual separators at all. There is no
point in entering a middle initial as <middlename>B.</middlename>. The
middle initial is "B", not "B.". "B." is a representation of a middle
name which is used in some bibliography styles (others don't use the
dot or leave out the middle name altogether) and can be trivially
generated from "B". Therefore, a <middlename>B</middlename> is all you
need. If RefDB detects the superfluous dot, it will remove it.

> This (short and simple) refdb patch disables all modifications of
> user-data, and lets the user decide by himself how names should be
> "normalized" (assuming it's both desirable and possible).
> Thanks to it, what gets _in_ refdb, gets _out_ untouched.
> For instance, if you enter "Harry S Truman" in refdb, you would get back:
> - without this patch: "Harry S. Truman"
> - with this patch: "Harry S Truman" (amazing! and "reversible"...)
>

Now we get to the purpose of normalization. As stated above, the data
in the AU field of a RIS dataset or an <author> element are not
strings that are supposed to be inserted into a bibliography as they
are.
They are input formats that supply data (the name parts) for one
object in the database (an author). If an author has several reference
entries in the database, these entries must link to the same object
(the author), not to a specific representation of the author's name.
Assume the following cases:

    Truman,Harry S.
    Truman,Harry S
    Truman, Harry S
    Truman, Harry S.

The first one is what the RIS input format asks for. The others aren't
that different except for a space or a dot here and there. If these
belong to four references among 100, you probably wouldn't even notice
that the author names are written differently, although it is clear
that they mean the same author. If you add these four datasets to
RefDB, the first entry won't be mangled at all (as it sticks to the
rules). The other entries are normalized, and as a consequence, all
four references link to the same author. The normalized internal
representation of the author name is "Truman,Harry S." (amazing! and
"reversible"...).

If you go ahead and prevent this normalization, the four references
will point to four different author objects, one with the
representation "Truman,Harry S.", another one with the representation
"Truman,Harry S", and so forth. If you now run a query for references
by some "Truman,Harry S.", you'll miss 75% of the possible hits. This
is not good. You can obviously work around this weakness of the patch
by running all queries against regular expressions, but this is not an
option if you design a simplified interface that allows users to pick
names from a list (something Mike is currently working on).

> Warning: this patch may or may not break further formatting by some
> bibliographic stylesheets, depending if they expect "normalized" names
> from the database. I do not care much about breaking stylesheets that
> want you to change the way you write your name (probably in a more
> "english" way).
> I do not mind if they munge names when formatting for
> publication, but pushing this "normalization" up to the database is
> not acceptable to me. After all, respectful and less rigid formatting
> tools also (co-)exist. The answer to this question is likely to be in
> the following function: backend-dbiba.c:format_firstmiddlename()

This is the key point why we have to argue at all. You do not
understand that the database does not contain a formatted string that
shows how you would like to see your name printed on a piece of paper.
The database contains the name parts, plus a normalized representation
for speeding up queries that happens to look like some formatted
representation. When creating a bibliography, RefDB then has to
assemble the name parts in a fashion that matches the requirements of
the publisher. It is irrelevant how the cited author or the author
writing the paper would like to represent that name.

>
> By the way, be aware that you should NOT use spaces at the beginning
> or at the end of RISX <name>(s), since this will lead to false
> duplicates in the database _independently from this patch_. On the
> other hand, RIS input (AU - field) is more or less space-insensitive.
>

The RIS input is insensitive to leading and trailing spaces as the
latter are basically invisible in this input format. I did not
anticipate that anyone would add stray spaces to XML elements as they
are easily detected, but if this is a common problem it could be
handled just as well.

>
>
> The SQL database uses 4 (redundant) fields to store author names:
> fullname, lastname, firstname, middlenameS
>

The columns are not redundant. Redundancy implies that they hold the
same information but this is not the case. author_lastname,
author_firstname, and author_middlename hold the pre-parsed name parts
which are different by definition. The author_name field holds the
normalized representation of the full name or a corporate name.
The latter doesn't have name parts but it can't go into e.g.
author_lastname either as we have to distinguish between authors that
have only one name and corporate names. The only redundancy in this
setup is that a non-corporate name could be assembled from the name
parts. However, author names are usually added once and then queried
each time someone requests a reference or a bibliography containing
that name. For the sake of speed it makes sense to parse the name once
(when you add it) instead of each time it is retrieved.

> __________________________
> Modifications to RIS input
> (i.e., "addref -t ris")
>
[...]
> RIS input examples
>
> Smith, F.M.N.
> Chu, H.K. Jerry
> Truman, Harry S
>
> -> database results
>
> official : "Smith,F.M.N." "Smith" "F" "M N"
> patched  : "Smith,F.M.N." "Smith" "F.M.N."
>
> official : "Chu,H.K.Jerry" "Chu" "H" "K Jerry "
> patched  : "Chu,H.K.Jerry" "Chu" "H.K.Jerry"
>
> official : "Truman,Harry S." "Truman" "Harry" "S "
> patched  : "Truman,Harry S" "Truman" "Harry S"
>
> (also notice the spurious space ending some middlenames with the
> official version).

These spaces are due to a bug introduced after adding support for
multiple middle names. Fixed in CVS. Please note that the last output
of the patched version does not follow the RIS specs, therefore it is
not clear whether RefMan, EndNote and the like import this properly.

>
>
> ____________________________
> Modifications to RISX input
> (i.e., "addref -t risx")
>
> - full stops "tricks" are disabled
>

As stated above, you should not use periods anyway as they are not
required. Following this simple rule will make most of your complaints
obsolete.

> RISX input examples
>
> "Smith" "F." "M." "N."
> "Truman" "Harry" "S"
> "Chu" "H.K." "Jerry"
>
> -> database results
>
> official : "Smith,F.M.N." "Smith" "F" "M N"
> patched  : "Smith,F. M. N." "Smith" "F." "M. N."
Whether or not to use spaces after initials is a formatting issue that
is handled by the bibliography style. A period is enough as a
separator for the internal representation. The spaces are redundant
and bloat the data without a reason.

>
> official : "Truman,Harry S." "Truman" "Harry" "S"
> patched  : "Truman,Harry S" "Truman" "Harry" "S"
>

Again, the patched output may not be readable by other tools using RIS.

> official : "Chu,H.Jerry" "Chu" "H" "Jerry" (information loss!)
> patched  : "Chu,H.K. Jerry" "Chu" "H.K." "Jerry"
>

Please provide the RISX input that you used for this example. The
following input works just fine for me without any loss of data:

    <author>
        <lastname>Chu</lastname>
        <firstname>H</firstname>
        <middlename>K</middlename>
        <middlename>Jerry</middlename>
    </author>

(the markup is odd but RISX currently does not support something like
a "prime" given name which is not in the first position, as in
"M. Steven Miller". RIS does not support this either, so this will be
handled properly only by the forthcoming MODS-like data model)

Please note that in the official examples given above, most of the
output is correct although an improper input format was used. This is
what normalization is all about. The only problem that I've come
across while looking at these examples is that the current
implementation does not handle abbreviated double names very well.
"Schleifer,Karl-Heinz" is ok, but "Schleifer,K.-H." will cause
problems to the best of my knowledge. I'll look into this and fix it
if necessary.

> However, for some unknown reason, bibtex output pulls the fullname
> from the database and parses it again, so a small patch was needed
> here again to prevent the addition of full stops.
>

The "unknown reason" is negligence. I haven't heard positively of
anyone using the bibtex output, so this gets somewhat less attention
than it should.
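
To make the normalization discussed above concrete, here is a minimal
Python sketch. It is not RefDB's code (the real logic lives in the C
sources and may differ); the function name normalize_ris_author and
its rules are assumptions chosen only to reproduce the "official"
results quoted in this thread: dots and spaces are treated purely as
separators, single letters are written back as initials with a period,
and full given names are followed by a space only if another part
follows.

```python
import re


def normalize_ris_author(raw):
    """Hypothetical helper: return (full, last, first, middles).

    Illustrative only; this is NOT the RefDB implementation.
    """
    # Everything before the first comma is the last name.
    last, _, given = raw.partition(",")
    last = last.strip()

    # Dots and spaces merely separate the name parts; drop them.
    parts = [p for p in re.split(r"[.\s]+", given) if p]
    first = parts[0] if parts else ""
    middles = parts[1:]

    # Reassemble the RIS-style representation from the bare parts.
    pieces = []
    for i, part in enumerate(parts):
        if len(part) == 1:
            pieces.append(part + ".")   # initial: add the period
        elif i < len(parts) - 1:
            pieces.append(part + " ")   # full name followed by more parts
        else:
            pieces.append(part)         # full name in last position
    full = last + ("," + "".join(pieces) if pieces else "")
    return full, last, first, middles


if __name__ == "__main__":
    for name in ("Truman,Harry S.", "Truman,Harry S",
                 "Truman, Harry S", "Truman, Harry S."):
        print(repr(name), "->", normalize_ris_author(name))
```

With these assumed rules, the four Truman spellings collapse to one
representation, so they would link to a single author. Like the
implementation described above, the sketch makes no attempt to handle
abbreviated double names such as "Schleifer,K.-H.".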
>
> __________
> Convertors
>
> The "nmed2ris" convertor also fiddles with authors' names in a similar
> way. I can not yet say more about this, sorry: I do not use the
> MED format at all and could not have tested modifications.
>

It is clearly stated in the manual that this program is obsolete and
will eventually be removed from the distribution. If at all, have a
look at the med2ris.pl script.

regards,
Markus

--
Markus Hoenicka
mar...@ca...
(Spam-protected email: replace the quadrupeds with "mhoenicka")
http://www.mhoenicka.de