Re: [Refdb-users] "reversibility" patch
From: Marc H. <mar...@fr...> - 2004-01-07 14:26:42
On Tue, 6 Jan 2004, Markus Hoenicka wrote:

> The purpose of the name mangling is to reduce all names consistently
> to the RIS input format. This is currently the common denominator of
> both RIS and RISX input until a richer data format like MODS is
> implemented.

This is one of the core issues indeed (thanks for starting with it :-)

The idea of trying to avoid false duplicates is great, even if it will
never be 100% reliable and will always depend in part on the human
typist (think for instance about: abbreviated or full input?).

But using the RIS input format to implement it is so wrong to me that I
prefer to forget about it for the moment. It's a tradeoff, and you
surely can understand that I have a different perspective than you about
it. Knowing in advance that our opinions differ, I posted this patch on
my web page instead of in some sourceforge tracker. I can't see any FUD
in this, just a different use of your software.

In some "MODS" future, if a name reduction format/scheme that I trust is
available, then I will be happy to give it my data. Meanwhile, I prefer
to keep it intact, since it does not fit into the RIS format.

> If the name mangling is not consistent, then it is a bug
> that needs to be fixed, not a feature that needs to be removed.

Great! Unfortunately, if the syntax of the target format (RIS) is flawed
from the start, you cannot achieve full consistency, whatever your
efforts are. Moreover, even "half-consistency" becomes harder and prone
to overcomplicated code and bugs, as we see.

> The bottom line is: if you supply your RIS data according to the RIS
> input format, they won't be fiddled with at all. If you use a
> different format, e.g. by leaving out periods or by adding random
> spaces, RefDB attempts to mangle the data until they fit the RIS input
> format. This works in many cases, but may fail in border cases.

This is crystal clear. Now my point: I care about border cases, and I
don't care about false duplicates.
So I disable mangling. Simple! This is a bit exaggerated, but you get
the point.

By the way, about border cases:

<http://catb.org/~esr/writings/taoup/html/ch01s06.html>

  Rule of Repair: Repair what you can -- but when you must fail, fail
  noisily and as soon as possible.

  Software should be transparent in the way that it fails, as well as
  in normal operation. It's best when software can cope with
  unexpected conditions by adapting to them, but the worst kinds of
  bugs are those in which the repair doesn't succeed and the problem
  quietly causes corruption that doesn't show up until much later.

Sorry, I do not have so many different.. references :-)

> The important thing to understand is that the dots and spaces used in the
> RIS input format do not have anything to do with the final
> representation of a name in a formatted bibliography.
> The sole purpose of the dots and spaces is to separate the name
> parts in order to tell the parser where to chop.

The important thing to understand is that dots may be meaningful to some
author name in some language (including English), so they are not far
from the worst separator ever.

> You could use slashes or question marks just as well.

If the designer of the RIS format had had some clue, he could have
spared us a lot of discussion and time.

> As it is the job of a bibliography software to output the
> author names in all possible formatting variations, it is essential
> not to store pre-formatted data in the database.
> However, it may be useful (see below) to store pre-parsed data.

Great, something we agree about! :-)

> The same principle basically applies to the RISX input
> format. However, the RISX format provides separate elements for the
> name parts, so there is no need for textual separators at all. There
> is no point to enter a middle initial as
> <middlename>B.</middlename>. The middle initial is "B", not "B.". "B."

The middlename may be "B" without being an initial.
More generally, the existence or the non-existence of the dot may be
information that some refdb user does not want to lose, at least not in
the database (even if he does not care about some formatted output).

> is a representation of a middle name which is used in some
> bibliography styles (others don't use the dot or leave out the middle
> name altogether) and can be trivially generated from "B". Therefore, a
> <middlename>B</middlename> is all you need. If RefDB detects the
> superfluous dot, it will remove it.

I am really hopeless about making you understand how and why I disagree
with: "the superfluous dot". Can't you just accept it as a fact? I also
disagree with the "middlename" concept, but that is another story :-)

> This is the key point why we have to argue at all. You do not
> understand that the database does not contain a formatted string that
> shows how you would like to see your name printed on a piece of
> paper.
> The database contains the name parts, plus a normalized
> representation for speeding up queries that happens to look like some
> formatted representation. When creating a bibliography, RefDB then has
> to assemble the name parts in a fashion that matches the requirements
> of the publisher.
> It is irrelevant how the cited author or the author
> writing the paper would like to represent that name.

You really do not understand that, if some information is lost in this
great process, whatever its noble purpose is, some author may _never_
see his name printed as he would like to, even when some stylesheets
allow it.

I will suggest in a next message (this one is already too long and too
chaotic, and I still have to think a bit about it) a better solution
that may please everyone (no more trade-off). Assuming you understand
that I have slightly different refdb needs, so we can discuss it.

> > __________________________
> > Modifications to RIS input
> > (i.e., "addref -t ris")
> >
> > [...]
> >
> > RIS input examples
> >
> > Smith, F.M.N.
> > Chu, H.K. Jerry
> > Truman, Harry S
> >
> > -> database results
> >
> > official : "Smith,F.M.N." "Smith" "F" "M N"
> > patched  : "Smith,F.M.N." "Smith" "F.M.N."
> >
> > official : "Chu,H.K.Jerry" "Chu" "H" "K Jerry "
> > patched  : "Chu,H.K.Jerry" "Chu" "H.K.Jerry"
> >
> > official : "Truman,Harry S." "Truman" "Harry" "S "
> > patched  : "Truman,Harry S" "Truman" "Harry S"
> >
>
> Please note that the last output of the patched version does not
> follow the RIS specs, therefore it is not clear whether RefMan,
> EndNote and the like import this properly.

Thanks for this clarification. I personally do not care. I obviously
never claimed to stay compatible with a format while arguing it is
flawed. Once again, that's the reason I did not even try to put this
patch in some sourceforge tracker.

> As stated above, you should not use periods anyway as they are not
> required. Following this simple rule will make most of your complaints
> obsolete.

Unfortunately, this simple "no-periods" rule is not acceptable to me.
Please do not forget to put it in the documentation, I really think it
is important.

> > RISX input examples
> >
> > "Smith" "F." "M." "N."
> > "Truman" "Harry" "S"
> > "Chu" "H.K." "Jerry"
> >
> > -> database results
> >
> > official : "Smith,F.M.N." "Smith" "F" "M N"
> > patched  : "Smith,F. M. N." "Smith" "F." "M. N."

> Whether or not to use spaces after initials is a formatting issue that
> is handled by the bibliography style. A period is enough as a
> separator for the internal representation. The spaces are redundant
> and bloat the data without a reason.

A period is not a decent separator, since it may be part of user data.
Period.

> > official : "Chu,H.Jerry" "Chu" "H" "Jerry" (information loss!)
> > patched  : "Chu,H.K. Jerry" "Chu" "H.K." "Jerry"
> >
>
> Please provide the RISX input that you used for this example.

It's just above (below "RISX input").
I used double quotes " instead of XML <tags>, to avoid clutter.

> The following input works just fine for me without any loss of data:
>
> <author>
>   <lastname>Chu</lastname>
>   <firstname>H</firstname>
>   <middlename>K</middlename>
>   <middlename>Jerry</middlename>
> </author>

It does not work, because the firstname is: "H.K.", while the nickname
is "Jerry" (a "nickname" which is by the way a bit far from a so-called
"middlename"... anyway)

It seems you cannot express "H.K." with the RIS syntax, since it uses
the period as a separator. What we see here is the combination of a
culture-specific concept (middlename) with a flawed syntax (period as
separator).

Maybe you should inform the author he mistyped his name, since it does
not conform to the RIS syntax. And oh no, please do not tell me about
the ugly and overcomplicated: "H.-K."...

> Please note that in the official examples given above, most of the
> output is correct although an improper input format was used. This
> is what normalization is all about.

You explained just above that the main and noble purpose of
normalization is to avoid false duplicates. But here you go much further
since you:

- normalize the data from the typist completely and irreversibly, and so
  the output.
- even ask him to make his _input_ RIS-compliant.

> The only problem that I've come across while looking at these
> examples is that the current implementation does not handle
> abbreviated double names very well. "Schleifer,Karl-Heinz" is ok,
> but "Schleifer,K.-H." will cause problems to the best of my
> knowledge. I'll look into this and fix it if necessary.

My patch does a simple fix to this: it drops the period as a separator,
using only spaces. That's all. I know it's not RIS-compliant anymore,
but I do not care, since I never used it and never will, since it is
flawed. And I manage false duplicates by hand, which admittedly sucks,
but hey, this is only a one-page long patch.
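
To make the difference concrete, here is a rough Python sketch. It is
not the patch itself and not RefDB's actual C code; both function names
are made up for illustration. It compares a splitter that treats periods
and spaces as separators with one that splits on spaces only:

```python
import re


def split_on_periods_and_spaces(given):
    """Hypothetical splitter: '.' and whitespace both separate name parts.

    Periods are discarded, so the original spelling cannot be recovered.
    """
    return [part for part in re.split(r"[.\s]+", given) if part]


def split_on_spaces(given):
    """Hypothetical splitter: only whitespace separates name parts.

    Periods stay in the data exactly as they were typed.
    """
    return given.split()


if __name__ == "__main__":
    # Given-name strings taken from the examples discussed in this thread.
    for given in ("F.M.N.", "H.K. Jerry", "Harry S", "K.-H."):
        print(repr(given),
              split_on_periods_and_spaces(given),
              split_on_spaces(given))
```

With "H.K. Jerry", the first function returns ['H', 'K', 'Jerry'] and
the second returns ['H.K.', 'Jerry']. With "K.-H.", the first returns
['K', '-H'], while the second leaves ['K.-H.'] untouched.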
By the way, the only specification about RIS names syntax I could find
is here:

<http://www.refman.com/support/risformat_tags_02.asp>

and it says nothing about periods nor middlenames. Do you have a better
reference? and... publicly available?

Thanks for taking the time to answer, and thanks again for refdb.
Quoting you to conclude:

> Otherwise this is an example of the beauty of free software. If you
> code this for yourself, everyone can have it his way.

Cheers,
Marc.