Thread: [Refdb-users] Re: The case against <middlename>
|
From: Marc H. <mar...@en...> - 2003-12-11 10:02:32
|
On Wed, 10 Dec 2003, Markus wrote:
> > I am NOT asking to remove the concept of <middlename> down to every
> > refdb line code: I am just suggesting to postpone this concept to the
> > rendering stage, so it does not spoil the data model.
> >
>
> It does not spoil the data model to use a human brain for the parsing
> of names.
Right, except that using the RIS syntax implies using a buggy
_automated_ algorithm to do this parsing.
> An unparsed name string is spoilt data.
"unparsed" can obviously not be "spoilt"... but only "rough".
> Marc Herbert writes:
> > Let me reformulate: "lack of detail is better than wrong details". No
> > information is lost by storing all "given" names in <firstname> and
> > not parsing them.
> You lose the information that a human brain can put into parsing the
> name string, using cultural background information that is hard if not
> impossible to teach to a machine.
I am glad to hear this! Then fix the _automated_ RIS parsing/syntax by
adding a comma to it?
> > A style sheet that mandates the use of "middlename" is, to put it
> > mildly, "culture-specific". If it insists on this, then it should be
> > able to extract this information _by itself_, and not spoil the
> > global data model because of this peculiarity. It seems this is
> > exactly how BibTeX's stylesheets work. References given in a previous
> > message seem to show that other formats do it the same way.
> Once again, go complain to the publishers of roughly 5000 journals in
> the life sciences.
I did not know that the publishers of 5000 life sciences journals were
so English-centric and ignorant of foreign cultures. This bug is quite
amazing.
> I also believe that your argument is moot that if a
> style requires the concept of middle names it should be able to
> retrieve the middle name by itself. With the same argument you could
> dump entirely unparsed strings in any order onto a bib software and
> expect it to figure out how to parse it, as it requires to distinguish
> between given and family names, titles and suffixes.
If I remember well, this discussion is about the right level of detail
to adopt and where. So I find "With the same argument you could dump
entirely unparsed strings" not very constructive.
> This simply expresses your dislike of middle names.
^^^^ !
Please go complain to the publishers of most of the truly
international journals (except life sciences), and to the designers of
all bibliographic formats I've seen (except risx).
I started to "dislike" middlenames only after doing research and
understanding that almost no one uses them.
> > I think this *requirement* is more or less flawed. The more
> > reformatting it requires, the more flawed it is, since the more
> > (wrong) assumptions it will make concerning "name standardization"
> > (i.e., that everybody should have a name that is american-english
> > looking). The worst assumption is of course the requirement of a
> > <middlename>. Assumptions about dots are also flawed, see for
> > instance: <http://www.delorie.com/users/dj/>
> Once again, I didn't invent these requirements. I have to support them
> if I want to support the 5000+ journals in the life sciences.
> > In any case, these dirty issues should not spoil the data model, they
> > should be (and can be!) postponed and solved by the stylesheets
> > _themselves_. So mistakes appear only in some printings, and there
> > are no irreversible mistakes in the data source.
> I don't think it is a brilliant idea to have each of 700+ stylesheets
> (if we consider only the life sciences for a moment) parse and munge
> the names by themselves. Code duplication and bloating would be
> inevitable. I'd rather have stupid simple stylesheets that use the
> preparsed names from the application.
This life-science-specific "middlename parsing" could be factored out
without being pushed down into the database. So refdb could be used
internationally without bugs and hassles: just by working around it.
Why not add a "-[no]middlename" option for outputs?
Same thing for the "clever" abbreviating code.
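To make the suggestion concrete, here is a minimal sketch of such an
output-side switch. It is an illustration only and not RefDB code: the
Author class, render_author() and its middlename flag are hypothetical
names. The given names are stored exactly as entered, and the split into
"first" and "middle" happens only when a style asks for it.

```python
from dataclasses import dataclass


@dataclass
class Author:
    """Hypothetical record: all given names are stored as one raw string."""
    family: str
    given: str


def render_author(author: Author, middlename: bool = False) -> str:
    """Render 'Family, Given'; derive middle initials only at output time."""
    parts = author.given.split()
    if not middlename or not parts:
        # Default: print the given names exactly as they were stored.
        return f"{author.family}, {author.given}".rstrip(", ")
    first, middle = parts[0], parts[1:]
    # A style that wants middle initials computes them here, not in the data.
    initials = "".join(f" {name[0]}." for name in middle)
    return f"{author.family}, {first}{initials}"


fdr = Author("Roosevelt", "Franklin Delano")
plain = render_author(fdr)
styled = render_author(fdr, middlename=True)
```

Whatever the style does, fdr.given is left untouched, so a mistake made by
the splitting rule shows up in one printing and not in the data source.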
> > The rationale is here: if middlenames should be kept in the data model
> > (sigh), have at least only simple, perfectly reversible data
> > transformations in database operations. No dots that magically
> > appear or disappear, no variable number of tokens, etc. It's always
> > time to do this at the formatting step.
> That's too late as I pointed out elsewhere. You need the normalization
> when you enter the data into the database to have a consistent and
> reliable way to search names.
No, it's not too late: you can also play the same game with dots and
spaces later at search/formatting time, without subtly and silently
modifying the data that the user intentionally entered; that really is
losing information.
It's just about where this "clever" code sits.
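A minimal sketch of what "the same game at search time" could look like,
assuming a hypothetical search_key() helper (nothing here is taken from
RefDB): names are compared through a derived key that ignores periods,
spaces and case, while the stored strings stay byte-for-byte as entered.

```python
import re


def search_key(name: str) -> str:
    """Derived comparison key: drop periods and whitespace, fold case."""
    return re.sub(r"[.\s]+", "", name).casefold()


def find_authors(stored: list[str], query: str) -> list[str]:
    """Return the stored names whose key equals the key of the query."""
    key = search_key(query)
    return [name for name in stored if search_key(name) == key]


stored_names = ["Miller,A.M.", "Miller,A. M.", "Miller,AM", "Delorie,DJ"]
hits = find_authors(stored_names, "Miller, A.M.")
```

The key could just as well be computed once at import and kept in a second
column next to the original string, which would avoid recomputing it for
every query; that is a design choice this sketch leaves open.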
> > ... and this normalization is too complex to be automated, since
> > no program can correctly handle all particular cases, thus it should
> > be manually carried out by operators.
> > I guess this is already the way it goes in most real cases today?
> >
>
> So if you want to import 100 references that a nice colleague just
> sent you, you start adding/removing spaces and dots from somewhere
> between 100 and 1000 author names? Problematic as it may be in border
> cases, this is a job that *asks* to be automated.
Yes, but as you said above:
> You lose the information that a human brain can put into parsing the
> name string, using cultural background information that is hard if not
> impossible to teach to a machine.
so maybe the conclusion is that it should be "computer-assisted",
instead of "fully automated" ?
Please never silently and subtly modify user data. At least ask for
confirmation! The real world is too complex for any "clever" name
standardization algorithm.
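As a sketch of what "computer-assisted" could mean, the function below
only proposes a change and applies it if a caller-supplied callback
agrees. The names propose_normalization(), normalize_with_confirmation()
and confirm are hypothetical, and the single rule used (add a period after
a lone ASCII capital) is just an example rule, not RefDB's algorithm.

```python
import re
from typing import Callable


def propose_normalization(given: str) -> str:
    """Example rule: add a period after a stand-alone ASCII capital letter."""
    return re.sub(r"\b([A-Z])\b(?!\.)", r"\1.", given)


def normalize_with_confirmation(
    given: str, confirm: Callable[[str, str], bool]
) -> str:
    """Apply the proposal only if it changes nothing or the user accepts it."""
    proposal = propose_normalization(given)
    if proposal == given:
        return given  # nothing to ask about
    # The operator sees both strings and decides; a refusal keeps the input.
    return proposal if confirm(given, proposal) else given


kept = normalize_with_confirmation("Harry S", lambda old, new: False)
changed = normalize_with_confirmation("Harry S", lambda old, new: True)
untouched = normalize_with_confirmation("DJ", lambda old, new: True)
```

With a batch of 100 imported references, confirm could collect the
proposals and show them in one list instead of asking one by one.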
> If it fails in too many cases, we have to improve the code.
OK: I suggest one *extremely* simple improvement to this code: the
ability to disable it, at least at configure time (I will code this
for myself in any case).
> > But searching for :AU:="Miller,A.*M.*" will give a pretty good result,
> > and reveal to the operator the manual normalization work that must be
> > completed.
> >
>
> This is what a reference manager should avoid at all costs. Why on
> earth should a user be forced to use regular expressions just to find
> references by author names? If this is necessary the data model is
> flawed.
I made a discovery: the real-world data model for international names
is flawed, at least beyond the "family" and "given" name distinction.
Some people even make this more fuzzy by not signing with precisely
the same character strings each time. And worst of all: different
databases try to "standardize" this naming mess... in different ways!
Should we also "normalize" reality for the pleasure of
bibliographers? I prefer not to wait that long, live with it, and
learn to use wildcards while doing name searches; I guess that's what
everyone is already doing today.
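For illustration, this is how the pattern quoted above behaves against a
handful of made-up author strings. The sketch uses Python's re module and
does not reproduce how RefDB itself evaluates queries; the sample names
are invented.

```python
import re

# Invented sample data: several spellings of the same person, plus noise.
authors = [
    "Miller,A.M.",
    "Miller,Arthur M.",
    "Miller,A. M.",
    "Miller,B.",
    "Millerson,A.M.",
]

# In a regular expression "." means "any character" and ".*" "any run of
# characters", so one pattern covers the dotted, spaced and spelled-out forms.
pattern = re.compile(r"Miller,A.*M.*")
matches = [name for name in authors if pattern.fullmatch(name)]
```

The three variants that come back are exactly the entries an operator
would then want to review by hand.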
Cheers,
-- 
Marc A.Yves Herbert :-)
|
|
From: Bruce D'A. <bd...@fa...> - 2004-01-07 20:00:34
|
On Jan 7, 2004, at 2:47 PM, Markus Hoenicka wrote:
> I'm afraid not even this would help. It's not that the publishers
> wouldn't care about the names of the cited authors, but to them a
> consistent bibliography formatting is more important than individual
> wishes.
But the issue at hand is separating out metadata concerns (proper
coding of data) from formatting concerns (proper formatting of data).
They are not the same, and having well-parsed and accurate
authoritative data would certainly help matters.
Bruce
|
|
From: Markus H. <mar...@mh...> - 2004-01-07 20:24:56
|
Bruce D'Arcus writes:
>
> On Jan 7, 2004, at 2:47 PM, Markus Hoenicka wrote:
>
> > I'm afraid not even this would help. It's not that the publishers
> > wouldn't care about the names of the cited authors, but to them a
> > consistent bibliography formatting is more important than individual
> > wishes.
>
> But the issue at hand is separating out metadata concerns (proper
> coding of data) from formatting concerns (proper formatting of data).
> They are not the same, and having well-parsed and accurate
> authoritative data would certainly help matters.
>
Ah, I see. Good point.
regards,
Markus
-- 
Markus Hoenicka
mar...@ca...
(Spam-protected email: replace the quadrupeds with "mhoenicka")
http://www.mhoenicka.de
|
|
From: Marc H. <Mar...@fr...> - 2004-01-09 16:59:06
|
On Wed, 7 Jan 2004, Markus Hoenicka wrote:
> The dot is no information. It is formatting. Please separate data from
> formatting.
Generally speaking, I don't think there is a sharp line between
"formatting" and "information". Most of the time, formatting carries
information; it's a way to represent information. Of course, margin
sizes are quite far from that. But for instance, at the beginning of
most computer books you'll find something like this:
- text using this font <courier> *means* that it is...
- etc.
Would you say that punctuation, for instance, is formatting? Most
people don't call punctuation "formatting", because punctuation is much
more the responsibility of the author than of the publisher;
punctuation is much more about _meaning_ than esthetics. See for
instance:
"Why Learn to Punctuate?"
<http://www.cogs.susx.ac.uk/local/doc/punctuation/node02.html>
(emphasized words by me)
  If your reader has to wade through your strange punctuation, she
  will have trouble following your *meaning*; at worst, she may be
  genuinely unable to understand what you've written. If you think I'm
  exaggerating, consider the following string of words, and try to
  decide what it's supposed to *mean*:
    We had one problem only Janet knew we faced bankruptcy
  Have you decided? Now consider this string again with differing
  punctuation:
    We had one problem: only Janet knew we faced bankruptcy.
    We had one problem only: Janet knew we faced bankruptcy.
    We had one problem only, Janet knew: we faced bankruptcy.
    We had one problem only Janet knew we faced: bankruptcy.
  Are you satisfied that all four of these have completely different
  *meanings*?
So punctuation seems quite far from formatting... if not formatting
then punctuation is data? What is the purpose of punctuation? To
provide structure to sentences and paragraphs. Now what is the purpose
of "formatting" headings and indentation? The same: to represent
structure information, just at a higher level.
So we have on one side of the "classification" line: data, and on the
other side: formatting, but both with the exact same purpose! I find
this weird. The only difference I see is that punctuation has been
standardized for ages, while there are many different ways to format
headings. So maybe we have a classification criterion here:
"standardized" information is data, while "formatting" is choice?
Then let's get back for a second to the use of a period as a means to
indicate an abbreviation: is it "standardized" or "free"? Some
stylesheets want to enforce a standard about this (in one way or the
other). Others do not dare to touch this; they leave the decision to
the author. Should a database be on the "enforcers" side, or stay
neutral?
In a good working relationship between a publisher and an author,
there is no clear demarcation line between the work of each. Oh sure,
the author should never tell the publisher about the sizes of margins,
nor should the publisher ever tell the author about his formulas, but
there are a whole lot of things less clearly separated than that. A
publisher may often correct spelling and grammar (which is also
structure, btw). Is this a "formatting" job?
Of course, the issue at stake here is "just" about names, and not
about formatting in general, so let's forget most of the above.
Nevertheless, I wanted to underline that an opposition between data
and formatting is not "self-evident", generally speaking.
For those who want to know a bit more about "why the human brain can
not cope with unstructured information", I suggest this very famous
article <http://www.well.com/user/smalin/miller.html#recoding>
Cheers,
Marc.
|
|
From: Bruce D'A. <bd...@fa...> - 2004-01-09 17:09:25
|
On Jan 9, 2004, at 11:59 AM, Marc Herbert wrote:
> So punctuation seems quite far from formatting... if not formatting
> then punctuation is data? What is the purpose of punctuation.
Punctuation with respect to bibliographic formatting is different than
otherwise though, because it is often strictly defined in the style
itself. Bibliographic data is more regular than written text...
Bruce
|
|
From: Rich S. <rsh...@ap...> - 2004-01-09 17:28:43
|
On Fri, 9 Jan 2004, Marc Herbert wrote:
> See: DJ Delorie
> <http://www.delorie.com/users/dj/>
> Please note that my legal first name really is "DJ". It is not
> correct to insert a space between the D and the J, or to put periods
> after them as if they were initials, or to make either of them lower
> case. They are not initials, and I have no middle name. Honest.
>
> Many computers will automatically replace DJ with something else,
> thinking that the operator typed it in wrong. They should be
> considered broken, and repaired. I usually encounter these in credit
> card companies and utilities.
Poor guy. Either his parents didn't want a son or they have (had?) a
strange sense of humor. Life's tough enough for a kid without adding to
his burden.
Rich
-- 
Dr. Richard B. Shepard, President
Applied Ecosystem Services, Inc. (TM)
<http://www.appl-ecosys.com>
|
|
From: Markus H. <mar...@mh...> - 2004-01-10 01:28:25
|
Marc Herbert writes:
> We had one problem only Janet knew we faced bankruptcy
>
> Have you decided? Now consider this string again with differing
> punctuation:
>
> We had one problem: only Janet knew we faced bankruptcy.
> We had one problem only: Janet knew we faced bankruptcy.
> We had one problem only, Janet knew: we faced bankruptcy.
> We had one problem only Janet knew we faced: bankruptcy.
>
This is analogous to the punctuation in names as separators in an
input format. The parser in your brain needs the punctuation in order
to understand the intended meaning of these sentences. The parser in a
reference manager needs the periods/spaces in order to understand the
name parts (that is, if you use an odd input format like RIS that does
not have better means to separate the parts, like XML). That's why they
need to be consistent and independent of the personal taste of the
person who carries that name. As your example above shows, you can't
move punctuation around in English sentences just because you'd
personally like to have it somewhere else. The same applies to names.
regards,
Markus
-- 
Markus Hoenicka
mar...@ca...
(Spam-protected email: replace the quadrupeds with "mhoenicka")
http://www.mhoenicka.de
|
|
From: Marc H. <mar...@fr...> - 2004-01-14 17:27:16
|
On Sat, 10 Jan 2004, Markus Hoenicka wrote:
> Marc Herbert writes:
> > We had one problem only Janet knew we faced bankruptcy
> >
> > Have you decided? Now consider this string again with differing
> > punctuation:
> >
> > We had one problem: only Janet knew we faced bankruptcy.
> > We had one problem only: Janet knew we faced bankruptcy.
> > We had one problem only, Janet knew: we faced bankruptcy.
> > We had one problem only Janet knew we faced: bankruptcy.
> >
>
> This is analogous to the punctuation in names as separators in an
> input format. The parser in your brain needs the punctuation in order
> to understand the intended meaning of these sentences. The parser in a
> reference manager needs the periods/spaces in order to understand the
> name parts (that is, if you use an odd input format like RIS that does
> not have better means to separate the parts, like XML). That's why they
> need to be consistent and independent of the personal taste of the
> person who carries that name.
Yeah, and that's why periods suck as a separator. Because:
- for RIS, the period is a separator;
- for some publishers, it's a decoration ("formatting");
- for authors, it's a *consistent* way to inform about an
  abbreviation, independent of their personal taste (well... not for
  Truman, granted).
This is asking too much of the period sign. Clashes are impossible
to avoid. The 1st and 2nd uses co-exist peacefully in refdb. My patch
drops them in favor of the 3rd, because I don't care about the _given_
name parsing in refdb. That's all. Don't take offense because I
dropped a part of your code. I still enjoy all the rest very much.
> As your example above shows, you can't
> move punctuation around in English sentences just because you'd
> personally like to have it somewhere else. The same applies to names.
Agreed: a period after a capital means an abbreviation, so a lack of
period means... no abbreviation! You can't just pop it up or down for
RIS or publishers' reasons. This is my point of view. Of course,
the (your) point of view of a RIS parser is totally different and
incompatible with mine. So what?
I am not trying to convince you that "I am right", just trying to make
you discover and understand a different point of view. Please stop
trying to prove that my point of view is nonsense (or do it for
good). It's just a slightly different use of your software. May I?
Cheers,
Marc.
|
|
From: Markus H. <mar...@mh...> - 2004-01-10 01:28:28
|
Marc Herbert writes:
> In "Harry S Truman", the S is not an abbreviation. There has been a
> debate whether it should nevertheless be written "S." "for the sake
> of consistency", at the price of some (admittedly harmless)
> information loss. See google.
>
Citing from the Truman Presidential Museum and Library
(http://www.trumanlibrary.org/speriod.htm):
"In recent years the question of whether to use a period after the "S"
in Harry S. Truman's name has become a subject of controversy,
especially among editors. The evidence provided by Mr. Truman's own
practice argues strongly for the use of the period. While, as many
people do, Mr. Truman often ran the letters in his signature together
in a single stroke, the archives of the Harry S. Truman Library has
numerous examples of the signature written at various times throughout
Mr. Truman's lifetime where his use of a period after the "S" is very
obvious."
This doesn't mean there can't be other examples of non-abbreviated
single-letter middlenames, but Truman apparently is not one of them.
>
> > An initial is a capital letter by definition.
>
> But the reverse is wrong. A capital letter is not an initial by
> definition. It just may be. A capital letter + a period is an initial
> by definition.
> See: <http://www.cogs.susx.ac.uk/local/doc/punctuation/node28.html>
>
I disagree. It's an initial followed by an indicator that the previous
letter is an abbreviation of something else. The difference should be
apparent if we think of an initial as data and what an output format
is supposed to do with it. Format 1 outputs initials as they are,
format 2 renders them using dots:
FM Last
F.M.Last
The dots (or the lack of dots) are the formatting, the capital letter
is the data.
> See: DJ Delorie
This name is handled gracefully by RefDB. It is not mangled in any
way.
> This period says: "this letter before stands for an abbreviation".
> It's formatting, carrying an information. Some stylesheets may not
> care about this information, preferring esthetics, while some others
> stylesheets may care. But a database or a format should better stay
> _neutral_ and postpone the decision, so to please _everyone_, not just
> one side.
This is not the point. The RIS format cannot make the distinction. An
XML format specifically designed for this purpose will be able to.
> The reasons why I do not want to use the RIS middlenames period-based
> syntax are quite obvious above. Periods carry some information, and
> middlenames are culture-specific.
>
Fine, so let's wait until the MODS-based data model materializes.
>
> > I'm surprised that this seems new to you.
>
> Well, I must admit that I found the period-based RIS syntax a bit
> weird when discovering it at first in refdb's manual, but
> I unfortunately overlooked the potential implications at this time.
> Especially since I did not see it later anywhere else. By the way, do
> you have pointers to some other "official" RIS specification? Are
> others' definition strictly identical?
>
To the best of my knowledge there is no other official spec except the
help files, the PDF manual, and the example databases that they ship
with the program. In addition it is helpful to see what the individual
styles do to the data upon output. So it's rather reverse engineering
than a useful spec.
> BTW, how do you avoid false duplicates in this case? By asking every
> bibliographer to use the real, original cyrillic spelling?
No, by asking them to settle on one transliteration. Using MODS,
however, the cyrillic spelling is an option too as it has some means
to carry the transliteration.
> > And again, RefDB will not support names that can't be expressed in
> > RIS syntax until a MODS-based data format is implemented.
>
> Well... my patched version tries to support them :-> It is its main
> purpose.
>
No, it does not support them. The patch prevents RefDB from
understanding the names. Instead you dumb down the application to a
state that it returns the same string that you sent in. However, in
order to do anything useful with the names, RefDB must be able to
parse them. The patch effectively prevents creating formatted
bibliographies and export to all data formats that distinguish name
parts.
You basically try to send a program written in Perl through a C
compiler. You notice that the C parser can't handle the Perl code, so
you decide to disable the parser and hope that the compiler will be
able to figure out the grammar all by itself. This is not going to
work. The only fix is to rewrite the program in C syntax.
> Wow... this is becoming harder and harder to understand (I mean:
> authorinfo.c r1.4, lines 70s) Could you document this new "hyphenated
> double initials" format carefully please? Is the hyphen mandatory? Is
> "H.K." now legal input? Or just "H-K" is? Is this RISX output legal
> RISX input? etc.
There is not much to document. A firstname and a middlename are
abbreviated without a hyphen because there is none in the first place:
Franklin Delano -> F.D. (RIS)
                   <firstname>F</firstname><middlename>D</middlename> (RISX)
A hyphenated double name retains the hyphen in the initialized form
because there is a hyphen in the first place:
Karl-Heinz -> K.-H. (RIS)
              <firstname>K-H</firstname> (RISX)
That's all.
regards,
Markus
-- 
Markus Hoenicka
mar...@ca...
(Spam-protected email: replace the quadrupeds with "mhoenicka")
http://www.mhoenicka.de
|
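The two RIS examples given in the message above (Franklin Delano ->
F.D. and Karl-Heinz -> K.-H.) can be reproduced with a few lines. This is
a sketch of the rule as stated in the text, with a hypothetical function
name; it is not a copy of what authorinfo.c does and it ignores every
other case.

```python
def initialize_given(given: str) -> str:
    """Abbreviate given names RIS-style, keeping hyphens that were typed."""
    initials = []
    for token in given.split():
        # "Karl-Heinz" stays one token; each hyphenated part gets its own
        # initial and the hyphen is retained between them.
        parts = [part for part in token.split("-") if part]
        initials.append("-".join(f"{part[0]}." for part in parts))
    # Space-separated names are simply concatenated: "F." + "D." -> "F.D."
    return "".join(initials)


two_names = initialize_given("Franklin Delano")
double_name = initialize_given("Karl-Heinz")
```

Going the other way is not possible from the abbreviated form alone, which
is why the abbreviation is best treated as a derived value.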
|
From: Marc H. <mar...@fr...> - 2004-01-14 17:24:32
|
On Sat, 10 Jan 2004, Markus Hoenicka wrote:
> Marc Herbert writes:
> > In "Harry S Truman", the S is not an abbreviation. There has been a
> > debate whether it should nevertheless be written "S." "for the sake
> > of consistency", at the price of some (admittedly harmless)
> > information loss. See google.
> Citing from the Truman Presidential Museum and Library
> (http://www.trumanlibrary.org/speriod.htm):
>
> "In recent years the question of whether to use a period after the "S"
> in Harry S. Truman's name has become a subject of controversy,
> especially among editors. The evidence provided by Mr. Truman's own
> practice argues strongly for the use of the period. While, as many
> people do, Mr. Truman often ran the letters in his signature together
> in a single stroke, the archives of the Harry S. Truman Library has
> numerous examples of the signature written at various times throughout
> Mr. Truman's lifetime where his use of a period after the "S" is very
> obvious."
> This doesn't mean there can't be other examples of non-abbreviated
> single-letter middlenames, but Truman apparently is not one of them.
I suggest you go on reading the same page, just a couple of lines
further:
"In explanation he said that the "S" did not stand for any name"
And a bit later:
"According to The Chicago Manual of Style all initials given with a
name should "for convenience and consistency" be followed by a period
even if they are not abbreviations of names."
This last sentence says a bunch of interesting things:
- stylesheets that "normalize" on "no-periods" are neither
  "convenient" nor "consistent" according to this manual (Uh?)
- the last line shows that the Chicago Manual knows enough cases of
  single letters in names, besides Truman's, to have considered this
  issue. That answers one of your questions.
- the need for this justification for the "all-periods" policy shows
  that it was not obvious and that there has been a debate; please tell
  about what, if not loss of information?
And finally, it's a recommendation made by a manual of *style*, and
not a recommendation about the design of bibliographic databases.
> > > An initial is a capital letter by definition.
> >
> > But the reverse is wrong. A capital letter is not an initial by
> > definition. It just may be. A capital letter + a period is an initial
> > by definition.
> > See: <http://www.cogs.susx.ac.uk/local/doc/punctuation/node28.html>
> I disagree. It's an initial followed by an indicator that the previous
> letter is an abbreviation of something else. The difference should be
> apparent if we think of an initial as data and what an output format
> is supposed to do with it. Format (1) outputs initials as they are,
> format (2) renders them using dots:
> (1) DJ Last
> (2) D.J.Last
... which clearly shows that format (1) lost the information that 'D'
and 'J' are abbreviations. You have to guess it. Easy for 99% of
names. And the remaining 1% does not matter: these people should
probably better get a life and normalize their names anyway... By the
way, I am wondering what "uppercase" means in unicode.
> > See: DJ Delorie
>
> This name is handled gracefully by RefDB. It is not mangled in any
> way.
Sure! But the topic here was "is a period information?" (see the
Subject:). This example just demonstrates the issue with stylesheets
that think periods are just formatting and suppress them, losing
information. They can not make the difference between:
DJ Delorie -> DJ Delorie (OK)
and
Dorothy J. Delorie -> DJ Delorie (loss of periods/information)
I obviously never asked you to correct these stylesheets! In fact, I
even gave up asking you to change _anything_ in refdb about this quite
a while ago (except some added words in the documentation). I just
recently sent to the list a patch not to lose the information in
advance in the database, with admitted and documented shortcomings.
Then you felt compelled to demonstrate this patch is the apocalypse,
giving it probably more attention and publicity than it deserves.
> > > And again, RefDB will not support names that can't be expressed in
> > > RIS syntax until a MODS-based data format is implemented.
> >
> > Well... my patched version tries to support them :-> It is its main
> > purpose.
> No, it does not support them. The patch prevents that RefDB
> understands the names. Instead you dumb down the application to a
> state that it returns the same string that you sent in. However, in
> order to do anything useful with the names, RefDB must be able to
> parse them. The patch effectively prevents creating formatted
> bibliographies and export to all data formats that distinguish name
> parts.
At least half of this is plain wrong. My patch only prevents RefDB
from parsing the GIVEN NAME, and only FROM RIS INPUT. I suggest you
read this page: <http://marc.herbert.free.fr/refdb/reversible/> that
describes the patch with decent accuracy, contrary to your paragraph
above. Reading it may be useful if you want to make comments (but you
don't have to make comments).
Cheers,
Marc
|
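The collision described in the message above can be shown in a few
lines. The function below is a hypothetical "no-periods" style invented
for this illustration, not any real stylesheet: it keeps an all-capitals
token such as "DJ" as it is and reduces every other given name to its
first letter.

```python
def no_periods_style(given: str) -> str:
    """Hypothetical style: initials without periods, run together."""
    rendered = []
    for token in given.split():
        bare = token.rstrip(".")
        if bare.isalpha() and bare.isupper():
            rendered.append(bare)      # "DJ" or "J." -> kept as letters
        else:
            rendered.append(bare[0])   # "Dorothy" -> "D"
    return "".join(rendered)


legal_name = no_periods_style("DJ")
abbreviated = no_periods_style("Dorothy J.")
```

Both inputs end up as the same two letters, so nothing downstream of this
style can tell the two people apart or restore the periods.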
|
From: Markus H. <mar...@mh...> - 2003-12-11 21:04:01
|
Marc Herbert writes:
> I am glad to hear this! Then fix the _automated_ RIS parsing/syntax by
> adding a comma to it?
>
Where would you like to have an additional comma? I'd be reluctant to
do this anyway as this would break data import from RefMan and
EndNote, but the RIS syntax uses two commas anyway. One to separate
the last name from the rest, and one to separate the suffix from the
rest.
> I did not know publishers of 5000 life sciences journals were so
> english-centric and ignorant of foreign cultures. This bug is quite
> amazing.
>
It's sad but I don't see it as my job to change this.
>
> > I also believe that your argument is moot that if a
> > style requires the concept of middle names it should be able to
> > retrieve the middle name by itself. With the same argument you could
> > dump entirely unparsed strings in any order onto a bib software and
> > expect it to figure out how to parse it, as it requires to distinguish
> > between given and family names, titles and suffixes.
>
> If I remember well, this discussion is about the right level of detail
> to adopt and where. So I find "With the same argument you could dump
> entirely unparsed strings" not very constructive.
>
I just wanted to point out that it does not make much sense to me to
code half-parsed strings in XML when you have to parse anyway. Why not
go the extra inch and do it right?
> This life-science-specific "middlename parsing" could be factorized
> without being put down to the database. So refdb could be used
> internationally without bugs and hassles: just by working around it.
> Why not adding a "-[no]middlename" option for outputs ?
>
> Same thing for the "clever" abbreviating code.
>
The middlename handling and abbreviating stuff is not at your
discretion. If a style requires these modifications it does not make
any sense to add a switch that will produce incorrect data.
> No it's not too late: you can also play the same game with dots and
> spaces later at search/formatting time, without subtly and silently
> modifying the data that the user intently input; that is losing
> information really.
>
That is, re-parse the name string each time a query comes in? It
couldn't come any worse.
> Please do never silently and subtly modify user data. At least ask for
> confirmation! The real world is too complex for any "clever" names
> standardization algorithm.
I'll be happy to add a section to the docs in all caps and a red box
around it stating that author names will be normalized for the sake of
consistency.
> OK: I suggest one *extremely* simple improvement to this code: the
> ability to disable it, at least at configure time (I will code this
> for myself in any case).
>
This does not make sense as it breaks consistent searching and the
bibliography formatting. Otherwise this is an example of the beauty of
free software. If you code this for yourself, everyone can have it his
way.
regards,
Markus
-- 
Markus Hoenicka
mar...@ca...
(Spam-protected email: replace the quadrupeds with "mhoenicka")
http://www.mhoenicka.de
|
|
From: Marc H. <mar...@en...> - 2003-12-19 22:44:30
|
On Thu, 11 Dec 2003, Markus Hoenicka wrote:
> Marc Herbert writes:
> > I am glad to hear this! Then fix the _automated_ RIS parsing/syntax by
> > adding a comma to it?
> Where would you like to have an additional comma? I'd be reluctant to
> do this anyway as this would break data import from RefMan and
> EndNote, but the RIS syntax uses two commas anyway. One to separate
> the last name from the rest, and one to separate the suffix from the
> rest.
I was suggesting a comma between each firstname or middlename, in order
to have (at least) the same middlename data model in both RISX and
RIS, and an unambiguous RIS syntax.
The reason why it would break import from RefMan seems quite obvious
to me: according to this documentation, RefMan does NOT support
so-called "middlenames".
<http://www.refman.com/support/risformat_tags_02.asp>
"For Firstname, you can use full names, initials, or both."
How do people in life sciences work with RefMan? It would be
interesting to know.
> > I did not know publishers of 5000 life sciences journals were so
> > english-centric and ignorant of foreign cultures. This bug is quite
> > amazing.
> >
>
> It's sad but I don't see it as my job to change this.
So be happy that: it's absolutely not what I was asking for (see
previous messages).
> I just wanted to point out that it does not make much sense to me to
> code half-parsed strings in XML when you have to parse anyway. Why not
> go the extra inch and do it right?
Because the concept of middlenames is not part of any data model
(except risx), but only of some specific _formatting_ needs.
> The middlename handling and abbreviating stuff is not at your
> discretion. If a style requires these modifications it does not make
> any sense to add a switch that will produce incorrect data.
Yes, because other styles will require something else. Thus a "switch"
to satisfy all of them. The "--[not]-life-sciences" switch :-)
> > No it's not too late: you can also play the same game with dots and
> > spaces later at search/formatting time, without subtly and silently
> > modifying the data that the user intently input; that is losing
> > information really.
> That is, re-parse the name string each time a query comes in? It
> couldn't come any worse.
I found it very interesting to note that this "so bad" re-parsing is
exactly what happens in _today's_ code, in the case of several
middlenames. Search for "strtok" in:
<http://cvs.sourceforge.net/viewcvs.py/*checkout*/refdb/refdb/src/backend-risx.c?content-type=text%2Fplain&rev=1.20>
I know: you will change this later. But still, it seems to work today.
> > Please do never silently and subtly modify user data. At least ask for
> > confirmation! The real world is too complex for any "clever" names
> > standardization algorithm.
>
> I'll be happy to add a section to the docs in all caps and a red box
> around it stating that author names will be normalized for the sake of
> consistency.
Thanks in advance! (I consider this a minimum before modifying user
data).
> > OK: I suggest one *extremely* simple improvement to this code: the
> > ability to disable it, at least at configure time (I will code this
> > for myself in any case).
> This does not make sense as it breaks consistent searching and the
> bibliography formatting.
"Consistent searching" across... different refdb installations !?
> Otherwise this is an example of the beauty of free software. If you
> code this for yourself, everyone can have it his way.
Sure ! I will, I will... Time for a "contrib/" directory ? :-)
Cheers,
Marc.
|
|
From: Markus H. <mar...@mh...> - 2003-12-20 23:39:40
|
Hi Marc,

Marc Herbert writes:
> The reason why it would break import from RefMan seems quite obvious
> to me: according to this documentation, RefMan does NOT support
> so-called "middlenames".
> <http://www.refman.com/support/risformat_tags_02.asp>
> "For Firstname, you can use full names, initials, or both."
>

You'll have to look a little closer than that, and maybe get some
hands-on experience with these kinds of tools. Middle names are
supported implicitly by assuming the first non-lastname is the first
name and any other non-lastname is a middle name. This is e.g. very
apparent if you look at the RefMan style definitions which support the
formatting of last, first, and middle names (using exactly these
terms).

> > I just wanted to point out that it does not make much sense to me to
> > code half-parsed strings in XML when you have to parse anyway. Why not
> > go the extra inch and do it right?
>
> Because the concept of middlenames is not part of any data model
> (except risx), but only of some specific _formatting_ needs.
>

We're running in circles, I guess. These specific formatting needs
imply that your data model allows you to distinguish the parts of the
data which need to be formatted differently. You would never expect
the DocBook stylesheets to format a plain text file successfully, but
for some reason you expect this for author names given more or less as
plain text.

> > The middlename handling and abbreviating stuff is not at your
> > discretion. If a style requires these modifications it does not make
> > any sense to add a switch that will produce incorrect data.
>
> Yes, because other styles will require something else. Thus a "switch"
> to satisfy all of them. The "--[not]-life-sciences" switch :-)
>

I don't see your point here. If all non-life sciences applications do
not require the distinction between first and middle names, their
style specifications will be a little simpler, that's all.

> > That is, re-parse the name string each time a query comes in? It
> > couldn't come any worse.
>
> I found it very interesting to note that this "so bad" re-parsing is
> exactly what happens in _today's_ code, in the case of several
> middlenames. Search for "strtok" in:
> <http://cvs.sourceforge.net/viewcvs.py/*checkout*/refdb/refdb/src/backend-risx.c?content-type=text%2Fplain&rev=1.20>
>
> I know: you will change this later. But still, it seems to work today.
>

No, I was talking about the SQL query that tries to match the incoming
query against the available datasets. This is currently done against
the normalized representation of the full name. No re-parsing happens
at this stage as it would grossly affect the performance.

Things are a little different if we're talking about generating output
from these data. Middle names are currently stored as a list of tokens
in a single field. I believe (that is, I didn't run any benchmarks)
that tokenizing this list for those backends that actually require
this is faster than using an additional table plus joins for all
backends, even for those that don't bother. The backends that you'll
be using most of the time (scrn or html for locating references) use
the normalized representation and hence do not tokenize the middle
name list.

> > > OK: I suggest one *extremely* simple improvement to this code: the
> > > ability to disable it, at least at configure time (I will code this
> > > for myself in any case).
>
> > This does not make sense as it breaks consistent searching and the
> > bibliography formatting.
>
> "Consistent searching" across... different refdb installations!?
>

Consistent searching across all names.

> > Otherwise this is an example of the beauty of free software. If you
> > code this for yourself, everyone can have it his way.
>
> Sure! I will, I will...
>
> Time for a "contrib/" directory? :-)
>

I'd be very reluctant to add code to a contrib directory that would
not work with the rest of the application.

regards,
Markus

--
Markus Hoenicka
mar...@ca...
(Spam-protected email: replace the quadrupeds with "mhoenicka")
http://www.mhoenicka.de
|
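The storage scheme Markus describes in this message (several middle names kept as a list of tokens in one field, split only by the backends that need separate elements) can be pictured with a short Python sketch. This is not RefDB code: RefDB is written in C, and the function names `middlename_elements` and `author_element` are invented here for illustration. The element names are the RISX ones used elsewhere in this thread.

```python
from typing import List
from xml.sax.saxutils import escape


def middlename_elements(middlenames_field: str) -> List[str]:
    """Split a space-separated middle-name field into <middlename> elements."""
    # str.split() without arguments ignores leading, trailing and repeated
    # spaces, so a stored value such as "K Jerry " yields exactly two tokens.
    return [f"<middlename>{escape(tok)}</middlename>"
            for tok in middlenames_field.split()]


def author_element(last: str, first: str, middlenames_field: str) -> str:
    """Assemble an <author> element from the three name-part columns."""
    parts = [f"<lastname>{escape(last)}</lastname>"]
    if first:
        parts.append(f"<firstname>{escape(first)}</firstname>")
    parts.extend(middlename_elements(middlenames_field))
    return "<author>" + "".join(parts) + "</author>"


if __name__ == "__main__":
    print(author_element("Chu", "H", "K Jerry"))
```

The tokenizing cost is paid only by callers of `middlename_elements`; a backend that prints the stored full-name string never calls it, which is the trade-off argued for above.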
|
From: Marc H. <mar...@fr...> - 2004-01-05 22:44:30
Attachments:
reversible_names094-pre3.patch
|
> > I'll be happy to add a section to the docs in all caps and a red box
> > around it stating that author names will be normalized for the sake of
> > consistency.

I could not find this yet in
<http://refdb.sourceforge.net/manual-0.9.4/book1.html>

Explaining "how" they are normalized also seems rather vital to me.

> > > OK: I suggest one *extremely* simple improvement to this code: the
> > > ability to disable it, at least at configure time (I will code this
> > > for myself in any case).

> > Otherwise this is an example of the beauty of free software. If you
> > code this for yourself, everyone can have it his way.

It's done. See: <http://marc.herbert.free.fr/refdb/reversible/>
or below/attached.

Comments welcome (including from you, Markus :-)

BTW, while testing and comparing, I found some quirks that do not seem
to fit _any_ logic (as opposed to: not fit my taste).

Cheers,

Marc.

----------------------------------------------------------

The "reversible" refdb patch

Marc Herbert
$Date: 2004/01/05 21:30:50 $
$Revision: 1.2 $


---- The issue ----

Currently, refdb tries to "normalize" authors' names entered in the
database, in order to avoid false duplicates and maybe to cope with
weird requirements of some bibliographic stylesheets. This means
fiddling with full stops and so-called "middlenames".

I think refdb should either reliably perform this normalization
according to a documented, reviewed and formal specification
-- or not at all. Today it does it in an undocumented way,
silently modifying some user data with potential information loss in
corner cases.

This (short and simple) refdb patch disables all modifications of
user data, and lets the user decide by himself how names should be
"normalized" (assuming it's both desirable and possible).
Thanks to it, what gets _in_ refdb, gets _out_ untouched.
For instance, if you enter "Harry S Truman" in refdb, you would get back:
 - without this patch:  "Harry S. Truman"
 - with this patch:     "Harry S Truman"   (amazing! and "reversible"...)

Warning: this patch may or may not break further formatting by some
bibliographic stylesheets, depending on whether they expect
"normalized" names from the database. I do not care much about
breaking stylesheets that want you to change the way you write your
name (probably in a more "english" way). I do not mind if they munge
names when formatting for publication, but pushing this
"normalization" up to the database is not acceptable to me. After all,
respectful and less rigid formatting tools also (co-)exist. The answer
to this question is likely to be in the following function:
backend-dbiba.c:format_firstmiddlename()

By the way, be aware that you should NOT use spaces at the beginning
or at the end of RISX <name>(s), since this will lead to false
duplicates in the database _independently from this patch_. On the
other hand, RIS input (AU - field) is more or less space-insensitive.

This patch is compatible with version 0.9.4-pre3, and _not_ with
version 0.9.3.

Users (yet...) satisfied with current refdb behaviour and thus not
directly interested in this patch may still be interested in
understanding how their data is modified; just having a look at this
patch will provide detailed answers. The summary of changes just below
also explains (in english instead of C).

This patch also disables middlename(s) input in the RIS format, due to
a flawed RIS input syntax, and due to their controversial nature (see
http://sourceforge.net/mailarchive/forum.php?forum_id=1798&viewmonth=200312);
all RIS "given names" go together untouched into the "firstname"
database field.

On the other hand, RISX <middlename>s are not disabled by this
patch. To disable middlenames in RISX, just... don't use the tag
<middlename>.

---- Detailed issues and modifications ----

The SQL database uses 4 (redundant) fields to store author names:
 fullname, lastname, firstname, middlenameS

__________________________
Modifications to RIS input
(i.e., "addref -t ris")

firstname/middlenames parsing is disabled.
 - the patch disables fiddling with full stops.
 - middlenames are disabled: inside the AU field, the whole "given
   name" as delimited by commas, goes into the "firstname" database
   field.

RIS input examples

  Smith, F.M.N.
  Chu, H.K. Jerry
  Truman, Harry S

-> database results

  official : "Smith,F.M.N."    "Smith"   "F"       "M N"
  patched  : "Smith,F.M.N."    "Smith"   "F.M.N."

  official : "Chu,H.K.Jerry"   "Chu"     "H"       "K Jerry "
  patched  : "Chu,H.K.Jerry"   "Chu"     "H.K.Jerry"

  official : "Truman,Harry S." "Truman"  "Harry"   "S "
  patched  : "Truman,Harry S"  "Truman"  "Harry S"

(also notice the spurious space ending some middlenames with the
official version).

____________________________
Modifications to RISX input
(i.e., "addref -t risx")

 - full stops "tricks" are disabled

RISX input examples

  "Smith"  "F."    "M." "N."
  "Truman" "Harry" "S"
  "Chu"    "H.K."  "Jerry"

-> database results

  official : "Smith,F.M.N."    "Smith"   "F"     "M N"
  patched  : "Smith,F. M. N."  "Smith"   "F."    "M. N."

  official : "Truman,Harry S." "Truman"  "Harry" "S"
  patched  : "Truman,Harry S"  "Truman"  "Harry" "S"

  official : "Chu,H.Jerry"     "Chu"     "H"     "Jerry"  (information loss!)
  patched  : "Chu,H.K. Jerry"  "Chu"     "H.K."  "Jerry"

_______
Outputs

No output except bibtex's is modified.

RIS output dumps "as is" the first field of the SQL database
(fullname).

RISX output uses the 3 other fields (last, first, middles). It dumps
last and firstname untouched, then parses the "middlenames" field
according to spaces before dumping <middlename>s elements.

The patch modifies neither RIS nor RISX output.

Most other outputs also work one way or the other, and are not
modified by the patch.
However, for some unknown reason, bibtex output pulls the fullname
from the database and parses it again, so a small patch was needed
here again to prevent the addition of full stops.

__________
Convertors

The "nmed2ris" convertor also fiddles with authors' names in a similar
way. I cannot yet say more about this, sorry: I do not use the
MED format at all and could not test modifications.

________
Feedback

Since all this is unfortunately complicated, the probability that I
missed something despite all my efforts is non-zero. I thank you in
advance for any feedback.

___________________________
The art of Unix Programming

Some food for thought from:
<http://catb.org/~esr/writings/taoup/html/ch01s06.html>

Rule of Transparency: design for visibility to make inspection and
debugging easier.

  For a program to demonstrate its own correctness, it needs to be
  using input and output formats sufficiently simple so that the
  proper relationship between valid input and correct output is easy
  to check.

Rule of Least Surprise: In interface design, always do the least
surprising thing.
|
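The "official" and "patched" rows for RIS input in the patch description above can be approximated in a few lines of Python. This is a sketch under assumptions, not the C code of RefDB or of the patch: `split_ris_author`, `parse_splitting` and `parse_verbatim` are hypothetical names, the splitting rule (periods and spaces both act as separators) is inferred only from the example rows, and the spurious trailing spaces shown in the official rows are not reproduced.

```python
import re
from typing import Tuple


def split_ris_author(au: str) -> Tuple[str, str, str]:
    """Split an AU value at its (at most two) commas: last, given, suffix."""
    parts = [p.strip() for p in au.split(",", 2)]
    parts += [""] * (3 - len(parts))
    return parts[0], parts[1], parts[2]


def parse_splitting(au: str) -> Tuple[str, str, str]:
    """'Official'-like behaviour: chop the given names at dots and spaces."""
    last, given, _suffix = split_ris_author(au)
    tokens = [t for t in re.split(r"[.\s]+", given) if t]
    first = tokens[0] if tokens else ""
    middles = " ".join(tokens[1:])  # stored as one space-separated field
    return last, first, middles


def parse_verbatim(au: str) -> Tuple[str, str, str]:
    """'Patched'-like behaviour: the whole given-name part is the firstname."""
    last, given, _suffix = split_ris_author(au)
    return last, given, ""


if __name__ == "__main__":
    for au in ("Smith, F.M.N.", "Chu, H.K. Jerry", "Truman, Harry S"):
        print(au, "->", parse_splitting(au), "vs", parse_verbatim(au))
```

With the splitting variant, "Chu, H.K. Jerry" comes out as first name "H" plus middle names "K Jerry"; with the verbatim variant nothing inside the given-name part is interpreted at all, which is the trade-off the patch makes.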
|
From: Markus H. <mar...@mh...> - 2004-01-06 15:40:19
|
Marc Herbert writes:
>
> > > I'll be happy to add a section to the docs in all caps and a red box
> > > around it stating that author names will be normalized for the sake of
> > > consistency.
>
> I could not find this yet in
> <http://refdb.sourceforge.net/manual-0.9.4/book1.html>
>
> Explaining "how" they are normalized also seems rather vital to me.
>

Sorry. Didn't get round to it yet.

> BTW, while testing and comparing, I found some quirks that do not seem
> to fit _any_ logic (as opposed to: not fit my taste).
>

In this case you should document them and file a bug report instead of
spreading FUD.

> ----------------------------------------------------------
>
> The "reversible" refdb patch
>
> Marc Herbert
> $Date: 2004/01/05 21:30:50 $
> $Revision: 1.2 $
>
>
> ---- The issue ----
>
> Currently, refdb tries to "normalize" authors' names entered in the
> database, in order to avoid false duplicates and maybe to cope with
> weird requirements of some bibliographic stylesheets. This means
> fiddling with full stops and so-called "middlenames".
>
> I think refdb should either reliably perform this normalization
> according to a documented, reviewed and formal specification
> -- or not at all. Today it does it in an undocumented way,
> silently modifying some user data with potential information loss in
> corner cases.
>

The purpose of the name mangling is to reduce all names consistently
to the RIS input format. This is currently the common denominator of
both RIS and RISX input until a richer data format like MODS is
implemented. If the name mangling is not consistent, then it is a bug
that needs to be fixed, not a feature that needs to be removed.

The bottom line is: if you supply your RIS data according to the RIS
input format, they won't be fiddled with at all. If you use a
different format, e.g. by leaving out periods or by adding random
spaces, RefDB attempts to mangle the data until they fit the RIS input
format.
This works in many cases, but may fail in border cases.

The important thing to understand is that the dots and spaces used in
the RIS input format do not have anything to do with the final
representation of a name in a formatted bibliography. The sole purpose
of the dots and spaces is to separate the name parts in order to tell
the parser where to chop. You could use slashes or question marks just
as well. As it is the job of a bibliography software to output the
author names in all possible formatting variations, it is essential
not to store pre-formatted data in the database. However, it may be
useful (see below) to store pre-parsed data.

The same principle basically applies to the RISX input
format. However, the RISX format provides separate elements for the
name parts, so there is no need for textual separators at all. There
is no point to enter a middle initial as
<middlename>B.</middlename>. The middle initial is "B", not "B.". "B."
is a representation of a middle name which is used in some
bibliography styles (others don't use the dot or leave out the middle
name altogether) and can be trivially generated from "B". Therefore, a
<middlename>B</middlename> is all you need. If RefDB detects the
superfluous dot, it will remove it.

> This (short and simple) refdb patch disables all modifications of
> user data, and lets the user decide by himself how names should be
> "normalized" (assuming it's both desirable and possible).
> Thanks to it, what gets _in_ refdb, gets _out_ untouched.
> For instance, if you enter "Harry S Truman" in refdb, you would get back:
>  - without this patch:  "Harry S. Truman"
>  - with this patch:     "Harry S Truman"   (amazing! and "reversible"...)
>

Now we get to the purpose of normalization. As stated above, the data
in the AU field of a RIS dataset or an <author> element are not
strings that are supposed to be inserted into a bibliography as they
are.
They are input formats that supply data (the name parts) for one
object in the database (an author). If an author has several reference
entries in the database, these entries must link to the same object
(the author), not to a specific representation of the author's
name. Assume the following cases:

  Truman,Harry S.
  Truman,Harry S
  Truman, Harry S
  Truman, Harry S.

The first one is what the RIS input format asks for. The others aren't
that different except for a space or a dot here and there. If these
belong to four references among 100, you probably wouldn't even notice
that the author names are written differently, although it is clear
that they mean the same author. If you add these four datasets to
RefDB, the first entry won't be mangled at all (as it sticks to the
rules). The other entries are normalized, and as a consequence, all
four references link to the same author. The normalized internal
representation of the author name is "Truman,Harry S." (amazing! and
"reversible"...). If you go ahead and prevent this normalization, the
four references will point to four different author objects, one with
the representation "Truman,Harry S.", another one with the
representation "Truman,Harry S", and so forth. If you now run a query
for references by some "Truman,Harry S.", you'll miss 75% of the
possible hits. This is not good. You can obviously work around this
weakness of the patch by running all queries against regular
expressions, but this is not an option if you design a simplified
interface that allows users to pick names from a list (something Mike
is currently working on).

> Warning: this patch may or may not break further formatting by some
> bibliographic stylesheets, depending on whether they expect
> "normalized" names from the database. I do not care much about
> breaking stylesheets that want you to change the way you write your
> name (probably in a more "english" way). I do not mind if they munge
> names when formatting for publication, but pushing this
> "normalization" up to the database is not acceptable to me. After all,
> respectful and less rigid formatting tools also (co-)exist. The answer
> to this question is likely to be in the following function:
> backend-dbiba.c:format_firstmiddlename()

This is the key point why we have to argue at all. You do not
understand that the database does not contain a formatted string that
shows how you would like to see your name printed on a piece of
paper. The database contains the name parts, plus a normalized
representation for speeding up queries that happens to look like some
formatted representation. When creating a bibliography, RefDB then has
to assemble the name parts in a fashion that matches the requirements
of the publisher. It is irrelevant how the cited author or the author
writing the paper would like to represent that name.

>
> By the way, be aware that you should NOT use spaces at the beginning
> or at the end of RISX <name>(s), since this will lead to false
> duplicates in the database _independently from this patch_. On the
> other hand, RIS input (AU - field) is more or less space-insensitive.
>

The RIS input is insensitive to leading and trailing spaces as the
latter are basically invisible in this input format. I have not
anticipated that anyone would add stray spaces to XML elements as they
are easily detected, but if this is a common problem it could be
handled just as well.

>
>
> The SQL database uses 4 (redundant) fields to store author names:
>  fullname, lastname, firstname, middlenameS
>

The columns are not redundant. Redundancy implies that they hold the
same information but this is not the case. author_lastname,
author_firstname, and author_middlename hold the pre-parsed name parts
which are different by definition. The author_name field holds the
normalized representation of the full name or a corporate name.
The latter doesn't have name parts but it can't go into
e.g. author_lastname either as we have to distinguish between authors
that have only one name and corporate names. The only redundancy in
this setup is that a non-corporate name could be assembled from the
name parts. However, author names are usually added once and then
queried each time someone requests a reference or a bibliography
containing that name. For the sake of speed it makes sense to parse
the name once (when you add it) instead of each time it is retrieved.

> __________________________
> Modifications to RIS input
> (i.e., "addref -t ris")
>

[...]

> RIS input examples
>
>   Smith, F.M.N.
>   Chu, H.K. Jerry
>   Truman, Harry S
>
> -> database results
>
>   official : "Smith,F.M.N."    "Smith"   "F"       "M N"
>   patched  : "Smith,F.M.N."    "Smith"   "F.M.N."
>
>   official : "Chu,H.K.Jerry"   "Chu"     "H"       "K Jerry "
>   patched  : "Chu,H.K.Jerry"   "Chu"     "H.K.Jerry"
>
>   official : "Truman,Harry S." "Truman"  "Harry"   "S "
>   patched  : "Truman,Harry S"  "Truman"  "Harry S"
>
> (also notice the spurious space ending some middlenames with the
> official version).

These spaces are due to a bug introduced after adding support for
multiple middle names. Fixed in CVS.

Please note that the last output of the patched version does not
follow the RIS specs, therefore it is not clear whether RefMan,
EndNote and the like import this properly.

>
>
> ____________________________
> Modifications to RISX input
> (i.e., "addref -t risx")
>
>  - full stops "tricks" are disabled
>

As stated above, you should not use periods anyway as they are not
required. Following this simple rule will make most of your complaints
obsolete.

> RISX input examples
>
>   "Smith"  "F."    "M." "N."
>   "Truman" "Harry" "S"
>   "Chu"    "H.K."  "Jerry"
>
> -> database results
>
>   official : "Smith,F.M.N."    "Smith"   "F"     "M N"
>   patched  : "Smith,F. M. N."  "Smith"   "F."    "M. N."

Whether or not to use spaces after initials is a formatting issue that
is handled by the bibliography style. A period is enough as a
separator for the internal representation. The spaces are redundant
and bloat the data without a reason.

>
>   official : "Truman,Harry S." "Truman"  "Harry" "S"
>   patched  : "Truman,Harry S"  "Truman"  "Harry" "S"
>

Again, the patched output may not be readable by other tools using RIS.

>   official : "Chu,H.Jerry"     "Chu"     "H"     "Jerry"  (information loss!)
>   patched  : "Chu,H.K. Jerry"  "Chu"     "H.K."  "Jerry"
>

Please provide the RISX input that you used for this example. The
following input works just fine for me without any loss of data:

<author>
	<lastname>Chu</lastname>
	<firstname>H</firstname>
	<middlename>K</middlename>
	<middlename>Jerry</middlename>
</author>

(the markup is odd but RISX currently does not support something like
a "prime" given name which is not in the first position, as in
"M. Steven Miller". RIS does not support this either, so this will be
handled properly only by the forthcoming MODS-like data model)

Please note that in the official examples given above, most of the
output is correct although an improper input format was used. This is
what normalization is all about. The only problem that I've come
across while looking at these examples is that the current
implementation does not handle abbreviated double names very
well. "Schleifer,Karl-Heinz" is ok, but "Schleifer,K.-H." will cause
problems to the best of my knowledge. I'll look into this and fix it
if necessary.

> However, for some unknown reason, bibtex output pulls the fullname
> from the database and parses it again, so a small patch was needed
> here again to prevent the addition of full stops.
>

The "unknown reason" is negligence. I haven't heard positively of
anyone using the bibtex output, so this gets somewhat less attention
than it should.

>
> __________
> Convertors
>
> The "nmed2ris" convertor also fiddles with authors' names in a similar
> way. I cannot yet say more about this, sorry: I do not use the
> MED format at all and could not test modifications.
>

It is clearly stated in the manual that this program is obsolete and
will eventually be removed from the distribution. If at all, have a
look at the med2ris.pl script.

regards,
Markus

--
Markus Hoenicka
mar...@ca...
(Spam-protected email: replace the quadrupeds with "mhoenicka")
http://www.mhoenicka.de
|
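The normalization argument in this message (four spellings of the same author must end up as one database key) can be illustrated with a small Python sketch. It is only an approximation inferred from the examples quoted in the thread, not RefDB's actual C implementation: `normalize_fullname` is a hypothetical name, and the rules (a single-letter token is treated as an initial and gets a period, a space is kept only after a spelled-out name) are assumptions.

```python
import re


def normalize_fullname(au: str) -> str:
    """Reduce a 'Last, Given' string to one canonical 'Last,Given' key."""
    last, _, given = au.partition(",")
    # Periods and whitespace are both treated as separators between parts.
    tokens = [t for t in re.split(r"[.\s]+", given) if t]
    out = ""
    for tok in tokens:
        piece = tok + "." if len(tok) == 1 else tok  # "S" -> "S."
        if out and not out.endswith("."):
            out += " "  # space only after a spelled-out name
        out += piece
    return f"{last.strip()},{out}" if out else last.strip()


if __name__ == "__main__":
    variants = ["Truman,Harry S.", "Truman,Harry S",
                "Truman, Harry S", "Truman, Harry S."]
    # All four variants collapse to a single key.
    print({normalize_fullname(v) for v in variants})
```

The same sketch also shows the border case mentioned above: because the period is a separator, "Schleifer,K.-H." is tokenized into "K" and "-H" and comes back as "Schleifer,K.-H", without its final period, while "Schleifer,Karl-Heinz" passes through unchanged.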
|
From: Marc H. <mar...@fr...> - 2004-01-07 14:26:42
|
On Tue, 6 Jan 2004, Markus Hoenicka wrote:

>
> The purpose of the name mangling is to reduce all names consistently
> to the RIS input format. This is currently the common denominator of
> both RIS and RISX input until a richer data format like MODS is
> implemented.

This is one of the core issues indeed (thanks for starting with it :-)

The idea of trying to avoid false duplicates is great, even if it will
never be 100% reliable and will always depend in part on the human
typist (think for instance about: abbreviated or full input?). But
using the RIS input format to implement it is so wrong to me that I
prefer to forget about it for the moment. It's a tradeoff, and you
surely can understand that I have a different perspective than you
about it. Knowing in advance that our opinions differ, I posted this
patch on my web page instead of in some sourceforge tracker. I can't
see any FUD in this, just a different use of your software.

In some "MODS" future, if a name reduction format/scheme that I trust
is available, then I will be happy to give it my data. Meanwhile, I
prefer to keep it intact, since it does not fit into the RIS format.

> If the name mangling is not consistent, then it is a bug
> that needs to be fixed, not a feature that needs to be removed.

Great! Unfortunately, if the syntax of the target format (RIS) is
flawed from the start, you cannot achieve full consistency, whatever
your efforts are. Moreover, even "half-consistency" becomes harder and
prone to overcomplicated code and bugs, as we see.

> The bottom line is: if you supply your RIS data according to the RIS
> input format, they won't be fiddled with at all. If you use a
> different format, e.g. by leaving out periods or by adding random
> spaces, RefDB attempts to mangle the data until they fit the RIS input
> format. This works in many cases, but may fail in border cases.

This is crystal clear. Now my point: I care about border cases, and I
don't care about false duplicates.
So I disable mangling. Simple! This is a bit exaggerated, but you get
the point.

By the way, about border cases:
<http://catb.org/~esr/writings/taoup/html/ch01s06.html>

  Rule of Repair: Repair what you can -- but when you must fail, fail
  noisily and as soon as possible.

  Software should be transparent in the way that it fails, as well as
  in normal operation. It's best when software can cope with
  unexpected conditions by adapting to them, but the worst kinds of
  bugs are those in which the repair doesn't succeed and the problem
  quietly causes corruption that doesn't show up until much later.

Sorry, I do not have so many different... references :-)

> The important thing to understand is that the dots and spaces used in the
> RIS input format do not have anything to do with the final
> representation of a name in a formatted bibliography.
> The sole purpose of the dots and spaces is to separate the name
> parts in order to tell the parser where to chop.

The important thing to understand is that dots may be meaningful to
some author name in some language (including english), so they are
not far from the worst separator ever.

> You could use slashes or question marks just as well.

If the designer of the RIS format had had some clue, he could have
spared us a lot of discussion and time.

> As it is the job of a bibliography software to output the
> author names in all possible formatting variations, it is essential
> not to store pre-formatted data in the database.
> However, it may be useful (see below) to store pre-parsed data.

Great, something we agree about! :-)

> The same principle basically applies to the RISX input
> format. However, the RISX format provides separate elements for the
> name parts, so there is no need for textual separators at all. There
> is no point to enter a middle initial as
> <middlename>B.</middlename>. The middle initial is "B", not "B.". "B."

The middlename may be "B" without being an initial.
More generally, the existence or the non-existence of the dot may be
information that some refdb user does not want to lose, at least not
in the database (even if he does not care about some formatted
output).

> is a representation of a middle name which is used in some
> bibliography styles (others don't use the dot or leave out the middle
> name altogether) and can be trivially generated from "B". Therefore, a
> <middlename>B</middlename> is all you need. If RefDB detects the
> superfluous dot, it will remove it.

I am really hopeless about making you understand how and why I
disagree with: "the superfluous dot". Can't you just accept it as a
fact? I also disagree with the "middlename" concept, but this was
another story :-)

> This is the key point why we have to argue at all. You do not
> understand that the database does not contain a formatted string that
> shows how you would like to see your name printed on a piece of
> paper.
> The database contains the name parts, plus a normalized
> representation for speeding up queries that happens to look like some
> formatted representation. When creating a bibliography, RefDB then has
> to assemble the name parts in a fashion that matches the requirements
> of the publisher.
> It is irrelevant how the cited author or the author
> writing the paper would like to represent that name.

You really do not understand that, if some information is lost in this
great process, whatever its noble purpose is, some author may _never_
see his name printed as he would like to, even when some stylesheets
allow it.

I will suggest in a next message (this one is already too long and too
chaotic, and I still have to think a bit about it) a better solution
that may please everyone (no more trade-off). Assuming you understand
that I have slightly different refdb needs, so we can discuss it.

> > __________________________
> > Modifications to RIS input
> > (i.e., "addref -t ris")
> >
>
> [...]
>
> > RIS input examples
> >
> >   Smith, F.M.N.
> >   Chu, H.K. Jerry
> >   Truman, Harry S
> >
> > -> database results
> >
> >   official : "Smith,F.M.N."    "Smith"   "F"       "M N"
> >   patched  : "Smith,F.M.N."    "Smith"   "F.M.N."
> >
> >   official : "Chu,H.K.Jerry"   "Chu"     "H"       "K Jerry "
> >   patched  : "Chu,H.K.Jerry"   "Chu"     "H.K.Jerry"
> >
> >   official : "Truman,Harry S." "Truman"  "Harry"   "S "
> >   patched  : "Truman,Harry S"  "Truman"  "Harry S"
> >

> Please note that the last output of the patched version does not
> follow the RIS specs, therefore it is not clear whether RefMan,
> EndNote and the like import this properly.

Thanks for this precision. I personally do not care. I obviously never
pretended to stay compatible with a format while arguing it is
flawed. Once again, that's the reason I did not even try to put this
patch in some sourceforge tracker.

> As stated above, you should not use periods anyway as they are not
> required. Following this simple rule will make most of your complaints
> obsolete.

Unfortunately, this simple "no-periods" rule is not acceptable to me.
Please do not forget to put it in the documentation, I really think it
is important.

> > RISX input examples
> >
> >   "Smith"  "F."    "M." "N."
> >   "Truman" "Harry" "S"
> >   "Chu"    "H.K."  "Jerry"
> >
> > -> database results
> >
> >   official : "Smith,F.M.N."    "Smith"   "F"     "M N"
> >   patched  : "Smith,F. M. N."  "Smith"   "F."    "M. N."

> Whether or not to use spaces after initials is a formatting issue that
> is handled by the bibliography style. A period is enough as a
> separator for the internal representation. The spaces are redundant
> and bloat the data without a reason.

A period is not a decent separator, since it may be part of user
data. Period.

> >   official : "Chu,H.Jerry"     "Chu"     "H"     "Jerry"  (information loss!)
> >   patched  : "Chu,H.K. Jerry"  "Chu"     "H.K."  "Jerry"
> >
>
> Please provide the RISX input that you used for this example.

It's just above (below "RISX input").
I used double quotes " instead of XML <tags>, to avoid clutter. > The following input works just fine for me without any loss of data: > > <author> > =09<lastname>Chu</lastname> > =09<firstname>H</firstname> > =09<middlename>K</middlename> > =09<middlename>Jerry</middlename> > </author> It does not work, because the firstname is: "H.K.", while the nickname is "Jerry" (a "nickname" which is by the way a bit far from a so-called "middlename"... anyway) It seems you cannot express "H.K." with the RIS syntax, since it uses the period as a separator. What we see here, is the combination of a culture-specific concept (middlename), with a flawed syntax (period as separator). Maybe you should inform the author he mistyped his name, since it does not conform to the RIS syntax. And oh no, please do not tell me about the ugly and overcomplicated: "H.-K."... > Please note that in the official examples given above, most of the > output is correct although an improper input format was used. This > is what normalization is all about. You explained just above that the main and noble purpose of normalization is to avoid false duplicates. But here you go much further since you: - normalize the data from the typist completely and irreversibly, and so the output. - even ask him to make his _input_ RIS-compliant. > The only problem that I've come across while looking at these > examples is that the current implementation does not handle > abbreviated double names very well. "Schleifer,Karl-Heinz" is ok, > but "Schleifer,K.-H." will cause problems to the best of my > knowledge. I'll look into this and fix it if necessary. My patch does a simple fix to this: it drops the period as a separator, using only spaces. That's all. I know it's not RIS-compliant anymore, but I do not care, since I never used it and will never since it=A0is flawed. And I manage false duplicates by hand, which admittedly sucks, but hey, this is only a one-page long patch. 
By the way, the only specification about RIS names syntax I could find is here: <http://www.refman.com/support/risformat_tags_02.asp> and it says nothing about periods nor middlenames. Do you have a better reference? and... publically available? Thanks for the time to answer, and thanks again for refdb. Quoting you to conclude: > Otherwise this is an example of the beauty of free software. If you > code this for yourself, everyone can have it his way. Cheers, Marc. |
|
From: Markus H. <mar...@mh...> - 2004-01-07 17:56:02
|
Hi,

Marc Herbert writes:
> using the RIS input format to implement it is so wrong to me that I
> prefer to forget about it for the moment. It's a tradeoff, and you

Fine with me. The MODS-based data format will allow more flexibility
in marking up names, but this does not obviate the need to normalize
the names and the need to stick to an input format. The main
difference that I see is a full support for "prime given" vs. "given"
names regardless of their position. This will hold the distinction
between what's currently called first and middle names, but without
implying any sequence.

> > The bottom line is: if you supply your RIS data according to the RIS
> > input format, they won't be fiddled with at all. If you use a
> > different format, e.g. by leaving out periods or by adding random
> > spaces, RefDB attempts to mangle the data until they fit the RIS input
> > format. This works in many cases, but may fail in border cases.
>
> This is crystal clear. Now my point: I care about border cases, and I
> don't care about false duplicates. So I disable mangling. Simple! This
> is a bit exaggerated, but you get the point.

No, you don't need to disable mangling. You simply have to supply the
names according to the RIS specs, then they won't be mangled. And if
you start to use the extended notes, you'll probably start to worry
about duplicates in the author table.

> > The important thing to understand is that the dots and spaces used in the
> > RIS input format do not have anything to do with the final
> > representation of a name in a formatted bibliography.
> >
> > The sole purpose of the dots and spaces is to separate the name
> > parts in order to tell the parser where to chop.
>
> The important thing to understand is that dots may be meaningful to
> some author name in some language (including english), so they are
> the not far from the worst separator ever.

I'll await your examples. Abbreviated middle names in the
angloamerican culture, so-called middle initials, are not an example
for this. An initial is a capital letter by definition. You may
represent your middle name in formatted output by appending a dot to
the initial, but you don't have to. You can leave out the dot, or
spell out your middle name. The initial is the data, the initial plus
the dot is one of several possible representations of your middle
name, i.e. it contains formatting information that does not belong
into a database.

> The middlename maybe "B" without being an initial. More generally,
> the existence or the non-existence of the dot maybe an information
> that some refdb user does not want to lose, at least not in the
> database (even if he does not care about some formatted output).

The dot is no information. It is formatting. Please separate data from
formatting. Roosevelt's middle name was not "D.", therefore "D."
cannot appear as a piece of data in the database. Roosevelt's middle
name was "Delano". "D." is one of several possible ways to format his
middle name. The dot does not convey any additional information even
if you know only the initial and not the full name. The border cases
like names that consist of a single letter (any examples?) will be
handled gracefully only in an XML-based input format like MODS - by
providing an appropriate attribute, not by fiddling with dots.

> Thanks for this precision. I personally do not care. I obviously never
> pretended to stay compatible with a format while arguing it is flawed.
> Once again, that's the reason I did not even tried to put this patch
> in some sourceforge tracker.

I have to care as one of the goals of RefDB was to implement a
reference manager that can exchange data with commercial tools.

> > Whether or not to use spaces after initials is a formatting issue that
> > is handled by the bibliography style. A period is enough as a
> > separator for the internal representation. The spaces are redundant
> > and bloat the data without a reason.
>
> A period is not a decent separator, since it may be part of user data.
> Period.

A period is either a textual separator (in an input format) or
formatting (in a printed representation of the name), but no user
data.

> > The following input works just fine for me without any loss of data:
> >
> > <author>
> >   <lastname>Chu</lastname>
> >   <firstname>H</firstname>
> >   <middlename>K</middlename>
> >   <middlename>Jerry</middlename>
> > </author>
>
> It does not work, because the firstname is: "H.K.", while the nickname
> is "Jerry" (a "nickname" which is by the way a bit far from a
> so-called "middlename"... anyway)
>
> It seems you cannot express "H.K." with the RIS syntax, since it uses
> the period as a separator. What we see here, is the combination of a
> culture-specific concept (middlename), with a flawed syntax (period as
> separator). Maybe you should inform the author he mistyped his name,
> since it does not conform to the RIS syntax. And oh no, please do not
> tell me about the ugly and overcomplicated: "H.-K."...

This depends on how this name spells out. I know he's Chinese but
would it spell "Hans Karl" or "Hans-Karl"? And no, nicknames are no
part of RIS, but I haven't seen a nickname in a citation either. And
again, RefDB will not support names that can't be expressed in RIS
syntax until a MODS-based data format is implemented.

> > Please note that in the official examples given above, most of the
> > output is correct although an improper input format was used. This
> > is what normalization is all about.
>
> You explained just above that the main and noble purpose of
> normalization is to avoid false duplicates. But here you go much
> further since you:
> - normalize the data from the typist completely and
>   irreversibly, and so the output.
> - even ask him to make his _input_ RIS-compliant.
>

No, I've explained this previously and will do it again: If you input
your data according to the specs, they won't be mangled. If you insist
on using a different input format, RefDB will do its best to use
these data anyway but may fail in border cases. And it is of course
mandatory to provide the input in a RIS-compliant format as the data
format is based on RIS. I'm surprised that this seems new to you.

> > The only problem that I've come across while looking at these
> > examples is that the current implementation does not handle
> > abbreviated double names very well. "Schleifer,Karl-Heinz" is ok,
> > but "Schleifer,K.-H." will cause problems to the best of my
> > knowledge. I'll look into this and fix it if necessary.
>
> My patch does a simple fix to this: it drops the period as a
> separator, using only spaces. That's all. I know it's not
> RIS-compliant anymore, but I do not care, since I never used it and
> will never since it is flawed. And I manage false duplicates by hand,
> which admittedly sucks, but hey, this is only a one-page long patch.

It's fixed in CVS.

regards,
Markus

--
Markus Hoenicka
mar...@ca...
(Spam-protected email: replace the quadrupeds with "mhoenicka")
http://www.mhoenicka.de
|
From: Bruce D'A. <bd...@fa...> - 2004-01-07 18:00:55
|
On Jan 7, 2004, at 9:26 AM, Marc Herbert wrote:

> You really do not understand that, if some information is lost in this
> great process, whatever its noble purpose is, some author may _never_
> see his name printed as he would like to, even when some stylesheets
> allow it.

The issues here are rather larger and more complicated than they
appear on the face of it, and I think it's best to recognize this on
both sides of the issue.

One way to highlight this is to ask this simple question:

How do you -- the person entering name data -- know exactly how the
author intends their name to be represented?

It's rarely adequate to look at a heading in an article, or in a
reference list, for example.

In an ideal world, there's be a central repository with definitive
listings, available as a web service. Until that day, though (maybe
never), this will always be a difficult issue.

Bruce
|
From: Markus H. <mar...@mh...> - 2004-01-07 19:49:25
|
Bruce D'Arcus writes:
> In an ideal world, there's be a central repository with definitive
> listings, available as a web service. Until that day, though (maybe
> never), this will always be a difficult issue.
>

I'm afraid not even this would help. It's not that the publishers
wouldn't care about the names of the cited authors, but to them a
consistent bibliography formatting is more important than individual
wishes.

regards,
Markus

--
Markus Hoenicka
mar...@ca...
(Spam-protected email: replace the quadrupeds with "mhoenicka")
http://www.mhoenicka.de
|
From: Marc H. <mar...@fr...> - 2004-01-09 17:04:56
|
On Wed, 7 Jan 2004, Marc Herbert wrote:

> > The database contains the name parts, plus a normalized
> > representation for speeding up queries that happens to look like some
> > formatted representation. When creating a bibliography, RefDB then has
> > to assemble the name parts in a fashion that matches the requirements
> > of the publisher.
> >
> > It is irrelevant how the cited author or the author
> > writing the paper would like to represent that name.

It is irrelevant to authoritarian stylesheets.

> if some information is lost in this great process, whatever its
> noble purpose is, some author may _never_ see his name printed as he
> would like to, even when some stylesheets allow it.

Please understand the arguments below as a very general discussion
about "what should be stored in a bibliographic database" and NOT
anymore as a discussion about "what should refdb do" or "what do you
think about the RIS format". Or else please start a new thread. Thanks
in advance.

Let me try to sum up the issues.

- there is a strong need for a "normalized" representation of names,
  to avoid false duplicates and enhance results of queries.

- some formatting tools/stylesheets "normalize" your
  names, deciding if and where you should put periods, dashes,
  initializing or not, etc.

- some less authoritarian publishers/formatting conventions leave more
  freedom about this, in order to please authors and grant them the
  right to write their (possibly "weird") name as they want.

I think it's technically possible to please everyone, by isolating
issues. Let's take the example of this problematic name:
(<http://citeseer.nj.nec.com/context/153368/0>)

  Chu, H.K. Jerry (that's the precise way he writes it himself)

Depending on the typist (errare humanum est), the given name becomes:

 - HK Jerry
 - H.-K. Jerry
 - Hsiao Keng Jerry
 - Hsiaokeng Jerry
 - etc.

[of course, it could be mistyped much more severely, and then the
reasoning below will be less efficient/interesting. But anyway nothing
will work for severe cases except firing the typist].

The database, being unable to tell which is the "right" spelling, or
worse, not even being able in some cases to tell whether all these
writings designate the same person, should carefully preserve every
character from every typist. So the database has no choice but to
store the input "as is". Preferably pre-parsed, but without any
character lost or added. All these inputs become (unfortunately, but
what can you do?) different authors.

Meanwhile, a "clever" algorithm that is aware of the most common
name-typing mistakes in our culture computes a "normalized" (or
"reduced", or "projected") representation of the given name for each
record. So only two different ones here:

 - hkjerry
 - hsiaokengjerry

Such a simple algorithm could be for instance:

 - throw away a set of characters (period, hyphen, apostrophe, space,...)
 - lowercase all characters
 - dump all diacritics
 - ...

This "projected name" is stored in the author record, besides the
typist input. It is _indexed_ and used to perform queries. It can be
used to detect false duplicates easily and efficiently, including at
input time! (i.e. "Don't you think you should rather write this name
this way?")

The sample algorithm above is just... an example. Obviously, the
"cleverness" of the algorithm deserves more discussion (and another
thread). This algorithm could be easily configurable, for instance
depending on cultural specificities.

Even better, there could be several projections used by the database,
covering different scenarios. For instance, a concurrent
"abbreviating" algorithm that takes only capitals could run and give:

 - HKJ

thus collapsing many more different inputs, and offering the client a
very efficient "search using initials" additional feature.

Stylesheets can pick up all the information they need (preferably
pre-parsed), and are free to normalize names as they want to at
publishing time.

Comments?
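The projection steps listed in the message above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `project_name` and `project_initials` and the exact set of dropped characters are hypothetical, and nothing here is RefDB code.

```python
import unicodedata

# Characters thrown away before comparison. The exact set is an
# assumption; the message only lists period, hyphen, apostrophe, space.
DROPPED = set(".-' ")


def project_name(given: str) -> str:
    """Reduce a given-name string to a key for duplicate detection."""
    # Decompose accented letters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", given)
    no_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Throw away separators and lowercase what is left.
    kept = "".join(c for c in no_marks if c not in DROPPED)
    return kept.lower()


def project_initials(given: str) -> str:
    """Second projection: keep only the capital letters."""
    return "".join(c for c in given if c.isupper())


if __name__ == "__main__":
    typed_variants = [
        "H.K. Jerry",
        "HK Jerry",
        "H.-K. Jerry",
        "Hsiao Keng Jerry",
        "Hsiaokeng Jerry",
    ]
    for typed in typed_variants:
        # The typed string is what gets stored; the keys are only indexes.
        print(repr(typed), project_name(typed), project_initials(typed))
```

The typed string stays untouched in this sketch; the two keys would be stored next to it and indexed. Note that the capitals-only key does not collapse every variant ("Hsiaokeng Jerry" has only two capitals), which is one reason the choice of projection deserves its own discussion.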
|
From: Bruce D'A. <bd...@fa...> - 2004-01-09 17:42:50
|
Oh, and another thing:

It'd be nice if in this discussion we could focus on how to handle this
in XML. Proper markup obviates the need for any mangling, after all.

One person on the MODS list suggested this as a way to deal with a name
like "S. Michael Smith":

  <namePart type="given">Steven</namePart>
  <namePart type="primegiven">Michael</namePart>
  <namePart type="family">Smith</namePart>

I suggested as an alternative:

  <namePart type="given" level="2">Steven</namePart>
  <namePart type="given" level="1">Michael</namePart>
  <namePart type="family">Smith</namePart>

There are others still, e.g.:

  <namePart>Steven</namePart>
  <namePart type="given">Michael</namePart>
  <namePart type="family">Smith</namePart>

While not ideal, it does introduce a distinction between the two names
that could be subject to processing.

Here's a similar example I posted on my blog using vcard in an RDF
representation:

  <bqs:Person rdf:parseType="Resource">
    <vCard:N rdf:parseType="Resource">
      <vCard:Family>Snyders</vCard:Family>
      <vCard:Given>D</vCard:Given>
      <vCard:Other>J</vCard:Other>
    </vCard:N>
    <vCard:ORG rdf:parseType="Resource">
      <vCard:Orgname>
        Vanderbilt University School of Medicine
      </vCard:Orgname>
      <vCard:Orgunit>Department of Pharmacology</vCard:Orgunit>
    </vCard:ORG>
  </bqs:Person>

Bruce
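To make "subject to processing" concrete, here is a minimal Python sketch that reads the second markup variant from the message above and renders "S. Michael Smith". The `level` attribute is the proposal made in the message, not an established MODS attribute; the wrapping `<name>` element and the `render` function are assumptions added so that the fragment parses as one document.

```python
import xml.etree.ElementTree as ET

# The three namePart elements from the message, wrapped in a single
# root element (an assumption) so that they form a parseable document.
SAMPLE = """
<name>
  <namePart type="given" level="2">Steven</namePart>
  <namePart type="given" level="1">Michael</namePart>
  <namePart type="family">Smith</namePart>
</name>
"""


def render(name_xml: str) -> str:
    """Spell out the level-1 given name and abbreviate the other ones."""
    root = ET.fromstring(name_xml.strip())
    given, family = [], []
    for part in root.findall("namePart"):
        text = (part.text or "").strip()
        if not text:
            continue
        if part.get("type") == "family":
            family.append(text)
        elif part.get("type") == "given":
            # Only the prime given name (level 1) is written in full.
            if part.get("level") == "1":
                given.append(text)
            else:
                given.append(text[0] + ".")
    return " ".join(given + family)


if __name__ == "__main__":
    print(render(SAMPLE))
```

The stored data keeps "Steven" in full; whether it appears as "S.", "S" or "Steven" is decided only when rendering.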
|
From: Markus H. <mar...@mh...> - 2004-01-10 01:28:46
|
Bruce D'Arcus writes:
> Oh, and another thing:
>
> It'd be nice if in this discussion we could focus on how to handle this
> in XML. Proper markup obviates the need for any mangling, after all.
>

Well put. I'd prefer to spend my time implementing the MODS support
rather than with fiddling with the RIS format that won't get any
better than it is.

regards,
Markus

--
Markus Hoenicka
mar...@ca...
(Spam-protected email: replace the quadrupeds with "mhoenicka")
http://www.mhoenicka.de
|
From: Bruce D'A. <bd...@fa...> - 2004-01-14 19:19:28
|
On Jan 14, 2004, at 12:29 PM, Marc Herbert wrote:

>>> It'd be nice if in this discussion we could focus on how to handle
>>> this
>>> in XML. Proper markup obviates the need for any mangling, after all.
>
>> Well put. I'd prefer to spend my time implementing the MODS support
>> rather than with fiddling with the RIS format that won't get any
>> better than it is.
>
> Please, am I allowed to spend _my_ time?

Sure, but it would help if your energy and interest in this matter
could contribute to an ideal solution to this thorny problem. Part of
the problem with RIS is the tagged format itself. XML solves this
problem, if properly used, so any real solution is going to have to lie
there IMHO, and in an improved data model.

Bruce
|
From: Markus H. <mar...@mh...> - 2004-01-10 01:28:31
|
Marc Herbert writes:
> It is irrelevant to authoritarian stylesheets.
>

Could you provide an example of a publisher in the natural sciences or
anywhere else whose author name formatting recommendations read: "Use
whatever the bearer of that name prints on his letterhead?" Besides
the difficulty to even obtain this information for all 100+ author
names that an average bibliography carries, I'm not aware of any
publisher allowing this. The result would be bibliography entries
like:

F.D. Roosevelt, Truman, Harry S., Chun Wu, Dwight D Eisenhower,
Schmidt HHHW: A paper about something. Science 56:456, 2000.

Do you think this is acceptable to anyone? Do you think this is
readable? Is Chun the given name or the family name?

> - there is a strong need for a "normalized" representation of names,
>   to avoid false duplicates and enhance results of queries.

Agreed.

> - some formatting tools/stylesheets "normalize" your
>   names, deciding if and where you should put periods, dashes,
>   initializing or not, etc.

This is what I experience with all papers and books that I read at
work.

> - some less authoritarian publishers/formatting conventions leave more
>   freedom about this, in order to please authors and grant them the
>   right to write their (possibly "weird") name as they want.
>

I've never seen this in real life, and I'm glad I didn't.

> I think it's technically possible to please everyone, by isolating
> issues. Let's take the example of this problematic name:
> (<http://citeseer.nj.nec.com/context/153368/0>)
>
>   Chu, H.K. Jerry (that's the precise way he writes it himself)
>

In a useful input format, this would turn into something like:

<name>
  <familyname>Chu</familyname>
  <givenname type="abbrev">H</givenname>
  <givenname type="abbrev">K</givenname>
  <primegivenname>Jerry</primegivenname>
</name>

This is what the database needs to know in order to do something
useful with the name. It does not care whether that guy prefers either
of these:

Chu, H. K. Jerry
Chu,H.K.Jerry
H.K.Jerry Chu

or whatever. I'm getting tired of this, but this is about
formatting. The XML example above is about input data.

regards,
Markus

--
Markus Hoenicka
mar...@ca...
(Spam-protected email: replace the quadrupeds with "mhoenicka")
http://www.mhoenicka.de
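The split between stored name parts and printed form that this message argues for can be illustrated with a short Python sketch. Everything in it is hypothetical (the `Name` class, the function names, the `gap` and `comma` parameters); it only shows that the three printed variants listed above can be generated from one set of dot-free parts, and it makes no claim about how RefDB assembles names.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Name:
    """Name parts as in the XML example: no dots, no spacing."""
    family: str
    initials: List[str] = field(default_factory=list)  # abbreviated given names
    prime_given: str = ""


def given_block(name: Name, gap: str) -> str:
    """Add a dot to every initial, then join all given parts with `gap`."""
    tokens = [initial + "." for initial in name.initials]
    if name.prime_given:
        tokens.append(name.prime_given)
    return gap.join(tokens)


def family_first(name: Name, comma: str = ", ", gap: str = " ") -> str:
    """Family name, separator, then the given names."""
    return name.family + comma + given_block(name, gap)


def given_first(name: Name, gap: str = " ") -> str:
    """Given names, a space, then the family name."""
    return given_block(name, gap) + " " + name.family


if __name__ == "__main__":
    chu = Name(family="Chu", initials=["H", "K"], prime_given="Jerry")
    print(family_first(chu))                     # spaced, family first
    print(family_first(chu, comma=",", gap=""))  # compact, family first
    print(given_first(chu, gap=""))              # compact, given first
```

One limitation of the sketch: it always prints the initials before the prime given name, which is simply the order used in the XML example and not a general rule.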
|
From: Marc H. <mar...@fr...> - 2004-01-14 17:20:36
|
On Sat, 10 Jan 2004, Markus Hoenicka wrote:
> Could you provide an example of a publisher in the natural sciences or
> anywhere else whose author name formatting recommendations read: "Use
> whatever the bearer of that name prints on his letterhead?" Besides
> the difficulty to even obtain this information for all 100+ author
> names that an average bibliography carries, I'm not aware of any
> publisher allowing this. The result would be bibliography entries
> like:
>
> F.D. Roosevelt, Truman, Harry S., Chun Wu, Dwight D Eisenhower,
> Schmidt HHHW: A paper about something. Science 56:456, 2000.
>
> Do you think this is acceptable to anyone? Do you think this is
> readable? Is Chun the given name or the family name?
This discussion is and has always been about given names and so-called
"middlenames" only. That was even stated in one "Subject:". I find it
somewhat dishonest to suddenly pretend that I want to dump the whole
difference between family name(s) and given name(s). Of course this
would be ridiculous. I slightly reformulate your question as if it
answered my messages:
> Could you provide an example of a publisher in the natural sciences
> or anywhere else whose author name formatting recommendations read:
> "Use whatever the bearer of that GIVEN name prints on his
> letterhead?"                     ^^^^^
I did some quick statistics on BibTeX stylesheets, and some related
investigation, to try to answer this question. BibTeX is the de-facto
format & tool to manage bibliographies with LaTeX. LaTeX is this small
typesetting system used by millions of people. BibTeX does not know
what a "middlename" is (just like most formats). It knows only 4
parts:
 - "von"
 - last name(s)
 - first name(s)
 - suffix (e.g. "Jr")
All BibTeX stylesheets I have seen format the _given_ name(s) in one
of two ways:
 - "as is" from the BibTeX file (their "database")   {ff}
 - abbreviate it and period-ize it                   {f.}
The BibTeX code for printing the given name(s) "as is" is {ff}, while
the code for abbreviating the given name(s) is {f.} or similar. See:
<http://www.eeng.dcu.ie/local-docs/btxdocs/btxhak/btxhak/node5.html>
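A rough Python sketch of the difference between the two codes follows. It is a simplification made for illustration: the function names are hypothetical, tokens are split on whitespace only, and real BibTeX name formatting also deals with hyphens, ties and braces, so the output of an actual style may differ.

```python
def format_ff(given: str) -> str:
    """Like {ff}: return the given name(s) exactly as typed."""
    return given


def format_f_dot(given: str) -> str:
    """Roughly like {f.}: first letter of each token, plus a period."""
    # Simplification: tokens are separated by whitespace only.
    return " ".join(token[0] + "." for token in given.split())


if __name__ == "__main__":
    for typed in ["Harry S", "Karl-Heinz", "H.K. Jerry"]:
        print(repr(typed), "->", repr(format_ff(typed)), repr(format_f_dot(typed)))
```

In this simplified version, "H.K. Jerry" comes out of the abbreviating path as "H. J.", so the "K" typed by the author is gone, while the as-is path returns the string unchanged.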
Basically, the {ff} code means that the stylesheet does not want
to format the given name(s), maybe because it thinks this is too
error-prone. It just trusts the typist. The question is: did I make up
category {ff}?
The standard Unix LaTeX installation I have (teTeX) is shipped with 10
stylesheets (excluding variations). Those are the very basic
bibliographic stylesheets used by all people that do not care to
design their own. Half of them are {ff}: 5 stylesheets (plain,
alpha, unsrt, amsplain and amsalpha) format the given name just as
given (i.e., they don't format it), while the 5 remaining ones
(ieeetr, abbrv, siam, apalike, acm) abbreviate it.
Of course, it's possible that all {ff} stylesheets are minor ones,
while all the ones used by professional publications are not. So I
looked for some precise examples of {ff} BibTeX stylesheets in the
LaTeX archive (CTAN) and also in this compilation:
<http://www.lecb.ncifcrf.gov/~toms/latex.html>
I found at least the following {ff} stylesheets for "real"
publications:
- American Mathematical Society
- American Journal of Human Genetics
- Methods in Enzymology
- Journal of Neuroscience
Besides BibTeX stylesheets, I also found some other real world
examples of lack of given name(s) formatting:
- All Elsevier's International Federation of Automatic Control (IFAC)
journals (stylesheet ifac.bst)
<http://authors.elsevier.com/getting_published.html?dc=QG3>
- The German DIN 1505 standard seems to leave people free to decide how
their given name(s) should be written.
<http://www.phil-fak.uni-duesseldorf.de/ie/competence/09_schriftkom/bibdin.html#top>
- The MLA (Modern Language Association) style seems quite popular and
does not "format" the given name(s) either. See BibTeX file "mla.bst" and
<http://www.english.uiuc.edu/cws/wworkshop/MLA/singleauthor.htm>
"The author's name should be given as it is listed on the title page
of the text."
> > - some less authoritarian publishers/formatting conventions leave more
> > freedom about this, in order to please authors and grant them the
> > right to write their (possibly "weird") name as they want.
> I've never seen this in real life, and I'm glad I didn't.
Again, I was of course talking about GIVEN names only; this has not
changed since the start of the discussion.
Conclusion: I would of course never pretend that the majority of
publications let people write their _given_ name(s) as they want. I
just don't know. But a short investigation seems to show that at least
a non-negligible number of non-negligible journals do.
|