[Refdb-users] making reality and name normalization coexist
From: Marc H. <mar...@fr...> - 2004-01-09 17:04:56
On Wed, 7 Jan 2004, Marc Herbert wrote:

> > The database contains the name parts, plus a normalized
> > representation for speeding up queries that happens to look like some
> > formatted representation. When creating a bibliography, RefDB then has
> > to assemble the name parts in a fashion that matches the requirements
> > of the publisher.
> >
> > It is irrelevant how the cited author or the author
> > writing the paper would like to represent that name.

It is irrelevant to authoritarian stylesheets.

> if some information is lost in this great process, whatever its
> noble purpose is, some author may _never_ see his name printed as he
> would like to, even when some stylesheets allow it.

Please understand the arguments below as a very general discussion about
"what should be stored in a bibliographic database" and NOT anymore as a
discussion about "what should RefDB do" or "what do you think about the
RIS format". Otherwise, please start a new thread. Thanks in advance.

Let me try to sum up the issues:

- there is a strong need for a "normalized" representation of names, to
  avoid false duplicates and to improve the results of queries.
- some formatting tools/stylesheets "normalize" your names, deciding if
  and where you should put periods or dashes, whether to reduce given
  names to initials, etc.
- some less authoritarian publishers/formatting conventions leave more
  freedom about this, in order to please authors and grant them the
  right to write their (possibly "weird") name as they want.

I think it's technically possible to please everyone, by isolating the
issues. Let's take the example of this problematic name
(<http://citeseer.nj.nec.com/context/153368/0>):

  Chu, H.K. Jerry

(that's the precise way he writes it himself). Depending on the typist
(errare humanum est), the given name becomes:

- HK Jerry
- H.-K. Jerry
- Hsiao Keng Jerry
- Hsiaokeng Jerry
- etc.

[Of course, he could be much more severely mistyped, and then the
reasoning below will be less efficient/interesting. But then nothing
will work for severe cases except firing the typist.]

The database, being unable to tell which is the "right" spelling, or
worse, in some cases not even being able to tell whether all these
spellings designate the same person, should carefully preserve every
character from every typist. So the database has no choice but to store
the input "as is", preferably pre-parsed, but without any character lost
or added. All these inputs become (unfortunately, but what can you do?)
different authors.

Meanwhile, a "clever" algorithm that is aware of the most common
name-typing mistakes in our culture computes a "normalized" (or
"reduced", or "projected") representation of the given name for each
record. So there are only two different ones here:

- hkjerry
- hsiaokengjerry

Such a simple algorithm could for instance:

- throw away a set of characters (period, hyphen, apostrophe, space, ...)
- lowercase all characters
- drop all diacritics
- ...

This "projected name" is stored in the author record, alongside the
typist's input. It is _indexed_ and used to perform queries. It can be
used to detect false duplicates easily and efficiently, including at
input time! (e.g. "Don't you think you should rather write this name
this way?")

The sample algorithm above is just... an example. Obviously, the
"cleverness" of the algorithm deserves more discussion (and another
thread). This algorithm could easily be made configurable, for instance
depending on cultural specificities. Even better, there could be several
projections used by the database, covering different scenarios.
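To make this concrete, here is a minimal sketch of such a projection in
Python. It is purely illustrative, not existing RefDB code; the function
name project_name and the exact set of discarded characters are my own
assumptions:

  import unicodedata

  # Characters thrown away by the projection (period, hyphen,
  # apostrophe, space, ...)
  DISCARD = set(".-' ")

  def project_name(given_name):
      """Collapse a typist's input into a normalized key for queries."""
      # Decompose accented characters, then drop the combining marks
      # (i.e. the diacritics).
      decomposed = unicodedata.normalize("NFKD", given_name)
      stripped = "".join(c for c in decomposed
                         if not unicodedata.combining(c))
      # Throw away the discarded characters and lowercase the rest.
      return "".join(c for c in stripped if c not in DISCARD).lower()

  # Every typist's input is kept verbatim in the records; only the
  # projection is shared, so the variants above collapse to two keys:
  for v in ["H.K. Jerry", "HK Jerry", "H.-K. Jerry",
            "Hsiao Keng Jerry", "Hsiaokeng Jerry"]:
      print(v, "->", project_name(v))
  # H.K. Jerry       -> hkjerry
  # HK Jerry         -> hkjerry
  # H.-K. Jerry      -> hkjerry
  # Hsiao Keng Jerry -> hsiaokengjerry
  # Hsiaokeng Jerry  -> hsiaokengjerry

The projected column is what would get indexed, and the same lookup
could drive the input-time "don't you mean..." warning mentioned above.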
For instance, a concurrent "abbreviating" algorithm that keeps only the
capitals could run in parallel and give:

- HKJ

thus collapsing many more different inputs, and offering the client a
very efficient additional "search using initials" feature (see the small
sketch at the end of this message).

Stylesheets can pick up all the information they need (preferably
pre-parsed), and are free to normalize names as they want to at
publishing time.

Comments?
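For completeness, the initials-only projection mentioned above could be
as simple as the following (again purely illustrative; the helper name
is made up):

  def initials_only(name):
      """Keep only the capitals, e.g. for a "search using initials" feature."""
      return "".join(c for c in name if c.isupper())

  print(initials_only("H.K. Jerry"))        # HKJ
  print(initials_only("Hsiao Keng Jerry"))  # HKJ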