refdb-users Mailing List for RefDB (Page 102)
Status: Beta
Brought to you by:
mhoenicka
From: Markus H. <mar...@mh...> - 2004-01-07 17:56:02

Hi,

Marc Herbert writes:
> using the RIS input format to implement it is so wrong to me that I prefer to forget about it for the moment. It's a tradeoff, and you

Fine with me. The MODS-based data format will allow more flexibility in marking up names, but this does not obviate the need to normalize the names and the need to stick to an input format. The main difference that I see is full support for "prime given" vs. "given" names regardless of their position. This will keep the distinction between what's currently called first and middle names, but without implying any sequence.

> > The bottom line is: if you supply your RIS data according to the RIS input format, they won't be fiddled with at all. If you use a different format, e.g. by leaving out periods or by adding random spaces, RefDB attempts to mangle the data until they fit the RIS input format. This works in many cases, but may fail in border cases.
>
> This is crystal clear. Now my point: I care about border cases, and I don't care about false duplicates. So I disable mangling. Simple! This is a bit exaggerated, but you get the point.

No, you don't need to disable mangling. You simply have to supply the names according to the RIS specs, then they won't be mangled. And if you start to use the extended notes, you'll probably start to worry about duplicates in the author table.

> > The important thing to understand is that the dots and spaces used in the RIS input format do not have anything to do with the final representation of a name in a formatted bibliography. The sole purpose of the dots and spaces is to separate the name parts in order to tell the parser where to chop.
>
> The important thing to understand is that dots may be meaningful to some author name in some language (including English), so they are not far from the worst separator ever.

I'll await your examples. Abbreviated middle names in the Anglo-American culture, so-called middle initials, are not an example of this. An initial is a capital letter by definition. You may represent your middle name in formatted output by appending a dot to the initial, but you don't have to. You can leave out the dot, or spell out your middle name. The initial is the data; the initial plus the dot is one of several possible representations of your middle name, i.e. it contains formatting information that does not belong in a database.

> The middlename may be "B" without being an initial. More generally, the existence or the non-existence of the dot may be information that some refdb user does not want to lose, at least not in the database (even if he does not care about some formatted output).

The dot is no information. It is formatting. Please separate data from formatting. Roosevelt's middle name was not "D.", therefore "D." cannot appear as a piece of data in the database. Roosevelt's middle name was "Delano". "D." is one of several possible ways to format his middle name. The dot does not convey any additional information even if you know only the initial and not the full name. The border cases like names that consist of a single letter (any examples?) will be handled gracefully only in an XML-based input format like MODS, by providing an appropriate attribute, not by fiddling with dots.

> Thanks for this precision. I personally do not care. I obviously never pretended to stay compatible with a format while arguing it is flawed. Once again, that's the reason I did not even try to put this patch in some sourceforge tracker.

I have to care as one of the goals of RefDB was to implement a reference manager that can exchange data with commercial tools.

> > Whether or not to use spaces after initials is a formatting issue that is handled by the bibliography style. A period is enough as a separator for the internal representation. The spaces are redundant and bloat the data without a reason.
>
> A period is not a decent separator, since it may be part of user data. Period.

A period is either a textual separator (in an input format) or formatting (in a printed representation of the name), but no user data.

> > The following input works just fine for me without any loss of data:
> >
> > <author>
> >   <lastname>Chu</lastname>
> >   <firstname>H</firstname>
> >   <middlename>K</middlename>
> >   <middlename>Jerry</middlename>
> > </author>
>
> It does not work, because the firstname is: "H.K.", while the nickname is "Jerry" (a "nickname" which is by the way a bit far from a so-called "middlename"... anyway)
>
> It seems you cannot express "H.K." with the RIS syntax, since it uses the period as a separator. What we see here, is the combination of a culture-specific concept (middlename), with a flawed syntax (period as separator). Maybe you should inform the author he mistyped his name, since it does not conform to the RIS syntax. And oh no, please do not tell me about the ugly and overcomplicated: "H.-K."...

This depends on how this name spells out. I know he's Chinese but would it spell "Hans Karl" or "Hans-Karl"? And no, nicknames are no part of RIS, but I haven't seen a nickname in a citation either. And again, RefDB will not support names that can't be expressed in RIS syntax until a MODS-based data format is implemented.

> > Please note that in the official examples given above, most of the output is correct although an improper input format was used. This is what normalization is all about.
>
> You explained just above that the main and noble purpose of normalization is to avoid false duplicates. But here you go much further since you:
> - normalize the data from the typist completely and irreversibly, and so the output.
> - even ask him to make his _input_ RIS-compliant.

No, I've explained this previously and will do it again: If you input your data according to the specs, they won't be mangled. If you insist on using a different input format, RefDB will do its best to use these data anyway but may fail in border cases. And it is of course mandatory to provide the input in a RIS-compliant format as the data format is based on RIS. I'm surprised that this seems new to you.

> > The only problem that I've come across while looking at these examples is that the current implementation does not handle abbreviated double names very well. "Schleifer,Karl-Heinz" is ok, but "Schleifer,K.-H." will cause problems to the best of my knowledge. I'll look into this and fix it if necessary.
>
> My patch does a simple fix to this: it drops the period as a separator, using only spaces. That's all. I know it's not RIS-compliant anymore, but I do not care, since I never used it and will never since it is flawed. And I manage false duplicates by hand, which admittedly sucks, but hey, this is only a one-page long patch.

It's fixed in CVS.

regards,
Markus

--
Markus Hoenicka
mar...@ca...
(Spam-protected email: replace the quadrupeds with "mhoenicka")
http://www.mhoenicka.de

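The distinction drawn in this message between stored name parts (data) and their printed form (formatting) can be illustrated with a minimal sketch. It is not RefDB code: the function `format_name` and its style names are hypothetical, chosen only to show that one set of dot-free name parts can yield several printed forms.

```python
def format_name(last, first, middles, style):
    """Build a printed form from dot-free name parts (hypothetical helper)."""
    given = [first, *middles]
    if style == "initials_dotted":
        # e.g. "F. D. Roosevelt": the dots are produced here, not stored
        return " ".join(g[0] + "." for g in given) + " " + last
    if style == "initials_plain":
        # e.g. "Roosevelt FD": same data, no dots at all
        return last + " " + "".join(g[0] for g in given)
    if style == "full":
        # e.g. "Franklin Delano Roosevelt"
        return " ".join(given) + " " + last
    raise ValueError(f"unknown style: {style}")


for style in ("initials_dotted", "initials_plain", "full"):
    print(format_name("Roosevelt", "Franklin", ["Delano"], style))
```

With only an initial stored, as in `format_name("Truman", "Harry", ["S"], "full")`, the same function returns "Harry S Truman"; whether a dot follows the "S" is decided by the style, which is the point under dispute in the thread.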
From: Marc H. <mar...@fr...> - 2004-01-07 14:26:42

On Tue, 6 Jan 2004, Markus Hoenicka wrote:

> The purpose of the name mangling is to reduce all names consistently to the RIS input format. This is currently the common denominator of both RIS and RISX input until a richer data format like MODS is implemented.

This is one of the core issues indeed (thanks for starting with it :-) The idea of trying to avoid false duplicates is great, even if it will never be 100% reliable and will always depend in part on the human typist (think for instance about: abbreviated or full input?). But using the RIS input format to implement it is so wrong to me that I prefer to forget about it for the moment. It's a tradeoff, and you surely can understand that I have a different perspective than you about it. Knowing in advance that our opinions differ, I posted this patch on my web page instead of in some sourceforge tracker. I can't see any FUD in this, just a different use of your software. In some "MODS" future, if a name reduction format/scheme that I trust is available, then I will be happy to give it my data. Meanwhile, I prefer to keep it intact, since it does not fit into the RIS format.

> If the name mangling is not consistent, then it is a bug that needs to be fixed, not a feature that needs to be removed.

Great! Unfortunately, if the syntax of the target format (RIS) is flawed from the start, you cannot achieve full consistency, whatever your efforts are. Moreover, even "half-consistency" becomes harder and prone to overcomplicated code and bugs, as we see.

> The bottom line is: if you supply your RIS data according to the RIS input format, they won't be fiddled with at all. If you use a different format, e.g. by leaving out periods or by adding random spaces, RefDB attempts to mangle the data until they fit the RIS input format. This works in many cases, but may fail in border cases.

This is crystal clear. Now my point: I care about border cases, and I don't care about false duplicates. So I disable mangling. Simple! This is a bit exaggerated, but you get the point. By the way, about border cases:

<http://catb.org/~esr/writings/taoup/html/ch01s06.html>

Rule of Repair: Repair what you can -- but when you must fail, fail noisily and as soon as possible. Software should be transparent in the way that it fails, as well as in normal operation. It's best when software can cope with unexpected conditions by adapting to them, but the worst kinds of bugs are those in which the repair doesn't succeed and the problem quietly causes corruption that doesn't show up until much later.

Sorry, I do not have so many different.. references :-)

> The important thing to understand is that the dots and spaces used in the RIS input format do not have anything to do with the final representation of a name in a formatted bibliography. The sole purpose of the dots and spaces is to separate the name parts in order to tell the parser where to chop.

The important thing to understand is that dots may be meaningful to some author name in some language (including English), so they are not far from the worst separator ever.

> You could use slashes or question marks just as well.

If the designer of the RIS format had had some clue, he could have spared us a lot of discussion and time.

> As it is the job of a bibliography software to output the author names in all possible formatting variations, it is essential not to store pre-formatted data in the database. However, it may be useful (see below) to store pre-parsed data.

Great, something we agree about! :-)

> The same principle basically applies to the RISX input format. However, the RISX format provides separate elements for the name parts, so there is no need for textual separators at all. There is no point to enter a middle initial as <middlename>B.</middlename>. The middle initial is "B", not "B.". "B."

The middlename may be "B" without being an initial. More generally, the existence or the non-existence of the dot may be information that some refdb user does not want to lose, at least not in the database (even if he does not care about some formatted output).

> is a representation of a middle name which is used in some bibliography styles (others don't use the dot or leave out the middle name altogether) and can be trivially generated from "B". Therefore, a <middlename>B</middlename> is all you need. If RefDB detects the superfluous dot, it will remove it.

I am really hopeless about making you understand how and why I disagree with: "the superfluous dot". Can't you just accept it as a fact? I also disagree with the "middlename" concept, but this was another story :-)

> This is the key point why we have to argue at all. You do not understand that the database does not contain a formatted string that shows how you would like to see your name printed on a piece of paper. The database contains the name parts, plus a normalized representation for speeding up queries that happens to look like some formatted representation. When creating a bibliography, RefDB then has to assemble the name parts in a fashion that matches the requirements of the publisher. It is irrelevant how the cited author or the author writing the paper would like to represent that name.

You really do not understand that, if some information is lost in this great process, whatever its noble purpose is, some author may _never_ see his name printed as he would like to, even when some stylesheets allow it.

I will suggest in a next message (this one is already too long and too chaotic, and I still have to think a bit about it) a better solution that may please everyone (no more trade-off). Assuming you understand that I have slightly different refdb needs, so we can discuss about it.

> > __________________________
> > Modifications to RIS input
> > (i.e., "addref -t ris")
> >
> > [...]
> >
> > RIS input examples
> >
> > Smith, F.M.N.
> > Chu, H.K. Jerry
> > Truman, Harry S
> >
> > -> database results
> >
> > official : "Smith,F.M.N." "Smith" "F" "M N"
> > patched  : "Smith,F.M.N." "Smith" "F.M.N."
> >
> > official : "Chu,H.K.Jerry" "Chu" "H" "K Jerry "
> > patched  : "Chu,H.K.Jerry" "Chu" "H.K.Jerry"
> >
> > official : "Truman,Harry S." "Truman" "Harry" "S "
> > patched  : "Truman,Harry S" "Truman" "Harry S"
>
> Please note that the last output of the patched version does not follow the RIS specs, therefore it is not clear whether RefMan, EndNote and the like import this properly.

Thanks for this precision. I personally do not care. I obviously never pretended to stay compatible with a format while arguing it is flawed. Once again, that's the reason I did not even try to put this patch in some sourceforge tracker.

> As stated above, you should not use periods anyway as they are not required. Following this simple rule will make most of your complaints obsolete.

Unfortunately, this simple "no-periods" rule is not acceptable to me. Please do not forget to put it in the documentation, I really think it is important.

> > RISX input examples
> >
> > "Smith" "F." "M." "N."
> > "Truman" "Harry" "S"
> > "Chu" "H.K." "Jerry"
> >
> > -> database results
> >
> > official : "Smith,F.M.N." "Smith" "F" "M N"
> > patched  : "Smith,F. M. N." "Smith" "F." "M. N."
>
> Whether or not to use spaces after initials is a formatting issue that is handled by the bibliography style. A period is enough as a separator for the internal representation. The spaces are redundant and bloat the data without a reason.

A period is not a decent separator, since it may be part of user data. Period.

> > official : "Chu,H.Jerry" "Chu" "H" "Jerry" (information loss!)
> > patched  : "Chu,H.K. Jerry" "Chu" "H.K." "Jerry"
>
> Please provide the RISX input that you used for this example.

It's just above (below "RISX input"). I used double quotes " instead of XML <tags>, to avoid clutter.

> The following input works just fine for me without any loss of data:
>
> <author>
>   <lastname>Chu</lastname>
>   <firstname>H</firstname>
>   <middlename>K</middlename>
>   <middlename>Jerry</middlename>
> </author>

It does not work, because the firstname is: "H.K.", while the nickname is "Jerry" (a "nickname" which is by the way a bit far from a so-called "middlename"... anyway)

It seems you cannot express "H.K." with the RIS syntax, since it uses the period as a separator. What we see here, is the combination of a culture-specific concept (middlename), with a flawed syntax (period as separator). Maybe you should inform the author he mistyped his name, since it does not conform to the RIS syntax. And oh no, please do not tell me about the ugly and overcomplicated: "H.-K."...

> Please note that in the official examples given above, most of the output is correct although an improper input format was used. This is what normalization is all about.

You explained just above that the main and noble purpose of normalization is to avoid false duplicates. But here you go much further since you:
- normalize the data from the typist completely and irreversibly, and so the output.
- even ask him to make his _input_ RIS-compliant.

> The only problem that I've come across while looking at these examples is that the current implementation does not handle abbreviated double names very well. "Schleifer,Karl-Heinz" is ok, but "Schleifer,K.-H." will cause problems to the best of my knowledge. I'll look into this and fix it if necessary.

My patch does a simple fix to this: it drops the period as a separator, using only spaces. That's all. I know it's not RIS-compliant anymore, but I do not care, since I never used it and will never since it is flawed. And I manage false duplicates by hand, which admittedly sucks, but hey, this is only a one-page long patch.

By the way, the only specification about RIS names syntax I could find is here: <http://www.refman.com/support/risformat_tags_02.asp> and it says nothing about periods nor middlenames. Do you have a better reference? and... publicly available?

Thanks for the time to answer, and thanks again for refdb. Quoting you to conclude:

> Otherwise this is an example of the beauty of free software. If you code this for yourself, everyone can have it his way.

Cheers,

Marc.

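The "fail noisily" idea quoted from the Rule of Repair could be applied to name handling along the following lines. This is only a sketch under stated assumptions, not part of RefDB or of the patch: `check_lossless`, `letters` and `NameLossError` are hypothetical names, and the check deliberately ignores dots and spaces, so it only catches lost letters, the one kind of loss both sides of the thread would agree on.

```python
class NameLossError(ValueError):
    """Raised when stored name parts no longer contain every entered letter."""


def letters(parts):
    """Concatenate only the alphanumeric characters of all name parts."""
    return "".join(ch for part in parts for ch in part if ch.isalnum())


def check_lossless(entered, stored):
    """Return `stored` unchanged, or fail loudly if letters went missing."""
    if letters(entered) != letters(stored):
        raise NameLossError(f"name parts {entered!r} were stored as {stored!r}")
    return stored


# Example values taken from the RISX results quoted in this thread.
check_lossless(("Smith", "F.", "M.", "N."), ("Smith", "F", "M N"))  # passes
try:
    check_lossless(("Chu", "H.K.", "Jerry"), ("Chu", "H", "Jerry"))
except NameLossError as err:
    print("rejected:", err)
```

A check of this kind would turn the silent "H.K." to "H" reduction into an error at import time instead of a surprise in a later bibliography.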
From: Bruce D'A. <bd...@fa...> - 2004-01-06 21:19:35

An interesting attempt to bring together the FRBR (an ambitious attempt to redefine bibliographic metadata), with RDF and MySQL. Early schema definitions at:

http://disobey.com/noos/LibDB/index.cgi?DatabaseSchema

Bruce

From: Markus H. <mar...@mh...> - 2004-01-06 15:40:19

Marc Herbert writes:
> > > I'll be happy to add a section to the docs in all caps and a red box around it stating that author names will be normalized for the sake of consistency.
>
> I could not find this yet in <http://refdb.sourceforge.net/manual-0.9.4/book1.html>
>
> Explaining "how" they are normalized also seems rather vital to me.

Sorry. Didn't get round to it yet.

> BTW, while testing and comparing, I found some quirks that do not seem to fit _any_ logic (as opposed to: not fit my taste).

In this case you should document them and file a bug report instead of spreading FUD.

> ----------------------------------------------------------
>
> The "reversible" refdb patch
>
> Marc Herbert
> $Date: 2004/01/05 21:30:50 $
> $Revision: 1.2 $
>
> ---- The issue ----
>
> Currently, refdb tries to "normalize" authors' names entered into the database, in order to avoid false duplicates and maybe to cope with weird requirements of some bibliographic stylesheets. This means fiddling with full stops and so-called "middlenames".
>
> I think refdb should either reliably perform this normalization according to a documented, reviewed and formal specification -- or not at all. Today it does it in an undocumented way, silently modifying some user data with potential information loss in corner cases.

The purpose of the name mangling is to reduce all names consistently to the RIS input format. This is currently the common denominator of both RIS and RISX input until a richer data format like MODS is implemented. If the name mangling is not consistent, then it is a bug that needs to be fixed, not a feature that needs to be removed.

The bottom line is: if you supply your RIS data according to the RIS input format, they won't be fiddled with at all. If you use a different format, e.g. by leaving out periods or by adding random spaces, RefDB attempts to mangle the data until they fit the RIS input format. This works in many cases, but may fail in border cases.

The important thing to understand is that the dots and spaces used in the RIS input format do not have anything to do with the final representation of a name in a formatted bibliography. The sole purpose of the dots and spaces is to separate the name parts in order to tell the parser where to chop. You could use slashes or question marks just as well. As it is the job of a bibliography software to output the author names in all possible formatting variations, it is essential not to store pre-formatted data in the database. However, it may be useful (see below) to store pre-parsed data.

The same principle basically applies to the RISX input format. However, the RISX format provides separate elements for the name parts, so there is no need for textual separators at all. There is no point to enter a middle initial as <middlename>B.</middlename>. The middle initial is "B", not "B.". "B." is a representation of a middle name which is used in some bibliography styles (others don't use the dot or leave out the middle name altogether) and can be trivially generated from "B". Therefore, a <middlename>B</middlename> is all you need. If RefDB detects the superfluous dot, it will remove it.

> This (short and simple) refdb patch disables all modifications of user-data, and lets the user decide by himself how names should be "normalized" (assuming it's both desirable and possible).
> Thanks to it, what gets _in_ refdb, gets _out_ untouched.
> For instance, if you enter "Harry S Truman" in refdb, you would get back:
> - without this patch: "Harry S. Truman"
> - with this patch: "Harry S Truman" (amazing! and "reversible"...)

Now we get to the purpose of normalization. As stated above, the data in the AU field of a RIS dataset or an <author> element are not strings that are supposed to be inserted into a bibliography as they are. They are input formats that supply data (the name parts) for one object in the database (an author). If an author has several reference entries in the database, these entries must link to the same object (the author), not to a specific representation of the author's name.

Assume the following cases:

Truman,Harry S.
Truman,Harry S
Truman, Harry S
Truman, Harry S.

The first one is what the RIS input format asks for. The others aren't that different except for a space or a dot here and there. If these belong to four references among 100, you probably wouldn't even notice that the author names are written differently, although it is clear that they mean the same author. If you add these four datasets to RefDB, the first entry won't be mangled at all (as it sticks to the rules). The other entries are normalized, and as a consequence, all four references link to the same author. The normalized internal representation of the author name is "Truman,Harry S." (amazing! and "reversible"...).

If you go ahead and prevent this normalization, the four references will point to four different author objects, one with the representation "Truman,Harry S.", another one with the representation "Truman,Harry S", and so forth. If you now run a query for references by some "Truman,Harry S.", you'll miss 75% of the possible hits. This is not good. You can obviously work around this weakness of the patch by running all queries against regular expressions, but this is not an option if you design a simplified interface that allows users to pick names from a list (something Mike is currently working on).

> Warning: this patch may or may not break further formatting by some bibliographic stylesheets, depending if they expect "normalized" names from the database. I do not care much about breaking stylesheets that want you to change the way you write your name (probably in a more "english" way). I do not mind if they munge names when formatting for publication, but pushing this "normalization" up to the database is not acceptable to me. After all, respectful and less rigid formatting tools also (co-)exist. The answer to this question is likely to be in the following function: backend-dbiba.c:format_firstmiddlename()

This is the key point why we have to argue at all. You do not understand that the database does not contain a formatted string that shows how you would like to see your name printed on a piece of paper. The database contains the name parts, plus a normalized representation for speeding up queries that happens to look like some formatted representation. When creating a bibliography, RefDB then has to assemble the name parts in a fashion that matches the requirements of the publisher. It is irrelevant how the cited author or the author writing the paper would like to represent that name.

> By the way, be aware that you should NOT use spaces at the beginning or at the end of RISX <name>(s), since this will lead to false duplicates in the database _independently from this patch_. On the other hand, RIS input (AU - field) is more or less space-insensitive.

The RIS input is insensitive to leading and trailing spaces as the latter are basically invisible in this input format. I have not anticipated that anyone would add stray spaces to XML elements as they are easily detected, but if this is a common problem it could be handled just as well.

> The SQL database uses 4 (redundant) fields to store author names:
> fullname, lastname, firstname, middlenameS

The columns are not redundant. Redundancy implies that they hold the same information but this is not the case. author_lastname, author_firstname, and author_middlename hold the pre-parsed name parts which are different by definition. The author_name field holds the normalized representation of the full name or a corporate name. The latter doesn't have name parts but it can't go into e.g. author_lastname either as we have to distinguish between authors that have only one name and corporate names. The only redundancy in this setup is that a non-corporate name could be assembled from the name parts. However, author names are usually added once and then queried each time someone requests a reference or a bibliography containing that name. For the sake of speed it makes sense to parse the name once (when you add it) instead of each time it is retrieved.

> __________________________
> Modifications to RIS input
> (i.e., "addref -t ris")
>
> [...]
>
> RIS input examples
>
> Smith, F.M.N.
> Chu, H.K. Jerry
> Truman, Harry S
>
> -> database results
>
> official : "Smith,F.M.N." "Smith" "F" "M N"
> patched  : "Smith,F.M.N." "Smith" "F.M.N."
>
> official : "Chu,H.K.Jerry" "Chu" "H" "K Jerry "
> patched  : "Chu,H.K.Jerry" "Chu" "H.K.Jerry"
>
> official : "Truman,Harry S." "Truman" "Harry" "S "
> patched  : "Truman,Harry S" "Truman" "Harry S"
>
> (also notice the spurious space ending some middlenames with the official version).

These spaces are due to a bug introduced after adding support for multiple middle names. Fixed in CVS.

Please note that the last output of the patched version does not follow the RIS specs, therefore it is not clear whether RefMan, EndNote and the like import this properly.

> ____________________________
> Modifications to RISX input
> (i.e., "addref -t risx")
>
> - full stops "tricks" are disabled

As stated above, you should not use periods anyway as they are not required. Following this simple rule will make most of your complaints obsolete.

> RISX input examples
>
> "Smith" "F." "M." "N."
> "Truman" "Harry" "S"
> "Chu" "H.K." "Jerry"
>
> -> database results
>
> official : "Smith,F.M.N." "Smith" "F" "M N"
> patched  : "Smith,F. M. N." "Smith" "F." "M. N."

Whether or not to use spaces after initials is a formatting issue that is handled by the bibliography style. A period is enough as a separator for the internal representation. The spaces are redundant and bloat the data without a reason.

> official : "Truman,Harry S." "Truman" "Harry" "S"
> patched  : "Truman,Harry S" "Truman" "Harry" "S"

Again, the patched output may not be readable by other tools using RIS.

> official : "Chu,H.Jerry" "Chu" "H" "Jerry" (information loss!)
> patched  : "Chu,H.K. Jerry" "Chu" "H.K." "Jerry"

Please provide the RISX input that you used for this example. The following input works just fine for me without any loss of data:

<author>
  <lastname>Chu</lastname>
  <firstname>H</firstname>
  <middlename>K</middlename>
  <middlename>Jerry</middlename>
</author>

(the markup is odd but RISX currently does not support something like a "prime" given name which is not in the first position, as in "M. Steven Miller". RIS does not support this either, so this will be handled properly only by the forthcoming MODS-like data model)

Please note that in the official examples given above, most of the output is correct although an improper input format was used. This is what normalization is all about.

The only problem that I've come across while looking at these examples is that the current implementation does not handle abbreviated double names very well. "Schleifer,Karl-Heinz" is ok, but "Schleifer,K.-H." will cause problems to the best of my knowledge. I'll look into this and fix it if necessary.

> However, for some unknown reason, bibtex output pulls the fullname from the database and parses it again, so a small patch was needed here again to prevent the addition of full stops.

The "unknown reason" is negligence. I haven't heard positively of anyone using the bibtex output, so this gets somewhat less attention than it should.

> __________
> Convertors
>
> The "nmed2ris" convertor also fiddles with authors' names in a similar way. I can not yet say more about this, sorry: I do not use the MED format at all and could not have tested modifications.

It is clearly stated in the manual that this program is obsolete and will eventually be removed from the distribution. If at all, have a look at the med2ris.pl script.

regards,
Markus

--
Markus Hoenicka
mar...@ca...
(Spam-protected email: replace the quadrupeds with "mhoenicka")
http://www.mhoenicka.de

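The normalization argument in this message can be made concrete with a small sketch. It is an illustration only, not the RefDB implementation (which lives in the C sources): `normalize_ris_author` is a hypothetical function that assumes a "Last,Given" input, treats periods and spaces purely as separators, and rebuilds the key so that initials are followed by a period and spelled-out names by a space.

```python
import re


def normalize_ris_author(raw):
    """Return (full, last, first, middle) for a 'Last,Given names' string."""
    last, _, given = raw.partition(",")
    last = last.strip()
    # Periods and whitespace only tell the parser where to chop.
    parts = [p for p in re.split(r"[.\s]+", given) if p]
    pieces = []
    for i, part in enumerate(parts):
        if len(part) == 1:
            pieces.append(part + ".")   # initial: period acts as separator
        elif i < len(parts) - 1:
            pieces.append(part + " ")   # spelled-out name followed by more
        else:
            pieces.append(part)         # spelled-out name at the end
    full = last + ("," + "".join(pieces) if parts else "")
    first = parts[0] if parts else ""
    middle = " ".join(parts[1:])
    return full, last, first, middle


variants = ["Truman,Harry S.", "Truman,Harry S", "Truman, Harry S", "Truman, Harry S."]
keys = {normalize_ris_author(v)[0] for v in variants}
print(len(set(variants)), "raw spellings ->", len(keys), "author key:", keys)
```

Under these assumptions the four spellings collapse to the single key "Truman,Harry S.", whereas an exact-match query against the raw strings would find only one of the four. Abbreviated double names such as "K.-H." are not handled by this sketch.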
From: Marc H. <mar...@fr...> - 2004-01-05 22:44:30
|
> > I'll be happy to add a section to the docs in all caps and a red box
> > around it stating that author names will be normalized for the sake of
> > consistency.

I could not find this yet in <http://refdb.sourceforge.net/manual-0.9.4/book1.html>. Explaining "how" they are normalized also seems rather vital to me.

> > > OK: I suggest one *extremely* simple improvement to this code: the
> > > ability to disable it, at least at configure time (I will code this
> > > for myself in any case).

> > Otherwise this is an example of the beauty of free software. If you
> > code this for yourself, everyone can have it his way.

It's done. See: <http://marc.herbert.free.fr/refdb/reversible/> or below/attached. Comments welcome (including from you, Markus :-)

BTW, while testing and comparing, I found some quirks that do not seem to fit _any_ logic (as opposed to: not fit my taste).

Cheers,
Marc.

----------------------------------------------------------
The "reversible" refdb patch
Marc Herbert
$Date: 2004/01/05 21:30:50 $ $Revision: 1.2 $

---- The issue ----

Currently, refdb tries to "normalize" authors' names entered into the database, in order to avoid false duplicates and maybe to cope with weird requirements of some bibliographic stylesheets. This means fiddling with full stops and so-called "middlenames".

I think refdb should either reliably perform this normalization according to a documented, reviewed and formal specification -- or not at all. Today it does it in an undocumented way, silently modifying some user data with potential information loss in corner cases.

This (short and simple) refdb patch disables all modifications of user data, and lets the user decide by himself how names should be "normalized" (assuming it's both desirable and possible). Thanks to it, what gets _in_ refdb gets _out_ untouched. For instance, if you enter "Harry S Truman" in refdb, you would get back:

- without this patch: "Harry S. Truman"
- with this patch: "Harry S Truman" (amazing! and "reversible"...)

Warning: this patch may or may not break further formatting by some bibliographic stylesheets, depending on whether they expect "normalized" names from the database. I do not care much about breaking stylesheets that want you to change the way you write your name (probably in a more "english" way). I do not mind if they munge names when formatting for publication, but pushing this "normalization" up to the database is not acceptable to me. After all, respectful and less rigid formatting tools also (co-)exist. The answer to this question is likely to be in the following function: backend-dbiba.c:format_firstmiddlename()

By the way, be aware that you should NOT use spaces at the beginning or at the end of RISX <name>(s), since this will lead to false duplicates in the database _independently from this patch_. On the other hand, RIS input (AU - field) is more or less space-insensitive.

This patch is compatible with version 0.9.4-pre3, and _not_ with version 0.9.3.

Users (yet...) satisfied with current refdb behaviour and thus not directly interested in this patch may still be interested in understanding how their data is modified; just having a look at this patch will provide detailed answers. The summary of changes just below also explains (in english instead of C).

This patch also disables middlename(s) input in the RIS format, due to a flawed RIS input syntax, and due to their controversial nature (see http://sourceforge.net/mailarchive/forum.php?forum_id=1798&viewmonth=200312); all RIS "given names" go together untouched into the "firstname" database field. On the other hand, RISX <middlename>s are not disabled by this patch. To disable middlenames in RISX, just... don't use the tag <middlename>.

---- Detailed issues and modifications ----

The SQL database uses 4 (redundant) fields to store author names: fullname, lastname, firstname, middlenameS

__________________________
Modifications to RIS input (i.e., "addref -t ris")

firstname/middlenames parsing is disabled.
- the patch disables fiddling with full stops.
- middlenames are disabled: inside the AU field, the whole "given name" as delimited by commas, goes into the "firstname" database field.

RIS input examples

Smith, F.M.N.
Chu, H.K. Jerry
Truman, Harry S

-> database results

official : "Smith,F.M.N." "Smith" "F" "M N"
patched  : "Smith,F.M.N." "Smith" "F.M.N."
official : "Chu,H.K.Jerry" "Chu" "H" "K Jerry "
patched  : "Chu,H.K.Jerry" "Chu" "H.K.Jerry"
official : "Truman,Harry S." "Truman" "Harry" "S "
patched  : "Truman,Harry S" "Truman" "Harry S"

(also notice the spurious space ending some middlenames with the official version).

____________________________
Modifications to RISX input (i.e., "addref -t risx")

- full stops "tricks" are disabled

RISX input examples

"Smith" "F." "M." "N."
"Truman" "Harry" "S"
"Chu" "H.K." "Jerry"

-> database results

official : "Smith,F.M.N." "Smith" "F" "M N"
patched  : "Smith,F. M. N." "Smith" "F." "M. N."
official : "Truman,Harry S." "Truman" "Harry" "S"
patched  : "Truman,Harry S" "Truman" "Harry" "S"
official : "Chu,H.Jerry" "Chu" "H" "Jerry" (information loss!)
patched  : "Chu,H.K. Jerry" "Chu" "H.K." "Jerry"

_______
Outputs

No output except bibtex's is modified.

RIS output dumps "as is" the first field of the SQL database (fullname).

RISX output uses the 3 other fields (last, first, middles). It dumps last and firstname untouched, then parses the "middlenames" field according to spaces before dumping <middlename> elements.

The patch modifies neither RIS nor RISX output. Most other outputs also work one way or the other, and are not modified by the patch.

However, for some unknown reason, bibtex output pulls the fullname from the database and parses it again, so a small patch was needed here again to prevent the addition of full stops.

__________
Convertors

The "nmed2ris" convertor also fiddles with authors' names in a similar way. I can not yet say more about this, sorry: I do not use the MED format at all and could not have tested modifications.

________
Feedback

Since all this is unfortunately complicated, the probability that I missed something despite all my efforts is non-zero. I thank you in advance for any feedback.

___________________________
The art of Unix Programming

Some food for thought from: <http://catb.org/~esr/writings/taoup/html/ch01s06.html>

Rule of Transparency: design for visibility to make inspection and debugging easier. For a program to demonstrate its own correctness, it needs to be using input and output formats sufficiently simple so that the proper relationship between valid input and correct output is easy to check.

Rule of Least Surprise: In interface design, always do the least surprising thing. |
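As a rough illustration of the "reversible" idea described in the patch notes above, here is a minimal Python sketch that splits an RIS AU value at its commas and keeps the given-name part untouched. It is not RefDB's C code: the function name `split_ris_author` and the three-part tuple are hypothetical, and the sketch does not try to reproduce how the real patch treats spaces after full stops (compare the "Chu" example above).

```python
def split_ris_author(au_value):
    """Split an RIS AU value into (lastname, given, suffix).

    Assumption for this sketch: the first comma ends the last name and an
    optional second comma starts the suffix. Nothing inside the given-name
    part is added, removed or re-punctuated; only surrounding whitespace
    is trimmed.
    """
    parts = [part.strip() for part in au_value.split(",", 2)]
    parts += [""] * (3 - len(parts))  # pad missing given name / suffix
    lastname, given, suffix = parts
    return lastname, given, suffix


if __name__ == "__main__":
    for au in ("Smith, F.M.N.", "Chu, H.K. Jerry", "Truman, Harry S"):
        print(split_ris_author(au))
```

The point of the sketch is only that no full stop is added to "Harry S": what goes in comes out again.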
From: Markus H. <mar...@mh...> - 2003-12-31 01:37:43
|
Markus Hoenicka writes:
> - Updating the personal information may lead to loss of data in the
>   previous 0.9.4 prereleases (0.9.3 and earlier were not
>   affected). This is a *very good* reason to upgrade if you run a
>   0.9.4 prerelease currently.

Come to think of it, 0.9.4-pre1 was not affected either. The bug crept in at 0.9.4-pre2.

regards,
Markus

--
Markus Hoenicka
mar...@ca...
(Spam-protected email: replace the quadrupeds with "mhoenicka")
http://www.mhoenicka.de |
From: Markus H. <mar...@mh...> - 2003-12-31 01:33:38
|
Hi all,

I'm sure most of you have better things to do at this time of the year, but there's a new prerelease of RefDB available for testing. It is about time to release 0.9.4 and this is supposed to be the last prerelease before the rollout. The prerelease is available right here:

http://refdb.sourceforge.net/pre/refdb-latest.tar.gz

See the NEWS file for the full details about what has changed compared to previous versions. The main issues are:

- You'll have to recreate the system database and the reference database as the schemas have changed compared to 0.9.3 and compared to 0.9.4-pre2. See the file UPGRADING if you want to keep your existing data.

- RefDB now supports extended notes. These notes can be freely linked to any number of references, author names, keywords, and periodical names. A couple of new commands (addnote/updatenote, getnote, deletenote, addlink) are available in refdbc to manage these notes. The query language was slightly modified to allow searching for references which are linked to particular notes and vice versa. See the documentation for 0.9.4 which is already available at http://refdb.sourceforge.net/doc.html

- Updating the personal information may lead to loss of data in the previous 0.9.4 prereleases (0.9.3 and earlier were not affected). This is a *very good* reason to upgrade if you run a 0.9.4 prerelease currently.

There's one thing you should look out for when testing this prerelease:

- strictly speaking, RefDB requires the current CVS versions of both libdbi and libdbi-drivers due to bugfixes in the datetime parsing (libdbi) and the added support for DATE and TIME types (libdbi-drivers, pgsql driver only; the other drivers used to support this already). There was a small change in the API unrelated to these fixes, but this might cause problems if you build RefDB with the latest official libdbi release. I believe I've worked around this but I couldn't verify this myself. If you do use the latest releases and run pgsql, please expect some hiccups that I want to know about.

I'd also like to point out that Michael Smith started to develop a new Emacs minor mode as a graphical front-end for RefDB. It works just fine in conjunction with ris.el. refdb-mode.el has not been released officially but it is available from CVS in the elisp directory. It is not yet feature-complete but it is improving rapidly. Please give it a try and provide feedback for Mike. All existing documentation is at the top of refdb-mode.el.

regards,
Markus

--
Markus Hoenicka
mar...@ca...
(Spam-protected email: replace the quadrupeds with "mhoenicka")
http://www.mhoenicka.de |
From: Markus H. <mar...@mh...> - 2003-12-21 21:03:09
|
Bruce D'Arcus writes:
> Any obvious reasons why I get this?

Yes -- I'm not trying to be sarcastic but the most likely reason is that you're using a Mac. Building Text::Iconv requires a hack on some platforms (including Cygwin) to link against the proper libiconv library. This may be the case as well for OSX. If your installation of Text::Iconv failed for this reason, you're likely to see this error. Can you go back to the installation of Text::Iconv and see whether this causes any problems? Can you check the status of libiconv?

regards,
Markus

> # en2ris.pl --help
> Can't load '/usr/local/lib/libIconv.dylib' for module Text::Iconv:
> /usr/local/lib/libIconv.dylib(2): Not a recognisable object file
> at /usr/local/bin/en2ris.pl line 31
> Compilation failed in require at /usr/local/bin/en2ris.pl line 31.
> BEGIN failed--compilation aborted at /usr/local/bin/en2ris.pl line 31.

--
Markus Hoenicka
mar...@ca...
(Spam-protected email: replace the quadrupeds with "mhoenicka")
http://www.mhoenicka.de |
From: Bruce D'A. <bd...@fa...> - 2003-12-21 16:47:47
|
Any obvious reasons why I get this?

# en2ris.pl --help
Can't load '/usr/local/lib/libIconv.dylib' for module Text::Iconv:
/usr/local/lib/libIconv.dylib(2): Not a recognisable object file
at /usr/local/bin/en2ris.pl line 31
Compilation failed in require at /usr/local/bin/en2ris.pl line 31.
BEGIN failed--compilation aborted at /usr/local/bin/en2ris.pl line 31. |
From: Markus H. <mar...@mh...> - 2003-12-20 23:39:40
|
Hi Marc,

Marc Herbert writes:
> The reason why it would break import from RefMan seems quite obvious
> to me: according to this documentation, RefMan does NOT support
> so-called "middlenames".
> <http://www.refman.com/support/risformat_tags_02.asp>
> "For Firstname, you can use full names, initials, or both."

You'll have to look a little closer than that, and maybe get some hands-on experience with these kinds of tools. Middle names are supported implicitly by assuming the first non-lastname is the first name and any other non-lastname is a middle name. This is e.g. very apparent if you look at the RefMan style definitions which support the formatting of last, first, and middle names (using exactly these terms).

> > I just wanted to point out that it does not make much sense to me to
> > code half-parsed strings in XML when you have to parse anyway. Why not
> > go the extra inch and do it right?
>
> Because the concept of middlenames is not part of any data model
> (except risx), but only of some specific _formatting_ needs.

We're running in circles, I guess. These specific formatting needs imply that your data model allows you to distinguish the parts of the data which need to be formatted differently. You would never expect the DocBook stylesheets to format a plain text file successfully, but for some reason you expect this for author names given more or less as plain text.

> > The middlename handling and abbreviating stuff is not at your
> > discretion. If a style requires these modifications it does not make
> > any sense to add a switch that will produce incorrect data.
>
> Yes, because other styles will require something else. Thus a "switch"
> to satisfy all of them. The "--[not]-life-sciences" switch :-)

I don't see your point here. If all non-life sciences applications do not require the distinction between first and middle names, their style specifications will be a little simpler, that's all.

> > That is, re-parse the name string each time a query comes in? It
> > couldn't come any worse.
>
> I found very interesting to note that this "so bad" re-parsing is
> exactly what happens in _today's_ code, in the case of several
> middlenames. Search for "strtok" in:
> <http://cvs.sourceforge.net/viewcvs.py/*checkout*/refdb/refdb/src/backend-risx.c?content-type=text%2Fplain&rev=1.20>
>
> I know: you will change this later. But still, it seems to work today.

No, I was talking about the SQL query that tries to match the incoming query against the available datasets. This is currently done against the normalized representation of the full name. No re-parsing happens at this stage as it would grossly affect the performance.

Things are a little different if we're talking about generating output from these data. Middle names are currently stored as a list of tokens in a single field. I believe (that is, I didn't run any benchmarks) that tokenizing this list for those backends that actually require this is faster than using an additional table plus joins for all backends, even for those that don't bother. The backends that you'll be using most of the time (scrn or html for locating references) use the normalized representation and hence do not tokenize the middle name list.

> > > OK: I suggest one *extremely* simple improvement to this code: the
> > > ability to disable it, at least at configure time (I will code this
> > > for myself in any case).
>
> > This does not make sense as it breaks consistent searching and the
> > bibliography formatting.
>
> "Consistent searching" across... different refdb installations !?

Consistent searching across all names.

> > Otherwise this is an example of the beauty of free software. If you
> > code this for yourself, everyone can have it his way.
>
> Sure ! I will, I will...
>
> Time for a "contrib/" directory ? :-)

I'd be very reluctant to add code to a contrib directory that would not work with the rest of the application.

regards,
Markus

--
Markus Hoenicka
mar...@ca...
(Spam-protected email: replace the quadrupeds with "mhoenicka")
http://www.mhoenicka.de |
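The storage scheme mentioned in this message (middle names kept as a space-separated token list in one field and tokenized only by the backends that need it) can be sketched roughly as follows. This is an illustrative Python snippet, not the strtok-based C code in backend-risx.c; the function name `risx_author` and the two-space indentation of the output are assumptions made for the example.

```python
from xml.sax.saxutils import escape


def risx_author(lastname, firstname, middlenames):
    """Build a RISX-like <author> element from the three name fields.

    `middlenames` is assumed to be a single space-separated string, as in
    the database examples in this thread (e.g. "K Jerry "). It is split
    on whitespace only when output is generated.
    """
    lines = ["<author>", f"  <lastname>{escape(lastname)}</lastname>"]
    if firstname:
        lines.append(f"  <firstname>{escape(firstname)}</firstname>")
    # str.split() without arguments ignores leading/trailing spaces, so a
    # spurious trailing space in the field does not create an empty token.
    for token in middlenames.split():
        lines.append(f"  <middlename>{escape(token)}</middlename>")
    lines.append("</author>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(risx_author("Chu", "H", "K Jerry "))
```

A backend that only needs the normalized full name would simply never call such a function, which is the performance argument made above.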
From: Marc H. <mar...@en...> - 2003-12-19 22:44:30
|
On Thu, 11 Dec 2003, Markus Hoenicka wrote:

> Marc Herbert writes:
> > I am glad to hear this! Then fix the _automated_ RIS parsing/syntax by
> > adding a comma to it?

> Where would you like to have an additional comma? I'd be reluctant to
> do this anyway as this would break data import from RefMan and
> EndNote, but the RIS syntax uses two commas anyway. One to separate
> the last name from the rest, and one to separate the suffix from the
> rest.

I was suggesting a comma between each firstname or middlename, in order to have (at least) the same middlename data model in both RISX and RIS, and an unambiguous RIS syntax.

The reason why it would break import from RefMan seems quite obvious to me: according to this documentation, RefMan does NOT support so-called "middlenames".
<http://www.refman.com/support/risformat_tags_02.asp>
"For Firstname, you can use full names, initials, or both."

How do people in life sciences work with RefMan? It would be interesting to know.

> > I did not know publishers of 5000 life sciences journals were so
> > english-centric and ignorant of foreign cultures. This bug is quite
> > amazing.
>
> It's sad but I don't see it as my job to change this.

So be happy: that is absolutely not what I was asking for (see previous messages).

> I just wanted to point out that it does not make much sense to me to
> code half-parsed strings in XML when you have to parse anyway. Why not
> go the extra inch and do it right?

Because the concept of middlenames is not part of any data model (except risx), but only of some specific _formatting_ needs.

> The middlename handling and abbreviating stuff is not at your
> discretion. If a style requires these modifications it does not make
> any sense to add a switch that will produce incorrect data.

Yes, because other styles will require something else. Thus a "switch" to satisfy all of them. The "--[not]-life-sciences" switch :-)

> > No it's not too late: you can also play the same game with dots and
> > spaces later at search/formatting time, without subtly and silently
> > modifying the data that the user intently input; that is losing
> > information really.

> That is, re-parse the name string each time a query comes in? It
> couldn't come any worse.

I found it very interesting to note that this "so bad" re-parsing is exactly what happens in _today's_ code, in the case of several middlenames. Search for "strtok" in:
<http://cvs.sourceforge.net/viewcvs.py/*checkout*/refdb/refdb/src/backend-risx.c?content-type=text%2Fplain&rev=1.20>

I know: you will change this later. But still, it seems to work today.

> > Please do never silently and subtly modify user data. At least ask for
> > confirmation! The real world is too complex for any "clever" names
> > standardization algorithm.
>
> I'll be happy to add a section to the docs in all caps and a red box
> around it stating that author names will be normalized for the sake of
> consistency.

Thanks in advance! (I consider this a minimum before modifying user data).

> > OK: I suggest one *extremely* simple improvement to this code: the
> > ability to disable it, at least at configure time (I will code this
> > for myself in any case).

> This does not make sense as it breaks consistent searching and the
> bibliography formatting.

"Consistent searching" across... different refdb installations !?

> Otherwise this is an example of the beauty of free software. If you
> code this for yourself, everyone can have it his way.

Sure ! I will, I will...

Time for a "contrib/" directory ? :-)

Cheers,
Marc. |
From: Bruce D'A. <bd...@fa...> - 2003-12-18 14:00:33
|
Marc wanted to know what the date option does for names in MODS. Here's an example:

<name type="personal">
  <namePart>Lamberton, John Porter</namePart>
  <namePart type="date">1839-1917</namePart>
  <role>
    <roleTerm type="text">joint ed.</roleTerm>
  </role>
  <role>
    <roleTerm authority="marcrelator" type="code">prf</roleTerm>
  </role>
</name> |
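To make the role of the date-typed namePart concrete, here is a small Python sketch that reads the example above with the standard library and separates the name string from the life dates and roles. The function name `describe_name` and the dictionary keys are invented for this illustration; the snippet uses the un-namespaced fragment exactly as quoted, whereas a full MODS record would carry the MODS namespace.

```python
import xml.etree.ElementTree as ET

# The fragment quoted in the message above, without a namespace.
MODS_NAME = """<name type="personal">
  <namePart>Lamberton, John Porter</namePart>
  <namePart type="date">1839-1917</namePart>
  <role><roleTerm type="text">joint ed.</roleTerm></role>
  <role><roleTerm authority="marcrelator" type="code">prf</roleTerm></role>
</name>"""


def describe_name(xml_text):
    """Return the untyped name parts, the date part and the role terms."""
    name = ET.fromstring(xml_text)
    info = {"names": [], "date": None, "roles": []}
    for part in name.findall("namePart"):
        if part.get("type") == "date":
            info["date"] = part.text  # life dates, used to disambiguate people
        else:
            info["names"].append(part.text)
    info["roles"] = [term.text for term in name.findall("role/roleTerm")]
    return info


if __name__ == "__main__":
    print(describe_name(MODS_NAME))
```

In other words, the date is carried alongside the name rather than inside it, so a consumer can ignore it or use it to tell two people of the same name apart.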
From: Markus H. <mar...@mh...> - 2003-12-18 05:43:33
|
Pollyanna writes:
> I invoked "refdbd" with
> refdbd -s -e 0 -l 7
> and sent
> viewstat
> from "refdba". The response of "refdbd" was
> adding client on fd 5
> server waiting n_max_fd=5
> Segmentation fault

I believe this bug is a good old friend that has been fixed in 0.9.4-pre1. Please use this prerelease:

http://refdb.sourceforge.net/pre/refdb-0.9.4-pre1.tar.gz

or this newer one with the extended notes support (still experimental):

http://refdb.sourceforge.net/pre/refdb-0.9.4-pre2.tar.gz

hope this helps,
Markus

--
Markus Hoenicka
mar...@ca...
(Spam-protected email: replace the quadrupeds with "mhoenicka")
http://www.mhoenicka.de |
From: Pollyanna <Pol...@li...> - 2003-12-18 02:29:28
|
Dear List,

after installing "refdb 0.9.3" out of the "refdb-starterkit" according to the "readme.txt" supplied there, I tried to test it as described in the handbook. "Expat-1.95.6" was already installed.

I invoked "refdbd" with

refdbd -s -e 0 -l 7

and sent

viewstat

from "refdba". The response of "refdbd" was

adding client on fd 5
server waiting n_max_fd=5
Segmentation fault

before ending itself. 8-(

"refdba" stated

server error: incorrect scramble string

8-(

I don't know where to look for further hints. I did the installation according to the documentation on a "Slackware 9.1" box with "MySQL 4.0.15a" as database. The only thing that seems worth mentioning to me is that I used "checkinstall 1.5.3" instead of the ordinary "install" command. Could "checkinstall" have messed "refdb" up?

Regards
Pollyanna |
From: Bruce D'A. <bd...@fa...> - 2003-12-15 00:36:00
|
Turns out the MARC/MODS subject structure is more flexible/complex than I'd realized. Here's a record I just coded:

<?xml version="1.0" encoding="utf-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
  <mods ID="Tilly2000a">
    <name type="personal">
      <namePart type="given">Charles</namePart>
      <namePart type="family">Tilly</namePart>
      <role>
        <roleTerm type="text">reviewer</roleTerm>
      </role>
    </name>
    <titleInfo>
      <title>Review of Moral Economy and Popular Protest</title>
    </titleInfo>
    <subject>
      <topic>riots</topic>
      <name type="personal">
        <namePart type="given">Edward</namePart>
        <namePart type="given">P.</namePart>
        <namePart type="family">Thompson</namePart>
      </name>
      <titleInfo>
        <title>Moral Economy and Popular Protest</title>
        <subTitle>Crowds, Conflict and Authority</subTitle>
      </titleInfo>
      <name type="personal">
        <namePart type="given">Adrian</namePart>
        <namePart type="family">Randall</namePart>
        <role>
          <roleTerm type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Andrew</namePart>
        <namePart type="family">Charlesworth</namePart>
        <role>
          <roleTerm type="text">editor</roleTerm>
        </role>
      </name>
    </subject>
    <genre>review</genre>
    <relatedItem type="host">
      <titleInfo>
        <title>Journal of Interdisciplinary History</title>
      </titleInfo>
      <typeOfResource>text</typeOfResource>
      <originInfo>
        <dateIssued>2000</dateIssued>
        <issuance>continuing</issuance>
      </originInfo>
      <genre>periodical</genre>
      <part>
        <detail type="volume"><number>31</number></detail>
        <detail type="issue"><number>2</number></detail>
        <extent unit="page">
          <start>259</start>
          <end>260</end>
        </extent>
      </part>
    </relatedItem>
    <recordInfo>
      <recordCreationDate encoding="w3cdtf">2003-12-11</recordCreationDate>
      <recordIdentifier source="citekey">Tilly2000a</recordIdentifier>
    </recordInfo>
  </mods>
</modsCollection> |
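For readers who want to see how such a record could be consumed, the following Python sketch pulls the citation key, title and host-journal details out of a trimmed copy of the record above using the standard library. The helper name `summarize` and the result keys are hypothetical, and the embedded record is shortened for the example (the subject block and names are left out).

```python
import xml.etree.ElementTree as ET

NS = {"m": "http://www.loc.gov/mods/v3"}

# Trimmed copy of the record above: names and the subject block are omitted.
RECORD = """<modsCollection xmlns="http://www.loc.gov/mods/v3">
  <mods ID="Tilly2000a">
    <titleInfo><title>Review of Moral Economy and Popular Protest</title></titleInfo>
    <genre>review</genre>
    <relatedItem type="host">
      <titleInfo><title>Journal of Interdisciplinary History</title></titleInfo>
      <originInfo><dateIssued>2000</dateIssued></originInfo>
      <part>
        <detail type="volume"><number>31</number></detail>
        <detail type="issue"><number>2</number></detail>
        <extent unit="page"><start>259</start><end>260</end></extent>
      </part>
    </relatedItem>
  </mods>
</modsCollection>"""


def summarize(xml_text):
    """Collect a few citation fields from the first mods element."""
    mods = ET.fromstring(xml_text).find("m:mods", NS)
    host = mods.find("m:relatedItem[@type='host']", NS)
    return {
        "key": mods.get("ID"),
        "title": mods.findtext("m:titleInfo/m:title", namespaces=NS),
        "journal": host.findtext("m:titleInfo/m:title", namespaces=NS),
        "year": host.findtext("m:originInfo/m:dateIssued", namespaces=NS),
        "volume": host.findtext("m:part/m:detail[@type='volume']/m:number", namespaces=NS),
        "pages": (
            host.findtext("m:part/m:extent/m:start", namespaces=NS),
            host.findtext("m:part/m:extent/m:end", namespaces=NS),
        ),
    }


if __name__ == "__main__":
    print(summarize(RECORD))
```

Note that the direct-child path for the title deliberately skips the titleInfo nested inside subject, which is the structural subtlety the message points out.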
From: Bruce D'A. <bd...@fa...> - 2003-12-13 22:54:02
|
On Dec 13, 2003, at 4:07 PM, Markus Hoenicka wrote:

> Bruce D'Arcus writes:
>> I suggested this to Markus privately as an alternative:
>>
>> Keep the current structure, but rename "note" to "content," which
>> makes
>> it all more consistent with RDF, Atom, etc.
>
> And, add a type and xml:lang attribute? Seems to make sense to me.

Here's an example of Atom where it stands now. Might be worth looking at some more. Files at:

http://tbray.org/ongoing/pie/0.2/

<?xml version="1.0" encoding="utf-8"?>
<feed version="0.2" xmlns="http://purl.org/atom/ns#">
  <!-- required elements -->
  <title>dive into mark</title>
  <link>http://diveintomark.org/</link>
  <modified>2003-08-05T12:29:29Z</modified>
  <!-- optional elements -->
  <tagline>A lot of effort went into making this effortless</tagline>
  <id>tag:diveintomark.org,2003:3</id>
  <generator name="Movable Type">http://www.movabletype./org/?v=2.64</generator>
  <copyright>Copyright (c) 2003, Mark Pilgrim</copyright>
  <entry>
    <!-- required elements -->
    <title>Atom 0.2 snapshot</title>
    <link>http://diveintomark.org/2003/08/05/atom02</link>
    <id>tag:diveintomark.org,2003:3.2397</id>
    <issued>2003-08-05T08:29:29-04:00</issued>
    <modified>2003-08-05T18:30:02Z</modified>
    <!-- optional elements -->
    <created>2003-08-05T12:29:29Z</created>
    <summary>The Atom 0.2 snapshot is out. Here are some sample feeds.</summary>
    <author>
      <name>Mark Pilgrim</name>
      <url>http://diveintomark.org/</url>
      <email>f8...@ex...</email>
    </author>
    <contributor>
      <name>Sam Ruby</name>
      <url>http://intertwingly.net/blog/</url>
      <email>ru...@ex...</email>
    </contributor>
    <contributor>
      <name>Joe Gregorio</name>
      <url>http://bitworking.org/</url>
      <email>jo...@ex...</email>
    </contributor>
    <content type="application/xhtml+xml" mode="xml" xml:lang="en-us">
      <div xmlns="http://www.w3.org/1999/xhtml">
        <p>The Atom 0.2 snapshot is out. Changes from the <a href="http://intertwingly.net/foo.html">0.1 snapshot</a>:</p>
        <ol>
          <li>MAY contain <code>feed/copyright</code>. Free-form copyright statement that applies to the feed (XXX more here). Datatype is &lt;xsd:string&gt;. If present, MUST NOT be blank.</li>
          <li>
            <p>MAY contain <code>feed/generator</code>. Datatype is &lt;xsd:string&gt;. If present, MUST be full URI and SHOULD point to the home page of the program which generated the feed.</p>
            <p>Has optional <code>@name</code> attribute, whose datatype is also &lt;xsd:string&gt;. If present, @name MUST NOT be blank.</p>
          </li>
          <li>several other changes not listed here</li>
        </ol>
      </div>
    </content>
  </entry>
</feed> |
From: Bruce D'A. <bd...@fa...> - 2003-12-13 22:44:37
|
On Dec 13, 2003, at 3:40 PM, Markus Hoenicka wrote:

> As long as the notes share a common title, common keywords, or link to
> the same objects in the database, they will pop up side by side when
> running an appropriate query.

[...snip...]

> If you now run a query that returns the reference "Miller1999", both
> notes will be attached to the result. If you search for notes
> containing the keyword "something", you'll also get back both notes.
> Please note that it is possible for both users to use different but
> overlapping sets of keywords.

OK, you're starting to convince me.

Perhaps there ought to be a way to ensure that when titles are the same, they are in fact intended to be so? Mike's menu ought to be able to do this with the autocomplete mechanism, I imagine...

Bruce |
From: Markus H. <mar...@mh...> - 2003-12-13 21:15:53
|
Bruce D'Arcus writes:
> I suggested this to Markus privately as an alternative:
>
> Keep the current structure, but rename "note" to "content," which makes
> it all more consistent with RDF, Atom, etc.

And, add a type and xml:lang attribute? Seems to make sense to me.

regards,
Markus

--
Markus Hoenicka
mar...@ca...
(Spam-protected email: replace the quadrupeds with "mhoenicka")
http://www.mhoenicka.de |
From: Markus H. <mar...@mh...> - 2003-12-13 21:15:36
|
Bruce D'Arcus writes:
> You only allow title, keywords, and users to be associated with an
> xnote. This essentially means all of this information is user-level,
> and precludes the possibility of using it in a multi-user context.

No. It rather encourages using it in a multi-user context as it does not require that users fiddle with the same xml snippet.

> Or is the problem that it is difficult to fit this into a RDBMS
> context; that because of *that* there can only be one note?

It's the other way round: fitting it into a RDBMS context does not require keeping the notes together at the XML level. As long as the notes share a common title, common keywords, or link to the same objects in the database, they will pop up side by side when running an appropriate query. Consider these examples:

User 1 adds this:

<xnote user="user1">
  <title>Some Topic</title>
  <note>This is the User1 opinion.</note>
  <keyword>something</keyword>
  <keyword>whatever</keyword>
  <link type="reference" target="Miller1999"/>
</xnote>

and user 2 adds this, without even knowing that User1 is working on the same topic:

<xnote user="user2">
  <title>Some Topic</title>
  <note user="user2">This is the User2 opinion.</note>
  <keyword>something</keyword>
  <keyword>nothing</keyword>
  <link type="reference" target="Miller1999"/>
</xnote>

If you now run a query that returns the reference "Miller1999", both notes will be attached to the result. If you search for notes containing the keyword "something", you'll also get back both notes. Please note that it is possible for both users to use different but overlapping sets of keywords.

regards,
Markus

--
Markus Hoenicka
mar...@ca...
(Spam-protected email: replace the quadrupeds with "mhoenicka")
http://www.mhoenicka.de |
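The behaviour described in this message, two independently written notes surfacing together because they link to the same reference or share a keyword, can be mimicked with a small in-memory index. The Python sketch below is only an illustration of the idea: RefDB itself does this with SQL queries against its database, and the function name `index_notes` as well as the well-formed variants of the two example notes are assumptions of this sketch.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Well-formed variants of the two notes from the message above.
NOTES = [
    """<xnote user="user1">
         <title>Some Topic</title>
         <note>This is the User1 opinion.</note>
         <keyword>something</keyword>
         <keyword>whatever</keyword>
         <link type="reference" target="Miller1999"/>
       </xnote>""",
    """<xnote user="user2">
         <title>Some Topic</title>
         <note>This is the User2 opinion.</note>
         <keyword>something</keyword>
         <keyword>nothing</keyword>
         <link type="reference" target="Miller1999"/>
       </xnote>""",
]


def index_notes(xml_notes):
    """Map reference targets and keywords to the users whose notes use them."""
    by_reference = defaultdict(list)
    by_keyword = defaultdict(list)
    for text in xml_notes:
        xnote = ET.fromstring(text)
        user = xnote.get("user")
        for link in xnote.findall("link"):
            if link.get("type") == "reference":
                by_reference[link.get("target")].append(user)
        for keyword in xnote.findall("keyword"):
            by_keyword[keyword.text].append(user)
    return by_reference, by_keyword


if __name__ == "__main__":
    refs, keywords = index_notes(NOTES)
    print(dict(refs))
    print(dict(keywords))
```

Neither user had to touch the other's XML; the association exists only through the shared link target and keyword.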
From: Bruce D'A. <bd...@fa...> - 2003-12-13 17:25:33
|
Was just looking at the new weblog standard Atom and was struck by how closely it resembles xnote.

<entry xmlns="http://example.com/newformat#"
       xmlns:lj="some.lj.example.com/namespace/">
  <title>My First Entry</title>
  <subtitle>In which a newbie learns to blog...</subtitle>
  <summary>A very boring entry...</summary>
  <author>
    <name>Bob B. Bobbington</name>
    <homepage>http://bob.name/</homepage>
    <weblog>http://bob.blog/</weblog>
  </author>
  <issued>2003-02-05T12:29:29</issued>
  <lj:mood>happy</lj:mood>
  <content type="application/xhtml+xml" xml:lang="en-us">
    <p xmlns="...">Hello, <em>weblog</em> world! 2 &lt; 4!</p>
  </content>
</entry>

I suggested this to Markus privately as an alternative:

Keep the current structure, but rename "note" to "content," which makes it all more consistent with RDF, Atom, etc.

Note, the above example is from an article on Atom extensibility.

http://bitworking.org/news/Extending_the_AtomAPI

I keep thinking there might be opportunities for the intersection of bib metadata and weblog stuff. A weblog post is just a publicly distributed metadata-attached note, after all, and bibliographic metadata is simply more detailed than that which is typically associated with weblogs.

Bruce |
From: Bruce D'A. <bd...@fa...> - 2003-12-13 01:13:47
|
On Dec 12, 2003, at 7:28 PM, Markus Hoenicka wrote:

> I've told you previously that this is not going to work. An xnote
> element contains exactly one note. The note is the text that you want
> to attach to something in the database. The xnote element contains
> this note plus the information that you need to manage this note.

You've told me, but I'd like other opinions ;-)

You only allow title, keywords, and users to be associated with an xnote. This essentially means all of this information is user-level, and precludes the possibility of using it in a multi-user context. Here's what I mean as an example:

<xnote>
  <title>Some Topic</title>
  <note>This is a subject a group is working on</note>
  <note user="user1">User1 adds a note related to it.</note>
  <note user="user2">User2 follows up.</note>
  <keyword>something</keyword>
</xnote>

Or let's say I have the same thing at the user level, but want to be able to include multiple date-stamped notes within the category?

<xnote>
  <title>Some Topic</title>
  <note>This is a subject I am working on</note>
  <note date="2002-10-12">I add an original note.</note>
  <note date="2003-12-20">I later add a followup, and want to track the different times.</note>
  <keyword>something</keyword>
  <link type="reference" target="one"/>
  <link type="reference" target="two"/>
  <link type="reference" target="three"/>
</xnote>

Or is the problem that it is difficult to fit this into an RDBMS context; that because of *that* there can only be one note?

>> The other issue is that whatever we settle on ought to be able to
>> stand alone, or to be embedded elsewhere. In my case, I embed the
>> content in MODS (in its "extension" element). Here's an example:
>>
>> <extension>
>>   <xnote xmlns="http://refdb.sourceforge.net/xnotes-ns">
>>     <note user="darcusb" date="2003-12-11">
>>       <p>The note content</p>
>>     </note>
>>   </xnote>
>> </extension>
>>
>> Part of why I like the note-level user info is because it's easier to
>> manage when embedding elsewhere (I find it awkward to a) include
>> xnoteset in this context and b) declare the namespace in the same
>> element as the user).
>
> You don't need xnoteset in this context. The user information is
> attached to the xnote element, not to the xnoteset element.

The point is this: embedding xnotes in MODS requires declaring the namespace on the root element that is included. You are right that I don't need to use xnoteset, but then I end up with:

<xnote xmlns="http://refdb.sourceforge.net/xnotes-ns" user="darcusb">

It works, of course, but is simply a little awkward for my tastes. My bigger point is simply about how the structure is conceptualized.

>> The datestamp is validated with schema datatyping, and it along with
>> the user is auto-inserted with the macro templates I've written.
>
> This is great as long as you use Schema and a Schema-aware editor.

The templates aren't specific to any mode. As for schema and schema-aware editors and the larger trajectory of XML development, I'm a young scholar; I'd rather not shackle the work I do with the limitations of decades-old technologies. It's why I prefer RelaxNG for development and nXML for editing (it's the first thing to come along to convince me to learn Emacs), and why I avoid coding my documents with DTD-specific things like character entities.

Bruce
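A minimal sketch of what reading the multi-note structure proposed above could look like. It assumes the hypothetical layout from this message (several note children carrying optional user and date attributes), which is not what the current xnote DTD allows. The function name read_notes is made up for illustration, and the date check uses Python's ISO date parser as a stand-in for schema datatyping.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Hypothetical multi-note xnote, following the proposal in the message above.
# This is NOT the structure defined by the RefDB xnote DTD.
XNOTE = """\
<xnote>
  <title>Some Topic</title>
  <note>This is a subject a group is working on</note>
  <note user="user1" date="2002-10-12">User1 adds a note related to it.</note>
  <note user="user2" date="2003-12-20">User2 follows up.</note>
  <keyword>something</keyword>
</xnote>
"""


def read_notes(xml_text):
    """Return (user, date, text) tuples for every note child, in document order."""
    root = ET.fromstring(xml_text)
    notes = []
    for note in root.findall("note"):
        stamp = note.get("date")
        # date.fromisoformat raises ValueError on a malformed YYYY-MM-DD stamp,
        # which stands in here for schema-level datatype validation.
        when = date.fromisoformat(stamp) if stamp else None
        notes.append((note.get("user"), when, (note.text or "").strip()))
    return notes


notes = read_notes(XNOTE)
for user, when, text in notes:
    print(user, when, text)
```

Notes without a user or date attribute come back with None in that position, so a group-level note and per-user followups can sit side by side in the same list.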
From: Markus H. <mar...@mh...> - 2003-12-13 00:29:33
|
Bruce D'Arcus writes:

> Has anyone out there looked at the new notes functionality that Markus
> has added? If not, please do. I'm curious if anyone has any feedback.
>

I second that. I know it's just a prerelease, but I thought this functionality might cause a few people to give it a try.

> The idea came from me, and is tied to my interest in xml and
> bibliographic annotation. So Markus has come up with the basic
> structure, while my focus has been more on the micro-markup. We thus
> currently have two somewhat different formats; one written as a DTD,
> the other as a Relax NG schema.
>
> Most of the differences are minor, but I still wonder what people think.
>
> Here's the example in 0.9.4-pre2:
>
> <xnoteset>
>   <xnote id="1" key="firstnote">
>     <title>myfirstnote</title>
>     <date><year>2003</year><month>10</month><day>12</day></date>
>     <note>the note proper</note>
>     <keyword>biochemistry</keyword>
>     <keyword>enzymes</keyword>
>     <link type="reference" target="WANG2002"/>
>     <link type="reference" target="Phadke1994"/>
>     <link type="author" target="Walsh,N."/>
>     <link type="journalabbrev" target="Biochem.Pharmacol."/>
>   </xnote>
> </xnoteset>
>
> I have argued a few things:
>
> 1) I don't like all the markup for dates, since this is just a
> datestamp that can be automated.
>

I've noted previously that the date element is a fallback that will probably not be used very often. The current CVS version of RefDB will insert the current date if no date is specified, which is most likely what you need anyway.

> 2) I prefer changing "author" to "name" to allow linking to non-author
> names.
>

I'm reluctant to change this now as "name" has a somewhat different meaning in risx.dtd.

> 3) The bigger conceptual difference is that I was thinking of user
> information as tied to the note proper, and not the xnote. Whether I
> am right or not depends entirely on how people will use the format. In
> my vision, a group could define an xnote with a topic title, and then
> individuals could add their own notes within that.
>

I've told you previously that this is not going to work. An xnote element contains exactly one note. The note is the text that you want to attach to something in the database. The xnote element contains this note plus the information that you need to manage this note.

> The other issue is that whatever we settle on ought to be able to
> stand alone, or to be embedded elsewhere. In my case, I embed the
> content in MODS (in its "extension" element). Here's an example:
>
> <extension>
>   <xnote xmlns="http://refdb.sourceforge.net/xnotes-ns">
>     <note user="darcusb" date="2003-12-11">
>       <p>The note content</p>
>     </note>
>   </xnote>
> </extension>
>
> Part of why I like the note-level user info is because it's easier to
> manage when embedding elsewhere (I find it awkward to a) include
> xnoteset in this context and b) declare the namespace in the same
> element as the user).
>

You don't need xnoteset in this context. The user information is attached to the xnote element, not to the xnoteset element. The xnoteset element is a wrapper to allow several xnote elements in a single file that uses the xnote.dtd. If you embed the xnote functionality in a different DTD/Schema, all you need to do is make sure that more than one xnote is allowed.

> The datestamp is validated with schema datatyping, and it along with
> the user is auto-inserted with the macro templates I've written.
>

This is great as long as you use Schema and a Schema-aware editor.

regards,
Markus

--
Markus Hoenicka
mar...@ca...
(Spam-protected email: replace the quadrupeds with "mhoenicka")
http://www.mhoenicka.de
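The date fallback described above can be sketched roughly as follows. This is only an illustration of the idea, not RefDB's actual code: the functions note_date and dates_by_key are hypothetical, and the second, undated xnote is made up to show the fallback.

```python
import xml.etree.ElementTree as ET
from datetime import date

# First xnote is abridged from the 0.9.4-pre2 example; the second, undated one
# is a made-up addition to exercise the fallback.
XNOTESET = """\
<xnoteset>
  <xnote id="1" key="firstnote">
    <title>myfirstnote</title>
    <date><year>2003</year><month>10</month><day>12</day></date>
    <note>the note proper</note>
    <link type="reference" target="WANG2002"/>
  </xnote>
  <xnote id="2" key="secondnote">
    <title>undated note</title>
    <note>no date element here</note>
  </xnote>
</xnoteset>
"""


def note_date(xnote, today):
    """Use the explicit date element if present, otherwise fall back to today."""
    d = xnote.find("date")
    if d is None:
        return today
    return date(int(d.findtext("year")), int(d.findtext("month")), int(d.findtext("day")))


def dates_by_key(xml_text, today=None):
    """Map each xnote's key attribute to its effective date."""
    today = today or date.today()
    root = ET.fromstring(xml_text)
    return {x.get("key"): note_date(x, today) for x in root.findall("xnote")}


# A fixed "today" keeps the result reproducible.
stamps = dates_by_key(XNOTESET, today=date(2003, 12, 13))
print(stamps)
```

With this kind of default, the verbose year/month/day markup only has to be written when a note needs a date other than the day it was added.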
From: Bruce D'A. <bd...@fa...> - 2003-12-12 18:42:30
|
Has anyone out there looked at the new notes functionality that Markus has added? If not, please do. I'm curious if anyone has any feedback.

The idea came from me, and is tied to my interest in XML and bibliographic annotation. So Markus has come up with the basic structure, while my focus has been more on the micro-markup. We thus currently have two somewhat different formats; one written as a DTD, the other as a Relax NG schema.

Most of the differences are minor, but I still wonder what people think.

Here's the example in 0.9.4-pre2:

<xnoteset>
  <xnote id="1" key="firstnote">
    <title>myfirstnote</title>
    <date><year>2003</year><month>10</month><day>12</day></date>
    <note>the note proper</note>
    <keyword>biochemistry</keyword>
    <keyword>enzymes</keyword>
    <link type="reference" target="WANG2002"/>
    <link type="reference" target="Phadke1994"/>
    <link type="author" target="Walsh,N."/>
    <link type="journalabbrev" target="Biochem.Pharmacol."/>
  </xnote>
</xnoteset>

I have argued a few things:

1) I don't like all the markup for dates, since this is just a datestamp that can be automated.

2) I prefer changing "author" to "name" to allow linking to non-author names.

3) The bigger conceptual difference is that I was thinking of user information as tied to the note proper, and not the xnote. Whether I am right or not depends entirely on how people will use the format. In my vision, a group could define an xnote with a topic title, and then individuals could add their own notes within that.

The other issue is that whatever we settle on ought to be able to stand alone, or to be embedded elsewhere. In my case, I embed the content in MODS (in its "extension" element). Here's an example:

<extension>
  <xnote xmlns="http://refdb.sourceforge.net/xnotes-ns">
    <note user="darcusb" date="2003-12-11">
      <p>The note content</p>
    </note>
  </xnote>
</extension>

Part of why I like the note-level user info is because it's easier to manage when embedding elsewhere (I find it awkward to a) include xnoteset in this context and b) declare the namespace in the same element as the user).

The datestamp is validated with schema datatyping, and it along with the user is auto-inserted with the macro templates I've written.

Bruce
From: Bruce D'A. <bd...@fa...> - 2003-12-11 21:24:08
|
On Dec 11, 2003, at 3:41 PM, Markus Hoenicka wrote:

> Does anyone have real-life examples of names along the lines of
> "F. John Smith" in a journal that uses full first and initialized
> middle names? This would indeed be the best proof why first and middle
> names (or whatever you call them) need to be distinguishable.

This business is all a non-issue for me. In my data, I always initialize middle names, because a) I never know the full middle names, and b) I cannot *ever* imagine using -- nor do I recall ever seeing -- a full middle name in any bibliographic entry, regardless of the style.

With respect to the above, while I don't have a real-world example at hand, I would expect to see this in the bib entry:

Smith, F. John

In essence, "F. John" is the first/given name. If anything, this supports the argument against middle names, I think.

BTW, v3 of MODS adds a new value for the namePart type attribute: termsOfAddress. Example:

<name type="personal">
  <namePart type="termsOfAddress">Sir</namePart>
  <namePart type="given">Arthur</namePart>
  <namePart type="family">Conan Doyle</namePart>
</name>

Bruce
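To illustrate the point that "F. John" can simply be carried as the given name, here is a small sketch that turns a MODS-style name into the sort form expected in a bibliography entry. The function sort_form is hypothetical, the sample name is built from the "F. John Smith" case discussed above, and the MODS namespace is left out for brevity. Any namePart types other than given and family are ignored.

```python
import xml.etree.ElementTree as ET

# MODS-style personal name for the "F. John Smith" case; namespace omitted.
NAME = """\
<name type="personal">
  <namePart type="given">F. John</namePart>
  <namePart type="family">Smith</namePart>
</name>
"""


def sort_form(name_xml):
    """Return 'Family, Given' from given/family nameParts, ignoring other types."""
    name = ET.fromstring(name_xml)
    parts = {"given": [], "family": []}
    for part in name.findall("namePart"):
        kind = part.get("type")
        if kind in parts:
            parts[kind].append((part.text or "").strip())
    family = " ".join(parts["family"])
    given = " ".join(parts["given"])
    return f"{family}, {given}" if given else family


entry = sort_form(NAME)
print(entry)
```

No separate middle-name field is needed to produce "Smith, F. John"; whether a style needs to tell first from middle names for other output forms is the open question in this thread.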
From: Markus H. <mar...@mh...> - 2003-12-11 21:04:01
|
Marc Herbert writes:

> I am glad to hear this! Then fix the _automated_ RIS parsing/syntax by
> adding a comma to it?
>

Where would you like to have an additional comma? I'd be reluctant to do this anyway as this would break data import from RefMan and EndNote, but the RIS syntax uses two commas anyway: one to separate the last name from the rest, and one to separate the suffix from the rest.

> I did not know publishers of 5000 life sciences journals were so
> English-centric and ignorant of foreign cultures. This bug is quite
> amazing.
>

It's sad but I don't see it as my job to change this.

> > I also believe that your argument is moot that if a
> > style requires the concept of middle names it should be able to
> > retrieve the middle name by itself. With the same argument you could
> > dump entirely unparsed strings in any order onto a bib software and
> > expect it to figure out how to parse it, as it requires to distinguish
> > between given and family names, titles and suffixes.
>
> If I remember well, this discussion is about the right level of detail
> to adopt and where. So I find "With the same argument you could dump
> entirely unparsed strings" not very constructive.
>

I just wanted to point out that it does not make much sense to me to code half-parsed strings in XML when you have to parse anyway. Why not go the extra inch and do it right?

> This life-science-specific "middlename parsing" could be factorized
> without being put down to the database. So refdb could be used
> internationally without bugs and hassles: just by working around it.
> Why not add a "-[no]middlename" option for outputs?
>
> Same thing for the "clever" abbreviating code.
>

The middlename handling and abbreviating stuff is not at your discretion. If a style requires these modifications it does not make any sense to add a switch that will produce incorrect data.

> No it's not too late: you can also play the same game with dots and
> spaces later at search/formatting time, without subtly and silently
> modifying the data that the user intently input; that is losing
> information really.
>

That is, re-parse the name string each time a query comes in? It couldn't come any worse.

> Please never silently and subtly modify user data. At least ask for
> confirmation! The real world is too complex for any "clever" names
> standardization algorithm.

I'll be happy to add a section to the docs in all caps and a red box around it stating that author names will be normalized for the sake of consistency.

> OK: I suggest one *extremely* simple improvement to this code: the
> ability to disable it, at least at configure time (I will code this
> for myself in any case).
>

This does not make sense as it breaks consistent searching and the bibliography formatting. Otherwise this is an example of the beauty of free software. If you code this for yourself, everyone can have it his way.

regards,
Markus

--
Markus Hoenicka
mar...@ca...
(Spam-protected email: replace the quadrupeds with "mhoenicka")
http://www.mhoenicka.de
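The two-comma convention mentioned above (one comma after the last name, one before the suffix) can be sketched like this. The function split_ris_name is hypothetical and only splits the string; it does not reproduce RefDB's parsing or its name normalization, and the sample names are made up for illustration.

```python
def split_ris_name(value):
    """Split 'Last,Rest,Suffix' on at most two commas; missing parts become ''."""
    parts = [p.strip() for p in value.split(",", 2)]
    parts += [""] * (3 - len(parts))  # pad so last, rest, suffix always exist
    last, rest, suffix = parts
    return {"last": last, "rest": rest, "suffix": suffix}


# "rest" holds first and middle names together, exactly as entered.
with_suffix = split_ris_name("Smith,F. John,Jr.")
without_suffix = split_ris_name("Conan Doyle,Arthur")
print(with_suffix)
print(without_suffix)
```

Everything between the two commas stays in one field here, which is where the disagreement in this thread begins: whether that field should be parsed further into first and middle names on import, or left as the user typed it.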