Thread: [Refdb-users] Parsing error ? with addref
From: Daniel O'D. <dan...@ul...> - 2006-06-24 19:04:21
I'm getting what looks like an error produced with addref or getref: I'm importing about 2900 references using addref. If I then output to ris, I'm finding that keywords seem to be mixed up a little. What seems to be happening is that short (one character or word) and unique (one occurrence in the original data) keywords are showing up randomly and repeatedly assigned to other entries. These other entries still keep their original keywords.

Here's an example:

1) Original RIS

> TY - JOUR
> A1 - Abraham, Lenore
> T1 - Cædmon's Hymn and the Geþwærnysse (fitness) of things
> JO - ABR
> JF - American Benedictine Review
> Y1 - 1992///
> VL - 43
> SP - 331
> EP - 44
> N2 - folder. 27-ii-01: notes in folder
> N1 - <Fairly long note removed/>
> AV - Kakelbont Article Abraham1992
> KW - chapter 6
> KW - chapter 1
> KW - Dawn
> KW - chapter 6
> KW - chapter 1
> KW - Dawn
> KW - caedmon's Hymn
> KW - structure
> KW - literary criticism
> KW - bede
> KW - Historia ecclesiastia
> KW - internal criticism
> ER -

Here's what is coming out the other end:

> refdbc: getref -d refdbib -t ris :AU:~^Abraham
>
> TY - JOUR
> ID - ABRAHAM1992
> AU - Abraham,Lenore
> TI - Cædmon's Hymn and the Geþwærnysse (fitness) of things
> JF - American Benedictine Review
> JO - ABR.
> KW - chapter 6
> KW - chapter 1
> KW - Dawn
> KW - chapter 6
> KW - chapter 1
> KW - Dawn
> KW - caedmon's Hymn
> KW - structure
> KW - literary criticism
> KW - bede
> KW - Historia ecclesiastia
> KW - internal criticism
> KW - Cædmon's Hymn
> KW - l
> KW - m
> KW - o
> KW - ld
> KW - h
> KW - æ
> KW - Cædmon
> VL - 43
> SP - 331
> EP - 44
> N2 - folder. 27-ii-01: notes in folder
> RP - NOT IN FILE
> AV - Kakelbont Article Abraham1992
> N1 - <fairly long note/>
> PY - 1992///
> ER -
> 999:1 retrieved:0 failed
> refdbc:

The problem is the following keyword section:

> KW - l
> KW - m
> KW - o
> KW - ld
> KW - h
> KW - æ
> KW - Cædmon

These are all unique keywords drawn from other entries in the original RIS file. They show up multiple times in the refdb generated ris files.

The presence or absence of N2 doesn't affect anything. I don't know if the long note does, though the same problem occurs with entries that have no notes.

A random selection of these extra keywords seems to be attached to every entry in the output RIS.

-d

--
Daniel Paul O'Donnell
Associate Professor and Chair of English
Director, Digital Medievalist Project <http://www.digitalmedievalist.org/>
University of Lethbridge
Lethbridge AB T1K 3M4 Canada
Vox +1 403 329-2377
Fax +1 403 382-7191
:@caedmon/ubuntu
From: Rich S. <rsh...@ap...> - 2006-06-24 20:44:32
On Sat, 24 Jun 2006, Daniel O'Donnell wrote:

> I'm importing about 2900 references using addref. If I then output to ris,
> I'm finding that keywords seem to be mixed up a little. What seems to be
> happening is that short (one character or word) and unique (one occurrence
> in the original data) keywords are showing up randomly and repeatedly
> assigned to other entries. These other entries still keep their original
> keywords.

Dan,

Since I'm another brand-new user, I probably don't know what I'm writing about, but I'll point out what I don't think is correct, then we can both learn some more about RefDB.

As far as the keyword order is concerned, that's a PostgreSQL thing. Rows (tuples) are returned from a query in no particular order. Unless the query has an ORDER BY clause, we'll see this every time. It's not any sort of a bug or issue of concern.

> 1) Original RIS
>> KW - chapter 6
>> KW - chapter 1
>> KW - Dawn
>> KW - chapter 6
>> KW - chapter 1
>> KW - Dawn

Why the duplicate entries in the original?

> Here's what is coming out the other end:
>> refdbc: getref -d refdbib -t ris :AU:~^Abraham
                  ^^^^^^^^^^

Unless you have multiple databases with entries, you don't need to specify the name. What I don't see in your command line is the option to write the returned records to a file, e.g., '-o hereiam.ris'. I wonder if you'll see the same output in a file that you see on screen.

> The problem is the following keyword section
>
> KW - l
>> KW - m
>> KW - o
>> KW - ld
>> KW - h
>> KW - æ
>> KW - Cædmon

> These are all unique keywords drawn from other entries in the original
> RIS file. They show up multiple times in the refdb generated ris files.

Are the encodings for all databases the same? Do you really have single letter keywords on some entries?

This looks as strange and confusing as my problems with keywords returned by getref. But, in my case it was caused by asking for an exact match rather than a 'like' match.

Rich

--
Richard B. Shepard, Ph.D.               |  The Environmental Permitting
Applied Ecosystem Services, Inc.(TM)    |  Accelerator
<http://www.appl-ecosys.com>
Voice: 503-667-4517  Fax: 503-667-8863
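
The point about row order can be tried out with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, and the table name `keyword` and its columns `ref_id` and `name` are made up for illustration; they are not RefDB's actual schema. Without ORDER BY the order of the returned rows is whatever the engine happens to produce, so only the sorted query is safe to rely on.

```python
import sqlite3

# In-memory database standing in for the real backend (assumption:
# hypothetical table and column names, not RefDB's schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE keyword (ref_id TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO keyword VALUES (?, ?)",
    [
        ("ABRAHAM1992", "structure"),
        ("ABRAHAM1992", "bede"),
        ("ABRAHAM1992", "literary criticism"),
        ("ABRAHAM1992", "Dawn"),
    ],
)


def keywords_unordered(conn, ref_id):
    """No ORDER BY: the engine may return the rows in any order."""
    rows = conn.execute("SELECT name FROM keyword WHERE ref_id = ?", (ref_id,))
    return [row[0] for row in rows]


def keywords_sorted(conn, ref_id):
    """ORDER BY makes the order part of the query's contract."""
    rows = conn.execute(
        "SELECT name FROM keyword WHERE ref_id = ? ORDER BY name", (ref_id,)
    )
    return [row[0] for row in rows]


print(keywords_unordered(conn, "ABRAHAM1992"))  # order not guaranteed
print(keywords_sorted(conn, "ABRAHAM1992"))
```

Both queries return the same set of keywords; only the second promises a sequence.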
From: Markus H. <mar...@mh...> - 2006-06-24 22:02:49
Rich Shepard writes:

> As far as the keyword order is concerned, that's a PostgreSQL thing. Rows
> (tuples) are returned from a query in no particular order. Unless the query
> has an ORDERED BY clause, we'll see this every time. It's not any sort of a
> bug or issue of concern.

This is correct. It applies to all database engines as the SQL standard does not mandate a particular order of the returned datasets unless the ORDER BY clause is used. RefDB does not use this clause here as the order of keywords is not relevant in RIS. It uses the clause for author names as their order in the RIS dataset is relevant.

> the name. What I don't see in your command line is the option to write the
> returned records to a file, e.g., '-o hereiam.ris'. I wonder if you'll see
> the same output in a file that you see on screen,

The output is the same except for the summary which is sent to stderr. You see it on the screen as it displays both stuff sent to stdout (the data) and to stderr (the summary). If you send the output to a file, it will contain the part sent to stdout only.

> > The problem is the following keyword section
> >
> > KW - l
> >> KW - m
> >> KW - o
> >> KW - ld
> >> KW - h
> >> KW - æ
> >> KW - Cædmon
>
> > These are all unique keywords drawn from other entries in the original
> > RIS file. They show up multiple times in the refdb generated ris files.

This is caused by the automatic keyword scan which is turned on by default. RefDB scans the titles and abstracts of new entries for keywords already known to the database. This is a very useful feature in most cases. However, if you indeed use single-letter keywords, the purpose of the automatic keyword scan is pretty much defeated as almost all entries will end up containing these keywords. If having single-letter keywords is indeed useful and necessary for you, you should switch off the automatic keyword scan by setting

keyword_scan f

in /usr/local/etc/refdb/refdbrc.

BTW the RefDB handbook contains a section called "Input data mangling" which explains how RefDB may alter your data.

regards,
Markus

--
Markus Hoenicka
mar...@ca...
(Spam-protected email: replace the quadrupeds with "mhoenicka")
http://www.mhoenicka.de
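
To see why single-letter keywords defeat the scan, here is a rough simulation. It is only a sketch under one assumption: that the scan behaves like a case-insensitive substring match over title and abstract. RefDB's real matching rules may differ, and the function name `scan_for_keywords` is made up for this example.

```python
def scan_for_keywords(title, abstract, known_keywords):
    """Return the known keywords that occur anywhere in title or abstract.

    Assumption: a plain case-insensitive substring match. This mimics the
    symptom reported in this thread, not RefDB's actual implementation.
    """
    haystack = f"{title} {abstract}".lower()
    return [kw for kw in known_keywords if kw.lower() in haystack]


# Title and abstract (N2) taken from the entry posted in this thread.
title = "Cædmon's Hymn and the Geþwærnysse (fitness) of things"
abstract = "folder. 27-ii-01: notes in folder"

# Keywords that other entries have already put into the database.
known = ["l", "m", "o", "ld", "h", "æ", "Cædmon", "b1", "structure"]

added = scan_for_keywords(title, abstract, known)
print(added)  # ['l', 'm', 'o', 'ld', 'h', 'æ', 'Cædmon']
```

Under this assumption, every one- or two-letter keyword matches almost any title, which would explain why the extra keywords turn up on nearly every entry, while longer keywords such as "structure" are only added where they really occur.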
From: Rich S. <rsh...@ap...> - 2006-06-24 23:37:30
On Sun, 25 Jun 2006, Markus Hoenicka wrote:

> The output is the same except for the summary which is sent to
> stderr. You see it on the screen as it displays both stuff sent to
> stdout (the data) and to stderr (the summary). If you send the output
> to a file, it will contain the part sent to stdout only.

Markus,

Good to know.

> This is caused by the automatic keyword scan which is turned on by default.
> RefDB scans the titles and abstracts of new entries for keywords already
> known to the database. This is a very useful feature in most cases.
> However, if you indeed use single-letter keywords, the purpose of the
> automatic keyword scan is pretty much defeated as almost all entries will
> end up containing these keywords. If having single-letter keywords is
> indeed useful and necessary for you, you should switch off the automatic
> keyword scan by setting
>
> keyword_scan f
>
> in /usr/local/etc/refdb/refdbrc.

Thanks for the lesson.

Rich
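
The stdout/stderr split described above is a general convention, and a minimal sketch of it looks like this. The function `write_results` and its summary wording are hypothetical; this is not how refdbc is implemented, only an illustration of why a file receives the records but not the summary line.

```python
import io
import sys


def write_results(records, data_stream=None, summary_stream=None):
    """Write records to the data stream and a one-line summary elsewhere.

    Defaults follow the usual convention: data on stdout, summary on stderr.
    (Hypothetical helper for illustration only.)
    """
    if data_stream is None:
        data_stream = sys.stdout
    if summary_stream is None:
        summary_stream = sys.stderr
    for record in records:
        data_stream.write(record + "\n")
    summary_stream.write(f"{len(records)} retrieved\n")


# Capture both streams separately to show what ends up where.
data, summary = io.StringIO(), io.StringIO()
write_results(["TY - JOUR\nID - ABRAHAM1992\nER - "], data, summary)

print(repr(data.getvalue()))     # the records only
print(repr(summary.getvalue()))  # the summary only
```

On a terminal both streams are shown together, which is why the summary appears after the last record; redirecting only stdout to a file leaves the summary on the screen.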
From: Daniel O'D. <dan...@ul...> - 2006-06-25 00:52:35
Thanks Markus and Rich,

The single-letter keywords are partially a legacy data issue and partially disciplinary. In manuscript and textual studies some manuscripts are known by 1 or 2 letter sigla: Corpus Christi College Manuscript 41 is known as b1 to most Anglo-Saxonists. Also you could have an article about the Anglo-Saxon word æ. In previous databases, we keyworded things like this because searching all fields was more difficult. My plan in refdb is to rationalise things.

The duplicated keywords are also a legacy, of some script a student ran through once.

I'd read the data mangling section but missed the significance of what refdb was doing. I'll turn it off until the data is in better shape, I think!

-d