Thread: [Refdb-devel] latex bibliographies with multiple databases
From: Markus H. <mar...@mh...> - 2006-07-06 21:59:35
|
David Nebauer writes: > When I specify no database in the document -- with all references from > one database -- and specify that database as a runbib parameter, there > is no problem. If, however, I use the second method of specifying > '\cite{<database>-<reference>}' the process fails. > I'll set up a test case and see what happens. I recall it might be necessary to specify a default database anyway (i.e. use the -d switch of runbib) even if you specify a database in each citation. Does the problem persist if you set a default database? regards, Markus -- Markus Hoenicka mar...@ca... (Spam-protected email: replace the quadrupeds with "mhoenicka") http://www.mhoenicka.de |
From: David N. <dav...@sw...> - 2006-07-12 07:56:52
|
Hi Markus, > Markus Hoenicka writes: > > \cite{citekey} > > \cite{dbname:citekey} > > Just to let y'all know that the current Subversion version supports > the above mentioned citation format in LaTeX documents. > I'm experiencing all kinds of difficulty using the latest svn refdb build with LaTeX/BibTeX. 'runbib' will not extract records in BibTeX format unless citations are in the previous '\cite{[dbname-]IDcitekey}' format. Using the new citation format on my system results in '999:0 retrieved:0 failed'. Would you mind checking on your system? If the new format works for you it must be a problem at my end and I'll work up a test case. Regards, David. |
From: David N. <dav...@sw...> - 2006-07-12 09:43:15
|
Hi Markus,

> > One interesting consequence of this is that author names may contain
> > non-ascii characters. If, when new references are added to refdb, there
> > is no citation key specified, the citekey is constructed by mangling
> > primary author surname and year. If citekey is restricted to ascii
> > characters then non-ascii author surname characters would have to be
> > stripped or converted (e.g., ä -> a, ß -> ss).
>
> Currently non-ascii characters are simply stripped. You always have
> the option to specify a citation key explicitly when adding a
> reference, using any reasonable translation of the foreign characters
> to ascii.

I'd like to focus on this point again. I personally allow refdb to generate the citekey for me, mainly because it will automatically append 'a', 'b', etc. if there is danger of duplication. Automatically stripping non-ascii characters from authors with foreign characters will lead to some unusual results. A recent publication from our old workhorse 'Häßler' might produce the citekey 'Hler2006'.

There are tools around which attempt to convert sensibly from unicode to ascii. Here is an example using the tool 'konwert':

---------------------------------------------------------------------------------------
$ cat name
Häßler, Günter
$ cat name | konwert UTF8-ascii
Hassler, Gunter
$
---------------------------------------------------------------------------------------

Use of this (or a similar) tool would result in the much more satisfactory, and easy to remember, default citekey of 'Hassler2006'.

It should be fairly simple to add this additional conversion step.

Regards,
David.
From: Markus H. <mar...@mh...> - 2006-08-16 21:42:34
|
Hi David,

David Nebauer writes:
> I'd like to focus on this point again. I personally allow refdb to
> generate the citekey for me, mainly because it will automatically append
> 'a', 'b', etc. if there is danger of duplication. Automatically
> stripping non-ascii characters from authors with foreign characters will
> lead to some unusual results. A recent publication from our old
> workhorse 'Häßler' might produce the citekey 'Hler2006'.

I've tried to resolve this problem by running the citekeys through an iconv conversion (from UTF-8 to ASCII with transliteration switched on). Only then are invalid characters stripped from the strings. I hope this will improve the automatically created citation keys.

iconv uses a latex-style transliteration of umlauts and other non-ASCII characters. E.g. our beloved 'Häßler' is converted to 'H"assler'. RefDB has to strip the '"' from the latter as it must not appear in XML attribute values, hence you'll end up with 'Hassler2006' instead of the abovementioned 'Hler2006'. It is still not the correct German transliteration, which would call for 'Haessler2006', but I think we're close enough without having to hand-code a boatload of special cases.

regards,
Markus

--
Markus Hoenicka mar...@ca... (Spam-protected email: replace the quadrupeds with "mhoenicka") http://www.mhoenicka.de
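The transliterate-first, strip-second order described in this message can be sketched roughly as follows. This is only an illustration in Python, not RefDB's C code: it uses Unicode decomposition where refdbd uses iconv, and the function name make_citekey, the hand-written 'ß' rule and the decision to keep hyphens are assumptions made for the example.

```python
import unicodedata

# Hypothetical helper, not part of RefDB: build a default citation key
# from the primary author's surname and the publication year.
def make_citekey(surname: str, year: int) -> str:
    # Step 1: transliterate. NFKD splits 'ä' into 'a' plus a combining
    # diaeresis; 'ß' has no decomposition, so it is mapped by hand.
    text = unicodedata.normalize("NFKD", surname.replace("ß", "ss"))
    # Step 2: only now strip everything that is not an ASCII letter,
    # digit or hyphen (combining marks, quotes, spaces, ...).
    kept = "".join(
        ch for ch in text if ch.isascii() and (ch.isalnum() or ch == "-")
    )
    return f"{kept}{year}"

print(make_citekey("Häßler", 2006))
```

Stripping without the transliteration step would drop 'ä' and 'ß' entirely, which is how 'Hler2006' comes about. Producing the German form 'Haessler2006' would need a language-specific table in front of step 1, much like the VARIANT_de table in konwert.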
From: Markus H. <mar...@mh...> - 2006-07-12 12:13:26
|
Hi David, David Nebauer <dav...@sw...> was heard to say: > I'm experiencing all kinds of difficulty using the latest svn refdb > build with LaTeX/BibTeX. 'runbib' will not extract records in BibTeX > format unless citations are in the previous '\cite{[dbname-]IDcitekey}' > format. Using the new citation format on my system results in '999:0 > retrieved:0 failed'. Would you mind checking on your system? If the > new format works for you it must be a problem at my end and I'll work up > a test case. I hardly dare to ask, but are you sure you installed the svn version and restarted refdbd? I just checked the svn code, and all changes are in place. The current svn code certainly does not look for "-ID" but for ":" as a database separator. I'm sure that the new format works on my FreeBSD box. regards, Markus -- Markus Hoenicka mar...@ca... (Spam-protected email: replace the quadrupeds with "mhoenicka") http://www.mhoenicka.de |
From: David N. <dav...@sw...> - 2006-07-12 14:17:46
|
Hi Markus, >> I'm experiencing all kinds of difficulty using the latest svn refdb >> build with LaTeX/BibTeX. 'runbib' will not extract records in BibTeX >> format unless citations are in the previous '\cite{[dbname-]IDcitekey}' >> format. >> > I hardly dare to ask, but are you sure you installed the svn version and > restarted refdbd? I install from custom deb packages so refdbd is stopped and started as part of the debian package upgrade process. My source tree is at revision 81. I checked everything, including the source code changes you made at version 72 to alter the citation format. Finally I remembered some advice you gave recently about checking for multiple running instances of refdbd. Sure enough, I had an extra instance running, probably from a debugging exercise where I was running refdbd in standalone mode. Once stopped the old behaviour went away. Problem solved (he says sheepishly). FWIW, I can confirm the new citation format is working correctly. Regards, David. |
From: Markus H. <mar...@mh...> - 2006-07-12 12:20:19
|
Hi David,

David Nebauer <dav...@sw...> was heard to say:

> Use of this (or a similar) tool would result in the much more
> satisfactory, and easy to remember, default citekey of 'Hassler2006'.
> It should be a fairly simple to add this additional conversion step.

It is. Unfortunately the konwert sources are a bit hard on the eyes because they're in Polish but I'll try to steal from that anyway.

regards,
Markus

--
Markus Hoenicka mar...@ca... (Spam-protected email: replace the quadrupeds with "mhoenicka") http://www.mhoenicka.de
From: David N. <dav...@sw...> - 2006-07-12 14:22:59
|
Hi Markus,

> It is. Unfortunately the konwert sources are a bit hard on the eyes because
> they're in Polish but I'll try to steal from that anyway.

A lot of the heavy lifting is done by the (executable) filters. On my system they live in '/usr/share/konwert/filters/'. UTF8-ascii is a bash script:

--------------------------------------------------------------------------------------
#!/bin/bash -

VARIANT_bg=' Щ SHT щ sht '
VARIANT_de=' Ä AE Ö OE Ü UE ä ae ö oe ü ue '
VARIANT_hr=' Đ DJ đ dj '
VARIANT_vi=' À A` Á A'\'' Â A^ Ã A~ È E` É E'\'' Ê E^ Ì I` Í I'\'' Ò O` Ó O'\'' Ô O^ Õ O~ Ù U` Ú U'\'' Ý Y'\'' à a` á a'\'' â a^ ã a~ è e` é e'\'' ê e^ ì i` í i'\'' ò o` ó o'\'' ô o^ õ o~ ù u` ú u'\'' ý y'\'' Ă A( ă a( Đ DD đ dd Ĩ I~ ĩ i~ Ũ U~ ũ u~ '
VARIANT1_bg=' Ъ Y ъ y '
VARIANT1_ua=' И Y и y '
REPLACE='?'
MIME=us-ascii

if [ "$FILTERM" = out ]
then
    NPOJED=
else
    NPOJED=1
fi
FORMAT=
HTMLCHAR=
POPRAWKI=
for A in $ARG
do
    case "$A" in
    (1) NPOJED=;;
    (html) FORMAT=html;;
    (htmldec|htmlhex) FORMAT=html; HTMLCHAR=${A#html};;
    (tex) FORMAT=tex;;
    (*)
        if [ -x "${0%/*}/../aux/argcharset/$A" ]
        then
            POPRAWKI=${POPRAWKI:+$POPRAWKI | }${0%/*}/../aux/argcharset/$A
        fi
        VARIANT=VARIANT_$A; APPROX="${!VARIANT} $APPROX"
        VARIANT=VARIANT1_$A; APPROX1="${!VARIANT} $APPROX1"
        ;;
    esac
done

if [ "$POPRAWKI" ]
then
    "$SHELL" -c "$POPRAWKI"
else
    cat
fi |
case "$FORMAT" in
(html)
    "${0%/*}/../aux/fixmeta" us-ascii |
    if [ "$HTMLCHAR" ]
    then
        "${0%/*}/UTF8-html$HTMLCHAR"
    else
        trs -e '\}\[@&<>\] @' \
            ${NPOJED:+-e} ${NPOJED:+"$APPROX"} \
            -e "$APPROX1" \
            ${NPOJED:+-f} ${NPOJED:+"${0%/*}/../aux/UTF8-ascii"} \
            -f "${0%/*}/../aux/UTF8-ascii1" \
            -e "\300\-\377 ${REPLACE:-?} \200\-\277 \!" |
        trs -e '@@ @ @& & @< < @> > & &amp; < &lt; > &gt;'
    fi
    ;;
(tex)
    trs -e '\}\[@\#$%&\\^_{|}~\] @' \
        -f "${0%/*}/../aux/UTF8-tex" \
        -e "$APPROX" \
        -e "$APPROX1" \
        -f "${0%/*}/../aux/UTF8-ascii" \
        -f "${0%/*}/../aux/UTF8-ascii1" \
        -e "\300\-\377 ${REPLACE:-?} \200\-\277 \!" |
    trs -e '@@ @ @\# \# @$ $ @% % @& & @\\ \\ @^ ^ @_ _ @{ { @| | @} } @~ ~ \# \\\# $ \\$ % \\% & \\& \\ $\\backslash$ ^ \\^{} _ \\_ { \\{ | $|$ } \\} ~ \\~{}'
    ;;
(*)
    trs ${NPOJED:+-e} ${NPOJED:+"$APPROX"} \
        -e "$APPROX1" \
        ${NPOJED:+-f} ${NPOJED:+"${0%/*}/../aux/UTF8-ascii"} \
        -f "${0%/*}/../aux/UTF8-ascii1" \
        -e "\300\-\377 ${REPLACE:-?} \200\-\277 \!"
    ;;
esac
--------------------------------------------------------------------------------------

There's bash wizardry in there I can't even begin to fathom.

Regards,
David.
From: David N. <dav...@sw...> - 2006-07-24 10:40:27
|
Hi Markus, Markus Hoenicka wrote: > Actually we can still use ISO-8859-1 or whatever as the RIS input > format as refdbd internally converts it to UTF-8 if the database uses > this encoding. Forcing UTF-8 for RIS data actually makes only sense if > people use both bibtex and RIS data. The advantage of making UTF-8 the default encoding at each step in the life cycle is you don't have to remember when to specify UTF-8 encoding and when not to. Still, the most important thing is making sure to document clearly what the default encoding is at each step so the user knows what is happening. Regards, David. |
From: Markus H. <mar...@mh...> - 2006-07-24 10:57:02
|
David Nebauer <dav...@sw...> was heard to say: > Hi Markus, > > Markus Hoenicka wrote: > > Actually we can still use ISO-8859-1 or whatever as the RIS input > > format as refdbd internally converts it to UTF-8 if the database uses > > this encoding. Forcing UTF-8 for RIS data actually makes only sense if > > people use both bibtex and RIS data. > > The advantage of making UTF-8 the default encoding at each step in the > life cycle is you don't have to remember when to specify UTF-8 encoding > and when not to. Still, the most important thing is making sure to > document clearly what the default encoding is at each step so the user > knows what is happening. > This is approximately what I intended to say, but I guess I was too tired to be as clear as I should. I plan to set all defaults to UTF-8, but users are still free to configure their systems differently if they have a good reason. regards, Markus -- Markus Hoenicka mar...@ca... (Spam-protected email: replace the quadrupeds with "mhoenicka") http://www.mhoenicka.de |
From: Markus H. <mar...@mh...> - 2006-07-06 22:42:45
|
Markus Hoenicka writes: > I'll set up a test case and see what happens. I recall it might be > necessary to specify a default database anyway (i.e. use the -d switch > of runbib) even if you specify a database in each citation. Does the > problem persist if you set a default database? > Upon checking the code I noticed that I had to disable support for using more than one database for some reason. It seems to be related to the fact that I wanted to use the citation key as reference without an ID prefix or something. However, this way I can't safely separate a database part from the citation key proper. I'll have to figure out something how I can make this work again. regards, Markus -- Markus Hoenicka mar...@ca... (Spam-protected email: replace the quadrupeds with "mhoenicka") http://www.mhoenicka.de |
From: David N. <dav...@sw...> - 2006-07-07 00:02:34
|
Hi Markus, > Upon checking the code I noticed that I had to disable support for > using more than one database for some reason. > I'll have to figure out something how I can make this work again. This would be a *very* nice feature to have. For what it's worth, however, forcing the use of a single database per document would simply bring LaTeX support in line with DocBook document support. Regards, David. |
From: Markus H. <mar...@mh...> - 2006-07-07 07:01:00
|
David Nebauer <dav...@sw...> was heard to say: > This would be a *very* nice feature to have. For what it's worth, > however, forcing the use of a single database per document would simply > bring LaTeX support in line with DocBook document support. > I didn't try lately, but the reverse might be true. Once upon a time the DocBook/TEI code also allowed using more than one database. I'll check. regards, Markus -- Markus Hoenicka mar...@ca... (Spam-protected email: replace the quadrupeds with "mhoenicka") http://www.mhoenicka.de |
From: Markus H. <mar...@mh...> - 2006-07-07 23:12:47
|
Markus Hoenicka writes: > I didn't try lately, but the reverse might be true. Once upon a time the > DocBook/TEI code also allowed using more than one database. I'll check. > I've checked the situation with DocBook and TEI. Both support database names in citations using the full format. refdbxp does not support database names, presumably because the short citation format has no means to encode a database name. I've fixed refdbd to support multiple databases in bibtex too. regards, Markus -- Markus Hoenicka mar...@ca... (Spam-protected email: replace the quadrupeds with "mhoenicka") http://www.mhoenicka.de |
From: David N. <dav...@sw...> - 2006-07-08 05:25:36
|
Hi Markus,

> I've fixed refdbd to support multiple databases in bibtex too.

Now it is not working for me at all. runbib is retrieving no information. Here is an annotated transcript:

------------------------------------------------------------------------------------------
>> Here are the tex-related files <<
$ ls
-rw-r--r-- 1 david david  155 2006-07-08 14:13 test.aux
-rw-r--r-- 1 david david  275 2006-07-08 14:06 test.bbl
-rw-r--r-- 1 david david    1 2006-07-08 14:22 test.bib
-rw-r--r-- 1 david david  914 2006-07-08 14:06 test.blg
-rw-r--r-- 1 david david  696 2006-07-08 14:13 test.dvi
-rw-r--r-- 1 david david 3017 2006-07-08 14:13 test.log
-rw-r--r-- 1 david david  480 2006-07-08 14:06 test.tex
-rw-r--r-- 1 david david  178 2006-07-06 15:36 test.tex~

>> Here are the references in the tex document <<
$ cat test.tex
% File: test.tex
% Created: Thu Jul 06 03:00 PM 2006 C
% Last Change: Thu Jul 06 03:00 PM 2006 C
%
\documentclass[a4paper]{article}
\usepackage{natbib}
\author{David Nebauer}
\title{Test Document}
\begin{document}
\maketitle
\section{Introduction}
This is a test of the RefDB application used in conjunction with
vim-latexsuite. Here is a reference \cite{Agnew0}. Here is another
\cite{Weckert0}.
\bibliographystyle{plainnat}
\bibliography{test}
\end{document}

>> Let me prove the references exist <<
$ refdbc -C getref -d refs_computing :CK:=Agnew0
ID*:17
(2000)
Key: Agnew0
Agnew,Grace
Government Access to Encryption Keys
999:1 retrieved:0 failed
$ refdbc -C getref -d refs_computing :CK:=Weckert0
ID*:23
(1997)
Key: Weckert0
Weckert,J.
Intellectual Property Rights and Computer Software
Business Ethics: A European Review 6(2):102-109
999:1 retrieved:0 failed

>> Let me show the test.aux file includes those references <<
$ cat test.aux
\relax
\citation{Agnew0}
\citation{Weckert0}
\bibstyle{plainnat}
\bibdata{test}
\@writefile{toc}{\contentsline {section}{\numberline {1}Introduction}{1}}

>> Here is the runbib command that returns no results <<
$ runbib -d refs_computing -S bibtex-full -t bibtex test
999:0 retrieved:0 failed

>> The corresponding refdbib command also returns nothing <<
$ refdbib -d refs_computing -S bibtex-full -t bibtex test.aux > test.bib
999:0 retrieved:0 failed

>> The test.bib file remains empty! <<
$ cat test.bib

$
------------------------------------------------------------------------------------------

I ran refdbd standalone at log setting 7. Here is the feedback generated when running the runbib or refdbib command as above:

------------------------------------------------------------------------------------------
adding client 127.0.0.1 on fd 5
server waiting n_max_fd=5
try to read from client
serving client on fd 5 with protocol version 4
012-58-51-27
send pseudo-random string to client
parent removing client on fd 5
server waiting n_max_fd=4
gettexbib -u david -w xxxxxxxxxxxxxxxxxxxxxxxxxxx -d refs_computing -s bibtex-full 19
dbi is up
localhost david daviduser refs_computing sqlite /var/lib/refdb/db refdb
connected to database server using database: refdb
Main database looks ok: refdb
localhost david daviduser refs_computing sqlite /var/lib/refdb/db refs_computing
SELECT meta_app,meta_type,meta_dbversion from t_meta
connected to database server using database: refs_computing
command processing done, finish dialog now
child finished client on fd 5
child exited with code 0
server waiting n_max_fd=4
------------------------------------------------------------------------------------------

I'm not familiar with the 'gettexbib' command. The '19' initially looked a little strange but I had a quick dive into refdbib.c and it looks like that is a legitimate parameter -- the command buffer string length.

Adding the database name to each reference as per the manual makes no difference.

Regards,
David.

P.S. The refdb-users list is rejecting my posts sporadically with the claim my ISP is not providing a postmaster address, so I'm copying all my posts to your personal email address.
From: Markus H. <mar...@mh...> - 2006-07-08 09:28:29
|
David Nebauer <dav...@sw...> was heard to say: > This is a test of the RefDB application used in conjunction with > vim-latexsuite. Here is a reference \cite{Agnew0}. Here is another > \cite{Weckert0}. > Before my latest patch, these citations probably did work. I've implemented this version a while ago to move to a citation syntax familiar to LaTeX users, i.e. use the citation key in curly brackets. However, this does not allow a safe distinction of citation keys with and without a database part. In order to support multiple databases I had to revert this to the original format where the citation key is prefixed with "ID" or "dbname-ID". The following is supposed to work: This is a test of the RefDB application used in conjunction with vim-latexsuite. Here is a reference \cite{IDAgnew0}. Here is another \cite{otherdb-IDWeckert0}. I'm open for suggestions if you know a better way to safely distinguish database names from the citation key proper. While testing the code I came across a problem with bibliography entries which contain ampersands. The ampersand seems to be a control character in LaTeX/bibtex and needs to be escaped in the bibtex output. I'll look into this shortly. regards, Markus -- Markus Hoenicka mar...@ca... (Spam-protected email: replace the quadrupeds with "mhoenicka") http://www.mhoenicka.de |
From: David N. <dav...@sw...> - 2006-07-08 11:24:53
|
Hi Markus,

> In order to
> support multiple databases I had to revert this to the original format where
> the citation key is prefixed with "ID" or "dbname-ID".
>
> I'm open for suggestions if you know a better way to safely distinguish database
> names from the citation key proper.

I'm afraid the current scheme is unusable. It is possible for citation keys to contain hyphens -- in fact, the default key for a hyphenated author surname contains a hyphen. Try using a hyphenated citation key and watch the fun. Even better, combine a database name with the hyphenated citation key -- that introduces another hyphen. Even more fun.

refdb allows you to create databases whose names contain hyphens. I haven't tried including a hyphenated database with a hyphenated citation key in the one citation -- I'm too scared. Interestingly, while I can create a database with a hyphenated name I can't delete it (at least with an sqlite backend) -- the deletedb operation fails.

IIRC, it is illegal to include colons in citation keys. I seem to recall they are automatically stripped out. It is currently possible to include hyphens in database names. But, if you made it illegal to include a hyphen in a database name that gives you a ready-made delimiter to use in citations.

Regards,
David.
From: Markus H. <mar...@mh...> - 2006-07-08 12:14:07
|
David Nebauer <dav...@sw...> was heard to say:

> I'm afraid the current scheme is unusable. It is possible for citation
> keys to contain hyphens -- in fact, the default key for a hyphenated
> author surname contains a hyphen. Try using a hyphenated citation key
> and watch the fun. Even better, combine a database name with the
> hyphenated citation key -- that introduces another hyphen. Even more fun.

That's why the code does not rely on the hyphen as a separator, but on the sequences "ID" and "-ID", which are checked for in this particular order from left to right. Unless a citation is malformed, you can have as many hyphens or even "-ID" sequences in your citation keys as you like:

\cite{IDMILLER-IDRUM-2005}
      **
\cite{dbname-IDMILLER-IDRUM-2005}
            ***

The '*' characters mark the database name prefix separator in both cases. Unless I'm dense this is foolproof as far as citation keys are concerned. Trouble may arise when you use database names like "IDBASE" or "DATA-IDBASE". RefDB would have to reject these names in order to avoid trouble.

> refdb allows you to create databases whose names contain hyphens. I
> haven't tried including a hyphenated database with a hyphenated citation
> key in the one citation -- I'm too scared. Interestingly, while I can
> create a database with a hyphenated name I can't delete it (at least
> with an sqlite backend) -- the deletedb operation fails.

I'll have to investigate this. SQLite databases are deleted on the filesystem level by using an unlink() system call - I can't imagine why that would fail with a hyphen in the filename.

> IIRC, it is illegal to include colons in citation keys. I seem to
> recall they are automatically stripped out. It is currently possible to
> include hyphens in database names. But, if you made it illegal to
> include a hyphen in a database name that gives you a ready-made
> delimiter to use in citations.

Yes, colons are not allowed in IDREF attributes (xref linkend). And yes, refdbd indeed strips out colons in citation keys to avoid creating invalid output.

The remainder of your suggestion is less clear to me. If I understand correctly, you suggest to use the hyphen under the assumption that database names never have hyphens (this could indeed be enforced). But how do you distinguish between a citation key prefixed with a database name and a sole citation key containing a hyphen? As in:

dbname-citekey
cite-key

regards,
Markus

--
Markus Hoenicka mar...@ca... (Spam-protected email: replace the quadrupeds with "mhoenicka") http://www.mhoenicka.de
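The "ID" / "-ID" check described in this message might look roughly like the sketch below. It is written in Python purely for illustration; refdbd itself is C, and the function name split_legacy_citation is made up for this example rather than taken from the RefDB sources.

```python
from typing import Optional, Tuple

# Hypothetical illustration of the "ID" / "-ID" prefix scheme.
def split_legacy_citation(cite: str) -> Tuple[Optional[str], str]:
    # Check 1: a leading "ID" means there is no database part at all.
    if cite.startswith("ID"):
        return None, cite[2:]
    # Check 2: otherwise the first "-ID" from the left ends the
    # database name; anything after it belongs to the citation key.
    dbname, sep, key = cite.partition("-ID")
    if not sep:
        raise ValueError(f"malformed citation: {cite!r}")
    return dbname, key

print(split_legacy_citation("IDMILLER-IDRUM-2005"))
print(split_legacy_citation("dbname-IDMILLER-IDRUM-2005"))
# A database called "IDBASE" is misread as part of the key:
print(split_legacy_citation("IDBASE-IDMILLER2005"))
```

Hyphens and "-ID" sequences inside the citation key are harmless because only the leftmost match counts. The last call shows why database names such as "IDBASE" or "DATA-IDBASE" would have to be rejected.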
From: Markus H. <mar...@mh...> - 2006-07-08 14:21:51
|
David Nebauer <dav...@sw...> was heard to say: > IIRC, it is illegal to include colons in citation keys. I seem to > recall they are automatically stripped out. It is currently possible > to include colons in database names. But, if you made it illegal to > include a colon in a database name that gives you a ready-made > delimiter to use in citations. > > > I think that suggestion makes more sense. > Well, if I understand *that* correctly, you suggest to use these forms in LaTeX documents: \cite{citekey} \cite{dbname:citekey} This is more compact than the "-ID" kludge that I currently use. The only downside, if at all, is that this citation syntax is different from the one used in the full style in SGML/XML documents (where, as noted previously, we must not use a colon). However, this might only confuse those who work with both LaTeX and SGML/XML. Unless someone else has objections, I'll implement your suggestion shortly. regards, Markus -- Markus Hoenicka mar...@ca... (Spam-protected email: replace the quadrupeds with "mhoenicka") http://www.mhoenicka.de |
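For comparison, splitting the colon-separated form is much simpler. The following is a minimal Python sketch, assuming (as discussed above) that colons never occur in citation keys and are not allowed in database names; split_citation is a hypothetical name, not a RefDB function.

```python
from typing import Optional, Tuple

# Hypothetical illustration of the \cite{dbname:citekey} form.
def split_citation(cite: str) -> Tuple[Optional[str], str]:
    # Everything before the first colon is the database name; without
    # a colon the whole string is the citation key.
    dbname, sep, key = cite.partition(":")
    if not sep:
        return None, cite
    return dbname, key

print(split_citation("Agnew0"))
print(split_citation("refs_computing:Smith-Jones2005"))
```

Hyphens in either part no longer matter, and a citation without a database part simply falls back to the default database given on the command line (the -d switch of runbib).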
From: Markus H. <mar...@mh...> - 2006-07-08 22:23:16
|
Markus Hoenicka writes: > \cite{citekey} > \cite{dbname:citekey} > Just to let y'all know that the current Subversion version supports the above mentioned citation format in LaTeX documents. regards, Markus -- Markus Hoenicka mar...@ca... (Spam-protected email: replace the quadrupeds with "mhoenicka") http://www.mhoenicka.de |
From: Markus H. <mar...@mh...> - 2006-07-08 14:35:03
|
David Nebauer <dav...@sw...> was heard to say:

> From one of the many guides on LaTeX:
>
>     You can use any of the standard characters that you find on your
>     keyboard, except the following 10 symbols:
>         { } % & $ # _ ^ ~ \
>     These symbols may only occur in LATEX commands.
>
> The first seven of the characters shown above are included as literals
> by escaping them with a backslash.

Thanks for that.

> There's something else. You may recall some time ago all the trouble
> taken to ensure entities such as &mdash; and &amp; are preserved in
> database reference entries and subsequently then preserved throughout
> DocBook processing. Many of my references include entities in document
> titles. Well, those entities are now appearing in the bibtex entries
> created by runbib. As you noted, the raw ampersands choke LaTeX. Is
> there any way of converting those xml-safe entities to LaTeX equivalents
> as runbib exports them? In the case of '&mdash;' that would be '---'.

I don't think this is too hard. I'll have to adapt the character replacement routine which currently converts the input to XML-safe strings to replace the entities back to something that LaTeX can grok. Most of the work is to compile a table that defines the required translations.

regards,
Markus

--
Markus Hoenicka mar...@ca... (Spam-protected email: replace the quadrupeds with "mhoenicka") http://www.mhoenicka.de
From: Markus H. <mar...@mh...> - 2006-07-08 22:21:12
|
David Nebauer writes:
> There's something else. You may recall some time ago all the trouble
> taken to ensure entities such as &mdash; and &amp; are preserved in
> database reference entries and subsequently then preserved throughout
> DocBook processing. Many of my references include entities in document
> titles. Well, those entities are now appearing in the bibtex entries
> created by runbib. As you noted, the raw ampersands choke LaTeX. Is
> there any way of converting those xml-safe entities to LaTeX equivalents
> as runbib exports them? In the case of '&mdash;' that would be '---'.

I thought about this a bit more. I'm afraid this is going to get far more complex than I thought in the first place. We need to:

- replace XML entities that stem from risx documents or which were deliberately used in RIS data. E.g. '&mdash;' -> '---'

- backslash-escape LaTeX command characters unless, and that's the catch, they are used as LaTeX commands. A LaTeX-only user may rightfully expect e.g. author names like 'H\"{a}{\ss}ler' (as imported from a bibtex file) to be processed correctly, or e.g. '{\bf emphasized}' words in titles. refdbd would have to acquire a thorough knowledge of LaTeX commands to cope with this.

- translate foreign letters and letters with diacritics to their TeX equivalents from, and that's the catch here, any supported character encoding. The same TeX representation of such a letter may be encoded as a variety of one to three-byte sequences in different uni- or multibyte character sets.

Is anyone aware of a library or a tool that implements these transformations? I'm only aware of tex2mail which does the reverse.

regards,
Markus

--
Markus Hoenicka mar...@ca... (Spam-protected email: replace the quadrupeds with "mhoenicka") http://www.mhoenicka.de
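The first two steps listed in this message (entities back to characters, then escaping LaTeX command characters) can be sketched as below. This is a deliberately naive Python illustration, not the routine in refdbd: to_latex and the translation table are assumptions for the example, and it escapes every command character without trying to recognise intentional LaTeX markup, which is exactly the catch described above.

```python
import html

# Hypothetical translation table: LaTeX command characters plus two
# dashes. str.translate works in a single pass, so the backslashes
# introduced by one replacement are never escaped a second time.
_LATEX_TABLE = str.maketrans({
    "\\": r"\textbackslash{}",
    "{": r"\{", "}": r"\}",
    "#": r"\#", "$": r"\$", "%": r"\%", "&": r"\&", "_": r"\_",
    "^": r"\textasciicircum{}",
    "~": r"\textasciitilde{}",
    "\u2014": "---",   # em dash, i.e. the decoded &mdash;
    "\u2013": "--",    # en dash, i.e. the decoded &ndash;
})

def to_latex(text: str) -> str:
    # Step 1: turn XML/HTML entities back into plain characters.
    # Step 2: escape or replace whatever LaTeX cannot take literally.
    return html.unescape(text).translate(_LATEX_TABLE)

print(to_latex("Law &amp; Order &mdash; 50% off"))
print(to_latex(r"{\bf emphasized}"))
```

The second call shows the cost of the naive approach: markup that a LaTeX-only user entered on purpose comes out as literal text.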
From: Markus H. <mar...@mh...> - 2006-07-10 21:46:40
|
Hi David, David Nebauer writes: > I've also been thinking further about this. It seems to me the issu= e is=20 > what formats to use when inputting and outputting: >=20 >=20 > ------------> output for XML > input data | > -------------> STORAGE -------| > FORMAT | > ------------> output for BibTeX/LaT= eX >=20 >=20 > A simple scheme seems to me to consist of the following: > - input and store all data as unicode > - during output perform needed translations >=20 This may be a necessary tradeoff. So far, the LaTeX diehards were able to use many LaTeX constructs in e.g. the author names or the titles (think of italics, superscripts, or subscripts which are not uncommon in e.g. physics or chemistry papers). This may be a pain in the neck to search afterwards, but at least you could do it. If we follow the simplified scheme you outline above, you can no longer use these LaTeX hacks. I'd like to hear from the LaTeX users (I'm not one of them currently) how important it is to include LaTeX markup into the data.=20= > The *minimum* necessary translations needed are: >=20 > 1. BibTeX/LaTeX >=20 > Convert the following control characters to appropriate escape seque= nces: > # $ % & ~ =5F ^ \ { } >=20 I'm afraid there's more to it. We have to remove lots of commands like the above mentioned boldface, italics, superscript, subscript and such. These commands do not make any sense in the context of SGML/XML. We also have to translate foreign characters (\"{a}, {\ss} an= d similar constructs. Part of this translation can be achieved through tex2mail, although it does not seem to create UTF-8 (but see below). > 2. XML/SGML >=20 > Convert the two illegal characters to their respective entities: > & < >=20 > While not illegal, it is customary also to convert the following cha= racters: > > ' " I was under the impression that &, <, and > always have to be replaced as these are part of the XML markup. 
Why and to what would you like to convert ' and "=3F >=20 >=20 > The question then arises as to whether any other translation is=20 > necessary and/or desirable. In theory no other translation is=20 > necessary. LaTeX can process "raw" unicode using the 'ucs' package.= =20 This is good news. My only LaTeX book dates back to 1999, and Unicode does not seem to be mentioned. The transformations would be so much simpler if we didn't have to create LaTeX commands to represent foreign= or special characters. > The XML standard states, "Legal characters are tab, carriage return,= =20 > line feed, and the legal graphic characters of Unicode and ISO/IEC 1= 0646". >=20 This is pretty much what RefDB currently outputs. > Having said that, it may be desirable to translate non-ascii charact= ers=20 > into decimal numeric character references (e.g., 'â') for XML o= r,=20 > for LaTeX, appropriate escape sequences. Perhaps this could be opti= onal=3F >=20 I think it is common to leave the non-ascii characters in the xml file and use the proper charset declaration (UTF-8 by default). IMHO character entities do not have any advantage over UTF-8. I'm not sure about LaTeX output. How hard is it to make the use of the ucs package mandatory for RefDB users=3F Once it is installed, it is as simple as inserting one line at the top of your document, isn't it=3F > One interesting consequence of this is that author names may contain= =20 > non-ascii characters. If, when new references are added to refdb, t= here=20 > is no citation key specified, the citekey is constructed by mangling= =20 > primary author surname and year. If citekey is restricted to ascii=20= > characters then non-ascii author surname characters would have to be= =20 > stripped or converted (e.g., =E4 -> a, =DF -> ss). >=20 Currently non-ascii characters are simply stripped. 
You always have the option to specify a citation key explicitly when adding a reference, using any reasonable translation of the foreign characters to ascii.

> escapechars     converts non-ASCII (UTF-8, Latin-1 etc.) files to
>                 ASCII with XML or TeX escape sequences
> latex2utf8txt   converts LaTeX files to UTF-8 text, removes line
>                 breaks from paragraphs

Thanks for the pointers. I've downloaded these scripts and will give them a try. If the latter works as advertized, it could be used as a post-processing filter after bib2ris (or, if I'll ever end up having too much time on my hands, I could reimplement bib2ris in Perl and integrate the conversion code). The former is a bit trickier as the conversion should run in refdbd. However, the script looks simple enough that I might be able to recode the algorithm in C.

regards,
Markus

--
Markus Hoenicka
mar...@ca...
(Spam-protected email: replace the quadrupeds with "mhoenicka")
http://www.mhoenicka.de
|
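To illustrate the difference between stripping and transliterating when a citekey is built from surname and year, here is a minimal sketch in Python. It is not RefDB code: the function names, the SPECIAL table and the choice of Unicode NFKD decomposition are all assumptions made for the example, and the table is deliberately incomplete.

```python
import unicodedata

# Characters without a canonical decomposition need explicit replacements.
# Illustrative only; a real table would be much longer.
SPECIAL = {"ß": "ss", "æ": "ae", "Æ": "AE", "ø": "o", "Ø": "O"}


def strip_non_ascii(text):
    """Drop every non-ASCII character (the behaviour described as current)."""
    return "".join(ch for ch in text if ord(ch) < 128)


def to_ascii(text):
    """Transliterate to ASCII: explicit table first, then drop accents."""
    out = []
    for ch in text:
        if ch in SPECIAL:
            out.append(SPECIAL[ch])
            continue
        # NFKD splits e.g. 'ä' into 'a' plus a combining diaeresis.
        decomposed = unicodedata.normalize("NFKD", ch)
        out.append("".join(c for c in decomposed if ord(c) < 128))
    return "".join(out)


def make_citekey(surname, year, transliterate=True):
    """Hypothetical citekey builder: surname followed by year."""
    base = to_ascii(surname) if transliterate else strip_non_ascii(surname)
    return f"{base}{year}"


print(make_citekey("Häßler", 2006, transliterate=False))
print(make_citekey("Häßler", 2006))
```

The first call shows what plain stripping does to the surname, the second what a transliteration step would produce instead. Appending 'a', 'b', etc. for duplicates is left out.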
From: Markus H. <mar...@mh...> - 2006-07-11 15:54:02
|
Hi David,

David Nebauer <dav...@sw...> was heard to say:

> If you instead go with my idea to store as unicode you don't need to
> know anything about the eventual output format when you store the
> reference. Indeed, the user doesn't have to know at that time. The
> same references can be used for either DocBook or LaTeX. You can easily
> add in other output formats later and all you have to do is write
> another output filter.
>

I'm all with you here. I just wanted to get opinions from real-world LaTeX users whether or not it makes sense to preserve the markup.

> Your point is true but I say it is a small loss. Using LaTeX formatting
> codes means your references can never be used for any other format
> without hacking in some kind of conversion. RefDB is designed to be a
> long-term reference database enabling the contained references to be
> used all kinds of interesting ways. Use of format-specific markup
> limits your future choices. As a minor example it prevents their use in
> DocBook documents.

True, but I assumed that only those might want to keep the markup who use RefDB solely for LaTeX.

>
> Another issue is the ability of library and indexing systems to handle
> such formatting complexities as superscripting, subscripting and font
> changes. You know far more about such things than I, but I would guess
> even the most complex article title is reduced to canonical ascii for
> storage in many cataloguing systems. I presume the algorithms for such
> simplification are fairly predictable. Anyone searching for the journal
> article by title would be easily able to predict the stored character
> sequence. I would endeavour to suggest the simplified form of title
> would be entirely acceptable in any kind of bibliography.
>
> In any event, how would such a complex title be stored in plain ascii?
> Or Unicode? Or even XML (imagine the attempt to use MathML in a title
> string!)?
>

The database which I use mostly (www.pubmed.org) indeed "ascii-izes" the titles.
The tagged format uses plain ASCII with a pretty crude transliteration, whereas the XML format uses Unicode.

> As mentioned above, I am unconvinced about the utility of keeping
> boldface, italics, superscript and subscript-type markup. As for
> foreign characters, almost any foreign character can be represented in

I'm afraid I didn't express my thoughts very well here. What I was talking about is that a reference imported from bibtex may contain markup like

"Title with an {\bf emphasized} word"

It is not sufficient to escape characters but we have to remove the "{\bf " and the "}" sequences before we import the reference. This is what one of the scripts that you pointed me to as well as tex2mail do.

> To allow attribute values to contain both single and double quotes,
> the apostrophe or single-quote character (') may be represented as
> "&apos;", and the double-quote character (") as "&quot;".
>
>
> The relevant portion states, "The right angle bracket (>) *may* be
> represented using the string '&gt;'," but "*must*, for compatibility, be
> escaped using '&gt;' or a character reference when it appears in the
> string ']]>'." (emphases mine)
>
> The last paragraph in the quote refers to straight single and double
> quotation mark entities.
>

But it appears to talk about attribute values. XML output from RefDB never puts quotes into attribute values, so we're left with &,<,>.

> It worked for me "out of the box". I installed the 'ucs' package
> (apt-get install latex-ucs), added those two lines to the preamble, ran
> 'latex test' and, presto, gloriously rendered unicode.
>

This is great news indeed. I will have to mention this in the manual.

I take from this discussion:

1) Use a bib2ris post-processing script (or rewrite bib2ris to contain such code) which strips markup like boldface, superscript etc. and translates foreign characters entered as LaTeX constructs to their Unicode equivalents.

2) Modify the code to prevent XML entities to show up in LaTeX output.
3) Add code to escape the LaTeX command characters in the LaTeX output.

The second point is a bit tricky. References imported from RIS usually do not contain entities, but references imported from risx are likely to do. Either I convert these entities during import, or I remove them during LaTeX export. The former seems cleaner to me, and I think this is what you had in mind.

regards,
Markus

--
Markus Hoenicka
mar...@ca...
(Spam-protected email: replace the quadrupeds with "mhoenicka")
http://www.mhoenicka.de
|
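The import-side cleanup in points 1) and 2) could look roughly like the following Python sketch. It is only an illustration under stated assumptions: the function names are hypothetical, the font-switch list and the accent table cover just a handful of constructs, and html.unescape from the Python standard library stands in for whatever entity conversion RefDB would actually use.

```python
import html
import re

# Old-style font switches such as "{\bf text}". Illustrative list only.
FONT_GROUP = re.compile(r"\{\\(?:bf|it|em|sl|sc|tt|rm)\s+([^{}]*)\}")

# A few LaTeX accent constructs and their Unicode equivalents.
# Illustrative only; a real table would cover many more.
ACCENTS = {
    r'\"{a}': "ä",
    r'\"{o}': "ö",
    r'\"{u}': "ü",
    r"{\ss}": "ß",
}


def from_bibtex(text):
    """Hypothetical bib2ris post-filter: accents to Unicode, markup removed."""
    for tex, uni in ACCENTS.items():
        text = text.replace(tex, uni)
    # Repeat until nothing changes so nested font groups are unwrapped too.
    previous = None
    while previous != text:
        previous = text
        text = FONT_GROUP.sub(r"\1", text)
    return text


def from_risx(text):
    """Hypothetical risx import filter: entities become plain characters."""
    return html.unescape(text)


print(from_bibtex(r"Title with an {\bf emphasized} word"))
print(from_bibtex(r'H\"{a}{\ss}ler'))
print(from_risx("H&#228;&#223;ler &amp; Co"))
```

Converting during import, as preferred above, means the stored text is plain Unicode and no output filter ever has to know about entities or LaTeX markup.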
From: David N. <dav...@sw...> - 2006-07-11 16:37:20
|
Hi Markus,

> I take from this discussion:
>
> 1) Use a bib2ris post-processing script (or rewrite bib2ris to contain such
> code) which strips markup like boldface, superscript etc. and translates
> foreign characters entered as LaTeX constructs to their Unicode equivalents.
>
> 2) Modify the code to prevent XML entities to show up in LaTeX output.
>
> 3) Add code to escape the LaTeX command characters in the LaTeX output.
>
> The second point is a bit tricky. References imported from RIS usually do not
> contain entities, but references imported from risx are likely to do. Either I
> convert these entities during import, or I remove them during LaTeX export. The
> former seems cleaner to me, and I think this is what you had in mind.

Yes, in my view the storage format is Unicode without markup:

BibTeX -------                     ---------> DocBook
              |                   |
              |                   |
RIS ----------+--> STORAGE -------|
              |    (Unicode)      |
              |                   |
RISX ---------                     ---------> LaTeX

Whatever the input format, all references end up in the same storage format (Unicode sans markup). This would require stripping out XML entities and LaTeX markup. With luck you can use existing tools to do this.

The stored references can then be output in either DocBook- or LaTeX-compatible format.

This seems to me to be an elegant way of dealing with the mishmash of input and output formats.

Regards,
David.
|
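The output half of this scheme, one filter per target format applied to the same stored Unicode string, can be sketched in Python as follows. Everything here is an assumption for illustration: the function and table names are made up, the LaTeX replacements are one common choice rather than anything RefDB prescribes, and the LaTeX side assumes the document loads a Unicode input package such as 'ucs', so non-ASCII characters pass through untouched.

```python
from xml.sax.saxutils import escape as xml_escape

# Replacements for the LaTeX command characters # $ % & ~ _ ^ \ { }.
LATEX_ESCAPES = {
    "\\": r"\textbackslash{}",
    "{": r"\{",
    "}": r"\}",
    "#": r"\#",
    "$": r"\$",
    "%": r"\%",
    "&": r"\&",
    "_": r"\_",
    "~": r"\textasciitilde{}",
    "^": r"\textasciicircum{}",
}


def to_latex(text):
    """Escape character by character so inserted backslashes stay intact."""
    return "".join(LATEX_ESCAPES.get(ch, ch) for ch in text)


def to_docbook(text):
    """Escape &, < and > for XML content; other characters stay as Unicode."""
    return xml_escape(text)


# Adding another output format means adding one more entry here.
OUTPUT_FILTERS = {"docbook": to_docbook, "latex": to_latex}


def render(stored_text, fmt):
    """Apply the output filter for the requested format."""
    return OUTPUT_FILTERS[fmt](stored_text)


stored = "Häßler & Co: 100% <new>"
print(render(stored, "docbook"))
print(render(stored, "latex"))
```

The stored string is never modified; each format only sees it through its own filter, which is what keeps the same reference usable for both DocBook and LaTeX.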