Thread: Re: [Refdb-devel] latex bibliographies with multiple databases (Page 2)
From: Damien J. D. <D.J...@cs...> - 2006-07-11 17:57:38
Gidday

As a novice latex user it seems okay; I can't envisage any difficulties. Probably jumping the gun a bit here, but I'd like to see the following kind of latex markup reproduced (not stripped entirely):

BIBTEX:
TITLE = {Why {AM} and {EURISKO} appear to work},

Current RIS:
TI  - Why {AM} and {EURISKO} appear to work

Current RISX:
<title type="full">Why {AM} and {EURISKO} appear to work</title>

The reason is that bibtex will capitalise your bibliography according to the current bibliography style (generally with a single leading capital), and you need to inform bibtex that certain passages are to be formatted verbatim. Other than that, I can't envisage the need for keeping latex markup - I've been manually stripping everything else.

I enter most of my references using risx (I use bibtex and RIS where provided but usually edit them manually before submitting them to RefDB), and I don't appear to have any entities coming back (probably because I don't know how to use entities). Except I think entities are used in some of the URLs - I don't know if they're necessary or not, they're just there.

Peace
Damien

David Nebauer wrote:
> Hi Markus,
>
>> I take from this discussion:
>>
>> 1) Use a bib2ris post-processing script (or rewrite bib2ris to contain such
>> code) which strips markup like boldface, superscript etc. and translates
>> foreign characters entered as LaTeX constructs to their Unicode equivalents.
>>
>> 2) Modify the code to prevent XML entities from showing up in LaTeX output.
>>
>> 3) Add code to escape the LaTeX command characters in the LaTeX output.
>>
>> The second point is a bit tricky. References imported from RIS usually do not
>> contain entities, but references imported from risx are likely to. Either I
>> convert these entities during import, or I remove them during LaTeX export.
>> The former seems cleaner to me, and I think this is what you had in mind.
>
> Yes, in my view the storage format is Unicode without markup:
>
>   BibTeX -------                 ---------> DocBook
>                 |               |
>                 |               |
>   RIS ---------+--> STORAGE ----
>                 |   (Unicode)   |
>                 |               |
>   RISX ---------                ---------> LaTeX
>
> Whatever the input format, all references end up in the same storage
> format (Unicode sans markup). This would require stripping out XML
> entities and LaTeX markup. With luck you can use existing tools to do
> this. The stored references can then be output in either DocBook- or
> LaTeX-compatible format. This seems to me to be an elegant way of
> dealing with the mishmash of input and output formats.
>
> Regards,
> David.
>
> _______________________________________________
> Refdb-devel mailing list
> Ref...@li...
> https://lists.sourceforge.net/lists/listinfo/refdb-devel
From: David N. <dav...@sw...> - 2006-07-11 19:51:35
Hi Damien,

Damien Jade Duff wrote:
> I'd like to see the following kind of latex markup reproduced (not
> stripped entirely):
> TITLE = {Why {AM} and {EURISKO} appear to work},

Alternately, you could save the title in plain Unicode with that capitalisation and refdb's BibTeX output filter would wrap any abnormally capitalised words in braces. That way your reference can be used in either DocBook XML/SGML or LaTeX output.

Regards,
David.
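The output filter David proposes might look roughly like the sketch below. This is only an illustration of the idea, not RefDB's actual code; the function name and the exact rule for what counts as "abnormally capitalised" are assumptions.

```python
import re

def protect_caps(title):
    """Wrap words with unexpected capitals in braces so that BibTeX
    bibliography styles leave their case untouched."""
    out = []
    for i, word in enumerate(title.split()):
        # "Abnormal": a capital after the first letter of a word, or a
        # leading capital on any word other than the first.
        abnormal = re.search(r"[A-Z]", word[1:]) or (i > 0 and word[:1].isupper())
        out.append("{" + word + "}" if abnormal else word)
    return " ".join(out)

print(protect_caps("Why AM and EURISKO appear to work"))
# Why {AM} and {EURISKO} appear to work
```

Running this over a plain-Unicode title reproduces exactly the braced BibTeX form Damien asked for, without storing any markup in the database.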
From: Markus H. <mar...@mh...> - 2006-07-11 21:26:08
David Nebauer writes:
> > TITLE = {Why {AM} and {EURISKO} appear to work},
>
> Alternately, you could save the title in plain Unicode with that
> capitalisation and refdb's BibTeX output filter would wrap any
> abnormally capitalised words in braces. That way your reference can be
> used in either DocBook XML/SGML or LaTeX output.

If it is a matter of wrapping uppercased words in curly brackets, this is certainly doable. Is it likely to have uppercased words in bibtex data which are *not* supposed to be rendered in all-caps?

regards,
Markus

--
Markus Hoenicka
mar...@ca... (Spam-protected email: replace the quadrupeds with "mhoenicka")
http://www.mhoenicka.de
From: Damien J. D. <D.J...@cs...> - 2006-07-12 11:28:02
Markus Hoenicka wrote:
> David Nebauer writes:
> > > TITLE = {Why {AM} and {EURISKO} appear to work},
> >
> > Alternately, you could save the title in plain Unicode with that
> > capitalisation and refdb's BibTeX output filter would wrap any
> > abnormally capitalised words in braces. That way your reference can be
> > used in either DocBook XML/SGML or LaTeX output.
>
> If it is a matter of wrapping uppercased words in curly brackets, this
> is certainly doable. Is it likely to have uppercased words in bibtex data
> which are *not* supposed to be rendered in all-caps?

Gidday

I imagine only where the style demands it (e.g. APA). Some journals have titles with pretty much everything (both proper and non-proper nouns, subtitles and acronyms) capitalised. On the other hand, APA requires nouns that aren't proper to go lowercase when citing - proper nouns and subtitles can be uppercase. I don't know how this is managed in Docbook styles or Endnote etc., but presumably titles are used verbatim, because I can't imagine any logic funky enough to figure out whether a word is a proper noun or not without some extra markup.

If we're going to export all capitals to bibtex verbatim sans markup, then I think we might as well, as a first approximation, rather than trying to second-guess bibtex, just set the whole title as verbatim - e.g.

TITLE = {{Why AM and EURISKO appear to work}},

Folks who use latex would probably have to turn non-proper nouns that are capitalised into lowercase before putting them into RefDB, and hope to seldom encounter a citation style that requires these things to be reproduced verbatim from the original article (I don't think I have yet). The same may possibly apply to BOOKTITLE, JOURNAL, SERIES, SCHOOL, PUBLISHER, INSTITUTION etc.

My second two pennies' worth.

Regards
Damien
From: Markus H. <mar...@mh...> - 2006-07-12 12:30:12
Damien Jade Duff <D.J...@cs...> was heard to say:
> trying to second guess bibtex, just set the whole title as verbatim - e.g.
>
> TITLE = {{Why AM and EURISKO appear to work}},
>
> Folks who use latex would probably have to turn non-proper nouns that
> are capitalised into lowercase before putting them in to RefDB and hope
> to seldom encounter a citation style that requires these things to be
> reproduced verbatim from the original article (I don't think I have yet).

RefDB uses bibliography styles even for generating LaTeX bibliographies. These styles currently only take care of the capitalization of titles. IIRC you get the best results if you add your data using mixed case and then pick the proper style for output. If we combine that with the curly brackets as shown above, you should be able to generate at least lowercased (except the first char), all-caps, and mixed-case output.

regards,
Markus

--
Markus Hoenicka
mar...@ca... (Spam-protected email: replace the quadrupeds with "mhoenicka")
http://www.mhoenicka.de
From: Damien J. D. <D.J...@cs...> - 2006-07-12 13:10:26
> RefDB uses bibliography styles even for generating LaTeX bibliographies.

Oh. Right!

> should be able to generate at least lowercased (except the first char),

I don't think I'd be able to use that because presumably it would lowercase proper nouns, acronyms and subtitle headings.

> all-caps, and mixed case output.

The latter is fine except, as before, when your article title capitalises verbs and improper nouns. I'm happy to do it this way though, I'll just make sure no capitalised verbs and improper nouns get into my database and hope not to encounter citation styles that require them.

Cheers
Damien
From: David N. <dav...@sw...> - 2006-07-30 05:50:45
Damien Jade Duff wrote:
>> all-caps, and mixed case output.
>
> The latter is fine except, as before, when your article title
> capitalises verbs and improper nouns. I'm happy to do it this way
> though, I'll just make sure no capitalised verbs and improper nouns get
> into my database and hope not to encounter citation styles that require
> them.

This has been bothering me. If the default model for encoding is:

XML LaTeX | | | | INPUT unicode unicode \ / \ / \ / \ / v STORAGE unicode / \ / \ / \ / \ OUTPUT convert convert | | | | v v XML LaTeX

There will undoubtedly be users like Damien who make the choice to include markup in their bibliographic data. Although it effectively traps them into one output format, the trade-off is greater control over how that data is eventually displayed.

Would it not be simple, given the above model, to have a command-line switch for runbib and refdbib that skips the conversion step? That seems to me an easy way to accommodate the wishes of everybody.

Regards,
David.
From: David N. <dav...@sw...> - 2006-07-30 21:29:24
David Nebauer wrote:
> If the default model for encoding is:
>
> XML LaTeX | | | | INPUT unicode unicode \ / \ / \ / \ / v STORAGE unicode / \ / \ / \ / \ OUTPUT convert convert | | | | v v XML LaTeX

Spaces were stripped. That should look like:

...........XML......LaTeX
............|.........|
............|.........|
INPUT.....unicode.unicode
.............\......./
..............\...../
...............\.../
................\./
.................v
STORAGE.......unicode
................/.\
.............../...\
............../.....\
............./.......\
OUTPUT....convert.convert
............|.........|
............|.........|
............v.........v
...........XML......LaTeX

Regards,
David.
From: David N. <dav...@sw...> - 2006-08-17 08:39:48
Hi Damien,

Damien Jade Duff wrote:
> Incidentally, how do docbook users deal with this capitalisation
> issue? i.e. capitalisation of acronyms and proper nouns versus of
> subtitles and improper nouns. Anyone?

The bibliography style options for TITLE are:

case:  ASIS ICAPS LOWER UPPER
style: BOLD BOLDITALIC BOLDITULINE BOLDULINE ITALIC ITULINE NONE SUB SUPER ULINE

As you can see from the case options, your only choice for preserving mixed case is to use ASIS, which of course means "as is". In that situation it is up to the person inputting the reference in the first place to get the case right.

Does that answer your question?

Regards,
David.
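For illustration, the four case options might behave like the sketch below. This is a hypothetical reimplementation, not RefDB's code; in particular the exact ICAPS behaviour (initial-capitalising every word) is an assumption.

```python
def apply_case(title, case):
    """Apply one of the TITLE case options: ASIS, ICAPS, LOWER, UPPER."""
    if case == "ASIS":   # leave the input untouched
        return title
    if case == "LOWER":
        return title.lower()
    if case == "UPPER":
        return title.upper()
    if case == "ICAPS":  # assumed: initial-capitalise every word
        return " ".join(w[:1].upper() + w[1:].lower() for w in title.split())
    raise ValueError("unknown case option: " + case)

# Note how every option except ASIS destroys acronym capitalisation,
# which is exactly the problem Damien raised:
print(apply_case("Why AM and EURISKO appear to work", "ICAPS"))
# Why Am And Eurisko Appear To Work
```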
From: Damien J. D. <D.J...@cs...> - 2006-08-17 14:22:37
>> Incidentally, how do docbook users deal with this capitalisation
>> issue? i.e. capitalisation of acronyms and proper nouns versus of
>> subtitles and improper nouns. Anyone?
>
> The bibliography style options for TITLE are
> case: ASIS ICAPS LOWER UPPER
> style: BOLD BOLDITALIC BOLDITULINE BOLDULINE ITALIC ITULINE NONE SUB SUPER ULINE
>
> As you can see from the case options your only choice for preserving
> mixed case is to use ASIS, which of course means "as is". In that
> situation it is up to the person inputting the reference in the first
> place to get the case right.
>
> Does that answer your question?

Yes, thank you.

Since Docbook users get by without this extra markup, I think we latex users should be able to get by without it too. I have no complaints. Though it seems logically possible, I doubt I will ever actually find an instance where anything more complicated than the above options is required.

Peace
Damien
From: Damien J. D. <D.J...@cs...> - 2006-08-16 15:49:57
Gidday gidday

As for entering unicode via jedit, there is a graphical plugin for it:
http://plugins.jedit.org/plugins/?CharacterMap
But I haven't been able to get it to work with unicode, and it seems that the current implementation is a tad broken:
http://community.jedit.org/?q=node/view/1628&pollresults%5B1252%5D=1
We have to wait for a fix to be uploaded.

The good thing about using unicode as storage is that it's a blank slate, and that means that if the community decides that some kind of markup is in fact useful, the maintainer(s!) can choose to support it at a later point via whatever method they like, and in theory they(he!) know(s) exactly what kind of data is in the database at any given time: i.e. more control over what is coming in and out. See, I wouldn't be surprised if the required-capitalisation markup was asked for at some time in the future when a user wants to use latex+docbook but exploit the latex markup, but it can be done from a unicode starting point anyway.

My 8.2 cents' worth.

Incidentally, how do docbook users deal with this capitalisation issue? i.e. capitalisation of acronyms and proper nouns versus of subtitles and improper nouns. Anyone?

Peace
Damien

David Nebauer wrote:
> Damien Jade Duff wrote:
>>> all-caps, and mixed case output.
>>
>> The latter is fine except, as before, when your article title
>> capitalises verbs and improper nouns. I'm happy to do it this way
>> though, I'll just make sure no capitalised verbs and improper nouns get
>> into my database and hope not to encounter citation styles that require
>> them.
>
> This has been bothering me. If the default model for encoding is:
>
> ...........XML......LaTeX
> ............|.........|
> ............|.........|
> INPUT.....unicode.unicode
> .............\......./
> ..............\...../
> ...............\.../
> ................\./
> .................v
> STORAGE.......unicode
> ................/.\
> .............../...\
> ............../.....\
> ............./.......\
> OUTPUT....convert.convert
> ............|.........|
> ............|.........|
> ............v.........v
>
> There will undoubtedly be users like Damien who make the choice to
> include markup in their bibliographic data. Although it effectively
> traps them into one output format the trade-off is greater control over
> how that data is eventually displayed.
>
> Would not it be simple, given the above model, to have a command line
> switch for runbib and refdbib that skips the conversion step? That seems
> to me an easy way to accommodate the wishes of everybody.
>
> Regards,
> David.
From: Markus H. <mar...@mh...> - 2006-07-19 20:36:39
Hi David,

David Nebauer writes:
> Yes, in my view the storage format is Unicode without markup:
>
>   BibTeX -------                 ---------> DocBook
>                 |               |
>                 |               |
>   RIS ---------+--> STORAGE ----
>                 |   (Unicode)   |
>                 |               |
>   RISX ---------                ---------> LaTeX

I've done a little source code reading and testing in order to find out how RefDB mangles these kinds of input and output data. My results are as follows:

1) BibTeX input
bib2ris appears to work ok with UTF-8 encoded bibtex data. You can import the resulting RIS data as long as the input encoding is set to UTF-8 (the current default is ISO-8859-1, but it certainly makes sense to change that). If your bibtex data is plain ASCII with foreign and special characters encoded as LaTeX commands, the bib2ris output should be sent through the new refdb_latex2utf8txt script. I don't know whether it really has 100% coverage of the character-related LaTeX commands, but it is easy to extend if the need arises. With this in mind we can import bibtex data as plain Unicode.

2) RIS input
We'd have to educate users to author their RIS datasets in UTF-8, and to run RIS data from web sources (like Pubmed) through iconv before adding them to RefDB. All it takes is to set the default input encoding of refdbd for RIS data to UTF-8 (see above). Currently there are no provisions to translate entities or LaTeX commands, but if used correctly there should be no need for such hacks. The result is, as above, plain Unicode.

3) risx input
I've rediscovered a nice feature of expat (which refdbd uses to parse all incoming XML data). The output data of expat are always UTF-8, with all entities expanded to their Unicode equivalents. Thus no extra conversion step is required to get rid of entities and to store plain Unicode.

4) SGML/XML output (bibliographies, db31/tei/html backends)
"<>&" are replaced with their corresponding entities. In addition, the current code contains replacements for — ‘ and ’ (the em dash and the curly quotes). I know that I was asked to add these, but I can't remember the context. I wonder whether it would make more sense to keep these characters as Unicode.

5) LaTeX output
There are currently no attempts to escape LaTeX command characters. I'm about to add this code.

6) other output (RIS, screen)
No replacements. If you retrieve data as UTF-8, you'll get what you want.

As always, I might have missed some border cases. If you experience different behaviour, please let me know.

One thing that should be discussed is how easy it is for RefDB users to author UTF-8 data, be it RIS, bibtex, or XML. You can always insert the numeric form into XML data (e.g. &#x00B1;) but I'm afraid this won't work for the other data formats. As an Emacs user I've got Norm Walsh's xmlunicode.el (http://nwalsh.com/emacs/xmlchars/) which allows you to select characters from a pop-up list or from the minibuffer with entity-name completion, and which also defines an input mode that offers on-the-fly replacement of entities. Is there similar support available for other editors (vim, jedit) which should be mentioned in the manual?

regards,
Markus

--
Markus Hoenicka
mar...@ca... (Spam-protected email: replace the quadrupeds with "mhoenicka")
http://www.mhoenicka.de
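The escaping Markus describes for point 5 could look roughly like the sketch below. This is an illustration of the general technique, not the code actually committed to RefDB; the replacement strings for backslash, tilde and caret are the common LaTeX idioms.

```python
# LaTeX command characters and their usual escaped forms.
LATEX_SPECIALS = {
    "\\": r"\textbackslash{}",
    "~": r"\textasciitilde{}",
    "^": r"\textasciicircum{}",
    "#": r"\#", "$": r"\$", "%": r"\%",
    "&": r"\&", "_": r"\_", "{": r"\{", "}": r"\}",
}

def escape_latex(text):
    """Escape characters that LaTeX would otherwise interpret as commands."""
    return "".join(LATEX_SPECIALS.get(ch, ch) for ch in text)

print(escape_latex("R&D costs 100% more_than expected"))
# R\&D costs 100\% more\_than expected
```

A character-by-character mapping like this avoids the ordering pitfall of sequential string replacements (where escaping "&" could otherwise mangle the backslash inserted by an earlier substitution).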
From: Markus H. <mar...@mh...> - 2006-07-19 20:42:20
Markus Hoenicka writes:
> 2) RIS input
> We'd have to educate users to author their RIS datasets in UTF-8, and
> to run RIS data from web sources (like Pubmed) through iconv before
> adding them to RefDB. All it takes is to set the default input
> encoding of refdbd for RIS data to UTF-8 (see above). Currently there
> are no provisions to translate entities or LaTeX commands, but if used
> correctly there should be no need to use such hacks. The result is, as
> above, plain Unicode.

Actually we can still use ISO-8859-1 or whatever as the RIS input format, as refdbd internally converts it to UTF-8 if the database uses this encoding. Forcing UTF-8 for RIS data only makes sense if people use both bibtex and RIS data.

regards,
Markus

--
Markus Hoenicka
mar...@ca... (Spam-protected email: replace the quadrupeds with "mhoenicka")
http://www.mhoenicka.de
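The iconv step recommended for web-sourced RIS data is just a re-encode; a minimal sketch of the same operation (the file names and the helper name are only examples):

```python
def ris_to_utf8(src_path, dst_path, src_encoding="iso-8859-1"):
    """Re-encode a RIS file (e.g. ISO-8859-1 data from Pubmed) as UTF-8,
    equivalent to `iconv -f ISO-8859-1 -t UTF-8` on the command line."""
    with open(src_path, "r", encoding=src_encoding) as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(text)
```

After this step the RIS data can be fed to refdbd with its default input encoding set to UTF-8, as discussed above.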
From: David N. <dav...@sw...> - 2006-07-24 10:36:30
Hi Markus,

Markus Hoenicka wrote:
> 1) BibTeX input
> [W]e can import bibtex data as plain Unicode.
>
> 2) RIS input
> All it takes is to set the default input
> encoding of refdbd for RIS data to UTF-8.
>
> 3) risx input
> [N]o extra conversion step is required to get rid of
> entities and to store plain Unicode.

All looks too easy so far.

> 4) SGML/XML output (bibliographies, db31/tei/html backends)
> "<>&" are replaced with their corresponding entities. In addition, the
> current code contains replacements for — ‘ and ’. I
> know that I was asked to add these, but I can't remember the
> context. I wonder whether it would make more sense to keep these
> characters as Unicode.

I fear I may be partly the cause. I was using refdb purely for docbook, and since I didn't use UTF-8 encoding for my references, the only way to preserve characters like the em dash was to protect them as entities throughout the reference's life cycle. They were not only protected during output, as you mention above, but were protected at input also. In moving to a more sensible unicode-based system, however, it no longer makes any sense to replace those characters with entities.

> 5) LaTeX output
> There are currently no attempts to escape LaTeX command
> characters. I'm about to add this code.

I see this code arrived today.

> 6) other output (RIS, screen)
> No replacements. If you retrieve data as UTF-8, you'll get what you want.
>
> One thing that should be discussed is how easy it is for RefDB users
> to author UTF-8 data, be it RIS, bibtex, or XML. Is there [Unicode input] support
> available for other editors (vim, jedit) which should be mentioned in
> the manual?

In Vim, many unicode characters (and certainly all the commonly used ones) are entered by means of digraphs (using two or more keystrokes to specify one character). The mnemonics for these are fairly intuitive, like 'a:' for a-umlaut. Any unicode character can be entered with 'Ctrl-v uxxxx' where 'xxxx' is the character code.

Regards,
David.
From: Markus H. <mar...@mh...> - 2006-07-24 11:04:20
David Nebauer <dav...@sw...> was heard to say:
> > 4) SGML/XML output (bibliographies, db31/tei/html backends)
> > "<>&" are replaced with their corresponding entities. In addition, the
> > current code contains replacements for — ‘ and ’. I
> > know that I was asked to add these, but I can't remember the
> > context. I wonder whether it would make more sense to keep these
> > characters as Unicode.
>
> I fear I may be partly the cause. I was using refdb purely for
> docbook and since I didn't use UTF-8 encoding for my references the only
> way to preserve characters like the em dash was to protect them as entities
> throughout the reference's life cycle. They were not only protected
> during output, as you mention above, but were protected at input also.
> In moving to a more sensible unicode-based system, however, it no longer
> makes any sense to replace those characters with entities.

I see. Then I'll remove these entities again.

> > 5) LaTeX output
> > There are currently no attempts to escape LaTeX command
> > characters. I'm about to add this code.
>
> I see this code arrived today.

Yes. I didn't get round to announcing it, but please give it a real-world test to see whether it works ok.

> Many unicode characters (and certainly all the commonly used ones) are
> entered by means of digraphs (using two or more keystrokes to specify
> one character). The mnemonics for these are fairly intuitive, like 'a:'
> for a-umlaut. Any unicode character can be entered with 'Ctrl-v uxxxx'
> where 'xxxx' is the character code.

Is there a link to some doc that explains this, by any chance? I thought about adding something like a tip box to the docs that briefly explains how to deal with Unicode characters for the most popular editors. I'd like to add URLs for further information.

regards,
Markus

--
Markus Hoenicka
mar...@ca... (Spam-protected email: replace the quadrupeds with "mhoenicka")
http://www.mhoenicka.de
From: David N. <dav...@sw...> - 2006-07-24 15:38:51
Hi Markus,

Markus Hoenicka wrote:
>> Many unicode characters (and certainly all the commonly used ones) are
>> entered by means of digraphs (using two or more keystrokes to specify
>> one character). The mnemonics for these are fairly intuitive, like 'a:'
>> for a-umlaut. Any unicode character can be entered with 'Ctrl-v uxxxx'
>> where 'xxxx' is the character code.
>
> Is there a link to some doc that explains this, by any chance? I thought about
> adding something like a tip box to the docs that briefly explains how to deal
> with Unicode characters for the most popular editors. I'd like to add URLs for
> further information.

Vim documentation is the ultimate triumph of substance over style. In aggregate it contains every fact you could or would ever want to know about Vim. The problem is that it's almost impossible to find the information you want, and on the rare occasion you do, it is so dry and technical as to be a foreign language altogether.

There actually appear to be three general methods of entering unicode characters:

1. Digraphs
I mentioned these in my previous post. This is the easiest method to learn and remember. Documentation is here: <http://vimdoc.sourceforge.net/htmldoc/digraph.html>. The same documentation is available within Vim by typing ':h digraphs'. Type ':digraphs' for a list of digraphs.

2. Keymaps
Frankly, I've skimmed this topic a few times and can't make head nor tail of it. It claims unicode characters can be entered as combinations of other characters (sounds somewhat like digraphs but apparently is different). Documentation is here: <http://vimdoc.sourceforge.net/htmldoc/mbyte.html>. The same documentation is available within Vim by typing ':h multibyte'.

3. Direct entry
This is done by 'Ctrl-v u xxxx' where 'xxxx' is the hex number of a unicode character. Documentation is included in the multi-byte help at <http://vimdoc.sourceforge.net/htmldoc/mbyte.html#utf-8-typing> or within Vim by typing ':h utf-8-typing'.

Regards,
David.
From: Markus H. <mar...@mh...> - 2006-07-12 21:07:38
Hi David,

David Nebauer writes:
> From the 'ucs' package documentation:
>
>   The simplest use of this package is to add
>     \usepackage{ucs}
>     \usepackage[utf8x]{inputenc}
>   to your header. You may even omit the first line in many cases.
>
> It worked for me "out of the box". I installed the 'ucs' package
> (apt-get install latex-ucs), added those two lines to the preamble, ran
> 'latex test' and, presto, gloriously rendered unicode.

There seems to be something else for Unicode support. Ever heard of this? Is that supposed to work without installing additional files?

(cited from: http://mail.nl.linux.org/linux-utf8/2004-04/msg00000.html)
------
In mid February, the LaTeX project team released a new version that now supports UTF-8. For details, see

  ftp://ftp.tex.ac.uk/tex-archive/macros/latex/base/utf8ienc.dtx

You now can finally simply replace

  \usepackage[latin1]{inputenc}

with

  \usepackage[utf8]{inputenc}

and can this way move all your non-ASCII LaTeX input to UTF-8 as well.
------

--
Markus Hoenicka
mar...@ca... (Spam-protected email: replace the quadrupeds with "mhoenicka")
http://www.mhoenicka.de
From: Markus H. <mar...@mh...> - 2006-07-12 23:31:35
Hi David,

David Nebauer writes:
> latex2utf8txt converts LaTeX files to UTF-8 text, removes line
> breaks from paragraphs

I used parts of this script to hack a tex2mail replacement. It isn't more than a collection of regular expression substitutions which is to be used as a post-processing filter of bib2ris output. I've added it to the subversion repository, but it is not yet included in the build system. Please test it against your data and modify it as needed, or let me know what else should be covered. Either update your svn sources, or visit the svn web interface:

http://svn.sourceforge.net/viewcvs.cgi/refdb/refdb/trunk/scripts/refdb_latex2utf8txt?view=log

The script does the following:

- replace foreign characters encoded as {\..} constructs with their UTF-8 counterparts
- remove {\xy ...} commands, leaving only the enclosed text
- remove non-escaped curly brackets
- unescape escaped command characters: # $ % & ~ _ ^ \ { }
- convert the LaTeX dashes '--' and '---' to '-'

regards,
Markus

--
Markus Hoenicka
mar...@ca... (Spam-protected email: replace the quadrupeds with "mhoenicka")
http://www.mhoenicka.de
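For readers who don't want to dig through the script, the substitution steps listed above can be sketched in a few lines. This is a toy reimplementation with only a handful of accent mappings; the real refdb_latex2utf8txt covers far more constructs.

```python
import re

# A small sample of {\..} accent constructs; the real script maps many more.
ACCENTS = {
    r'{\"a}': "ä", r'{\"o}': "ö", r'{\"u}': "ü",
    r"{\'e}": "é", r"{\`e}": "è",
}

def latex_to_utf8(text):
    # 1. Replace foreign characters encoded as {\..} constructs.
    for construct, char in ACCENTS.items():
        text = text.replace(construct, char)
    # 2. Remove {\xy ...} commands, leaving only the enclosed text.
    text = re.sub(r"\{\\[a-zA-Z]+ +([^{}]*)\}", r"\1", text)
    # 3. Remove non-escaped curly brackets.
    text = re.sub(r"(?<!\\)[{}]", "", text)
    # 4. Unescape escaped command characters: # $ % & ~ _ ^ \ { }
    text = re.sub(r"\\([#$%&~_^{}\\])", r"\1", text)
    # 5. Convert the LaTeX dashes '---' and '--' to '-'.
    return text.replace("---", "-").replace("--", "-")

print(latex_to_utf8(r'M{\"o}ller, 100\% {\em done} -- almost'))
# Möller, 100% done - almost
```

The ordering matters: braces are stripped before the escapes are undone, so an escaped `\{` survives step 3 (thanks to the lookbehind) and only becomes a literal `{` in step 4.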
From: David N. <dav...@sw...> - 2006-07-06 23:58:18
Hi Markus,

> I'll set up a test case and see what happens. I recall it might be
> necessary to specify a default database anyway (i.e. use the -d switch
> of runbib) even if you specify a database in each citation. Does the
> problem persist if you set a default database?

Yes. I tried it with and without specifying a default database for 'runbib'.

Regards,
David.