Re: [Refdb-devel] latex bibliographies with multiple databases

Hi David,

David Nebauer <dav...@sw...> was heard to say:

> If you instead go with my idea to store as unicode you don't need to
> know anything about the eventual output format when you store the
> reference.  Indeed, the user doesn't have to know at that time.  The
> same references can be used for either DocBook or LaTeX.  You can easily
> add in other output formats later and all you have to do is write
> another output filter.
>

I'm with you all the way here. I just wanted to hear from real-world LaTeX
users whether it makes sense to preserve the markup.

> Your point is true but I say it is a small loss.  Using LaTeX formatting
> codes means your references can never be used for any other format
> without hacking in some kind of conversion.  RefDB is designed to be a
> long-term reference database enabling the contained references to be
> used all kinds of interesting ways.  Use of format-specific markup
> limits your future choices.  As a minor example it prevents their use in
> DocBook documents.

True, but I assumed that only those who use RefDB solely for LaTeX might want
to keep the markup.

>
> Another issue is the ability of library and indexing systems to handle
> such formatting complexities as superscripting, subscripting and font
> changes.  You know far more about such things than I, but I would guess
> even the most complex article title is reduced to canonical ascii for
> storage in many cataloguing systems.  I presume the algorithms for such
> simplification are fairly predictable.  Anyone searching for the journal
> article by title would be easily able to predict the stored character
> sequence.  I would endeavour to suggest the simplified form of title
> would be entirely acceptable in any kind of bibliography.
>
> In any event, how would such a complex title be stored in plain ascii?
> Or Unicode?  Or even XML (imagine the attempt to use MathML in a title
> string!)?
>

The database I use most (www.pubmed.org) does indeed "ascii-ize" the titles.
The tagged format uses plain ASCII with a pretty crude transliteration, whereas
the XML format uses Unicode.

> As mentioned above, I am unconvinced about the utility of keeping
> boldface, italics, superscript and subscript-type markup.  As for
> foreign characters, almost any foreign character can be represented in

I'm afraid I didn't express my thoughts very well here. What I was talking about
is that a reference imported from bibtex may contain markup like

"Title with an {\bf emphasized} word"

It is not sufficient to escape characters; we also have to remove the "{\bf "
and the "}" sequences before we import the reference. This is what tex2mail and
one of the scripts that you pointed me to do.
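
To make the idea concrete, here is a minimal sketch of such a stripping step.
It is not taken from tex2mail or bib2ris; the function name strip_font_markup
and the list of font switches are my own assumptions, and it only handles the
old-style "{\bf ...}" form shown above.

```python
import re

# Hypothetical helper, not part of bib2ris or tex2mail.
# Matches an innermost group such as "{\bf emphasized}" or "{\it word}".
# The set of font switches listed here is an assumption and may be incomplete.
FONT_GROUP = re.compile(r"\{\\(?:bf|it|em|sl|sc|tt|rm)\s+([^{}]*)\}")


def strip_font_markup(text):
    """Remove old-style font switches but keep the text they enclose."""
    previous = None
    while previous != text:  # repeat so that nested groups are unwrapped too
        previous = text
        text = FONT_GROUP.sub(r"\1", text)
    return text


print(strip_font_markup(r"Title with an {\bf emphasized} word"))
# prints: Title with an emphasized word
```

Braces that protect capitalization, like "{DNA}", are left alone by this
sketch and would need a separate rule.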

>     To allow attribute values to contain both single and double quotes,
>     the apostrophe or single-quote character (') may be represented as
>     "&apos;", and the double-quote character (") as "&quot;".
>
>
> The relevant portion states, "The right angle bracket (>) *may* be
> represented using the string '&gt;'," but "*must*, for compatibility, be
> escaped using '&gt;' or a character reference when it appears in the
> string ']]>'." (emphases mine)
>
> The last paragraph in the quote refers to straight single and double
> quotation mark entities.
>

But that passage appears to talk about attribute values. XML output from RefDB
never puts quotes into attribute values, so we're left with &, <, and >.
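
As an illustration only (RefDB itself is not written in Python, and the sample
title is made up), the Python standard library happens to draw the same line:
xml.sax.saxutils.escape touches just these three characters by default and
leaves quotes alone.

```python
from xml.sax.saxutils import escape

# Made-up sample title containing all the characters under discussion.
title = 'Effects of A & B when x < y, "quoted"'

# By default only &, < and > are replaced; quotes pass through unchanged.
print(escape(title))
# prints: Effects of A &amp; B when x &lt; y, "quoted"
```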

> It worked for me "out of the box".  I installed the 'ucs' package
> (apt-get install latex-ucs), added those two lines to the preamble, ran
> 'latex test' and, presto, gloriously rendered unicode.
>

This is great news indeed. I will have to mention this in the manual.

I take from this discussion:

1) Use a bib2ris post-processing script (or rewrite bib2ris to contain such
code) which strips markup like boldface, superscript etc. and translates
foreign characters entered as LaTeX constructs to their Unicode equivalents.

2) Modify the code to prevent XML entities from showing up in LaTeX output.

3) Add code to escape the LaTeX command characters in the LaTeX output.
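
For point 3, a rough sketch of the escaping step could look like the following.
The name escape_latex and the replacement table are assumptions on my part, not
existing RefDB code. The substitution is done in a single pass so that the
backslashes introduced by one replacement are not escaped again by another.

```python
import re

# Hypothetical mapping of LaTeX command characters to safe replacements.
LATEX_SPECIALS = {
    "\\": r"\textbackslash{}",
    "{": r"\{",
    "}": r"\}",
    "$": r"\$",
    "&": r"\&",
    "#": r"\#",
    "%": r"\%",
    "_": r"\_",
    "^": r"\textasciicircum{}",
    "~": r"\textasciitilde{}",
}

# One character class covering every key in the table above.
SPECIALS_RE = re.compile("[" + re.escape("".join(LATEX_SPECIALS)) + "]")


def escape_latex(text):
    """Escape LaTeX command characters in one pass over the string."""
    return SPECIALS_RE.sub(lambda m: LATEX_SPECIALS[m.group(0)], text)


print(escape_latex("50% of A & B_1"))
# prints: 50\% of A \& B\_1
```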

The second point is a bit tricky. References imported from RIS usually do not
contain entities, but references imported from risx are likely to. Either I
convert these entities during import, or I remove them during LaTeX export. The
former seems cleaner to me, and I think this is what you had in mind.
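
A sketch of the convert-on-import variant, again only as an illustration: the
function name decode_entities and the sample string are made up. Python's
html.unescape decodes the predefined XML entities and numeric character
references; it also accepts HTML named entities, which goes beyond what plain
XML defines, so a real importer might want something stricter.

```python
import html


def decode_entities(text):
    """Turn entities and character references into plain Unicode text."""
    return html.unescape(text)


print(decode_entities("M&#252;ller &amp; S&#xF8;n: p &lt; 0.05"))
# prints: Müller & Søn: p < 0.05
```

With the text stored as plain Unicode, the LaTeX and XML output filters can each
apply their own escaping without having to undo the other's.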

regards,
Markus

-- 
Markus Hoenicka
mar...@ca...
(Spam-protected email: replace the quadrupeds with "mhoenicka")
http://www.mhoenicka.de