Re: [DB2LaTeX-devel] TeXML, the XML vocabulary for TeX

  Hi!

On Thu, 25 Mar 2004 13:43:51 +0100
Torsten Bronger <br...@ph...> wrote:

> Halloechen!
> 
> Oleg Paraschenko <ol...@da...> writes:
> 
> > On Thu, 25 Mar 2004 12:42:49 +0100
> > Torsten Bronger <br...@ph...> wrote:
> >
> >> [...]  Interesting, but how is it implemented?  In XSLT, or a
> >> scripting language, or what?
> >
> > It is implemented in the Python scripting language.
> 
> I don't know Python.  How easy can this be installed on a Windows
> system?

  It should not be a problem. You can download Python from
http://www.python.org/download/ , install it, and run scripts from
the command line. For example:

d:\python23\python.exe texml.py -e ascii test.xml test.tex

> 
> > It uses only core Python modules (expat XML parser, unicode
> > database, something other), so it should work on any recent
> > system. Mapping from Unicode characters to LaTeX commands is taken
> > from attachment for the MathML specification
> > (http://www.w3.org/Math/characters/unicode.xml (note: 1,5 Mb)).
> 
> And is it mode-aware?

  Yes, it is mode-aware. It knows text and math.

> Does an alpha become \alpha in formulae and a
> Greek letter elsewhere?

  I tested it and found that in both modes the result is "\alpha ". (Or the
letter alpha itself if the output is in a Greek encoding. I consider this OK
because I see very little difference between "$\alpha $" and "$a$")
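
  The encoding-dependent behaviour can be sketched roughly like this. It
is only an illustration, not the actual TeXML code: the one-entry table,
the function name escape_char and the numeric-reference fallback for
unmapped characters are assumptions made for the example.

```python
# Illustrative table only; the real processor derives its table from
# the MathML unicode.xml file.
TEXT_COMMANDS = {"\u03b1": "\\alpha "}

def escape_char(ch, encoding):
    """Return ch itself if the output encoding can represent it,
    otherwise a LaTeX command from the table (hypothetical helper)."""
    try:
        ch.encode(encoding)
        return ch
    except UnicodeEncodeError:
        # Assumption: unmapped characters are emitted as a numeric
        # character reference.
        return TEXT_COMMANDS.get(ch, "&#x%X;" % ord(ch))

print(escape_char("\u03b1", "ascii"))       # LaTeX command
print(escape_char("\u03b1", "iso-8859-7"))  # the Greek letter itself
```

  With "ascii" the alpha has to become a command; with a Greek encoding
such as ISO 8859-7 it can be written as it is.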

>  What about ligatures like "--"?  Is this an
> en-dash or two hyphens?

  Ligatures in TeX are a property of the font, not of the document, and
the TeXML processor can't guess which font will be used, so the processor
ignores ligatures altogether. As a result, "--" in TeXML is translated
into "--" in TeX, which is interpreted as an en-dash. At the time of
development I considered this correct behaviour. Now I'm changing my mind
and adding the handling of "--" and "---" to the list of bugs. In any case,
I don't plan to break ligatures like "fi", "fl" etc.
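
  One possible way to handle this (a sketch only; how the fix will really
be done in the processor is not decided, and the function name is made up)
is to put an empty group between consecutive hyphens, so that TeX does not
join them into a dash:

```python
import re

def break_dash_ligatures(text):
    """Insert an empty TeX group after every hyphen that is directly
    followed by another hyphen (hypothetical helper)."""
    return re.sub(r"-(?=-)", "-{}", text)

print(break_dash_ligatures("a -- b"))
print(break_dash_ligatures("a --- b"))
```

  Single hyphens and letter ligatures such as "fi" are left untouched,
because only hyphen-hyphen pairs are matched.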

>  What about typographic things like thin
> spaces, soft hyphens, zero-width non-joiner and "break permitted
> here"?  How much of Unicode is covered yet?

  There are two translation tables, one for text mode and another for
math mode. There are 2361 symbols for text mode and 195 symbols for math
mode (math mode falls back to the text-mode table if a symbol is not found).
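
  The lookup with the fallback works roughly as in the following sketch.
The table entries and the names TEXT_MAP, MATH_MAP and lookup are invented
for the illustration; the real tables are generated from unicode.xml and
may contain different commands.

```python
# Tiny made-up tables standing in for the real ones.
TEXT_MAP = {
    "\u03b1": "\\alpha ",
    "\u00ad": "\\-",
    "\u00d7": "\\texttimes ",
}
MATH_MAP = {
    "\u00d7": "\\times ",
}

def lookup(ch, mode):
    """Return the TeX replacement for ch in the given mode
    ("text" or "math"), or None if the character is not mapped."""
    if mode == "math" and ch in MATH_MAP:
        return MATH_MAP[ch]
    # Math mode reuses the text table when it has no entry of its own.
    return TEXT_MAP.get(ch)

print(lookup("\u00d7", "text"))
print(lookup("\u00d7", "math"))
print(lookup("\u03b1", "math"))  # found only in the text table
```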

  For the typographic things you mention, here is a test:

| TeXML:
|
| <TeXML>&#x3B1;<math>&#x3B1;</math>
| thin space:            [&#x2009;]
| soft hyphens:          [&#xAD;]
| zero-width non-joiner: [&#8204;]  oops here ...
| break permitted here:  [&#x82;]    ... and here
| </TeXML>
|
| TeX:
| \alpha $\alpha $
| thin space:            [\hspace{0.167em}]
| soft hyphens:          [\-]
| zero-width non-joiner: [&#x200C;]  oops here ...
| break permitted here:  [&#x82;]    ... and here

  As we see, not all characters are mapped. If that is an issue, then it is
an issue for the maintainers of the Unicode map of the MathML specification.
After they acknowledge and fix a problem, the TeXML processor will also be
updated.

> 
> >> How fast is it (I'm not prepared to accept a further significant
> >> drop down in speed)?
> >
> > It is hard to said exactly, but I think it is fast. In any case,
> > it should be faster then processing of specials by xslt.
> 
> Okay; I asked because using it would mean to translate
> XML--XML-->text instead of XML-->text-->filter-->text, where
> "filter" is *very* fast.  But faster than XSLT may be enough.
> 
> >> How are different \usepackage[???]{inputenc}'s dealt with?
> >
> > The processor does not know about \usepackage, it only translates
> > characters. It is a task of an xslt to insert \usepackage command into
> > the output, if required.
> 
> So I always have to include things like wasy, pifont, textcomp etc?
> Wouldn't be a problem, I just need a complete list.

  Maybe I don't understand the question well, so please repeat it if I
haven't answered it. The TeXML processor does not add anything. So
(hypothetically), if the processor generates "\alpha" and the use of
"\alpha" in a TeX document requires a package "greekfont", you will
probably get an error from LaTeX. I have no good solution yet.

> 
> > User can specify an output encoding. The processor attempts to make as
> > good translation as possible for it.
> 
> Sounds nice.  Are you aware of the very new utf-8 that was added to
> the LaTeX core two months ago?  How good does it work?

  I don't know yet whether it works well. One of the problems is that Unicode
by itself is not enough. There are right-to-left languages, dynamic ligatures
and other issues, so I'm investigating Omega/Lambda rather than the LaTeX core.

> 
> Tschoe,
> Torsten.
> 
> -- 
> Torsten Bronger, aquisgrana, europa vetus
> 

  Bye!

--
Oleg