RE: [Doxygen-develop] Adding of new (all) HTML entities?

Hi Xavier and others,

Xavier wrote...
> I would like to add new entities of HTML. I am currently interested
> in Greek letters but all what is defined in HTML 4.0 would be
> interesting.

Character entities in HTML are interpreted (i.e. rendered and
displayed) by the HTML browser.  The HTML document itself need not
care what a character entity means.  If a character appears in an
HTML document in binary form, the browser always needs additional
information to know how it should be rendered -- the charset.

In other words, it depends on the HTML browser whether an entity
will be displayed as some special character or not.

So far, this is not related to doxygen at all.  However, HTML is not 
the only output format of doxygen.

There are three cases when doxygen is forced to work with character
entities:

1. The entity is written in the sources as an entity reference
   (e.g. &oslash;), and it is expected to be converted into another
   sequence (like "\o{}" for LaTeX or something else for RTF, ...),
   or into binary form.

2. The entity is written in binary form in the sources and it should
   be converted into a named character entity (like '&' to &amp;)

3. This case is not related to named character entities, but it
   requires converting a binary character into some special,
   output-format-specific text sequence or binary form, as in the
   first case.

Case 1: entity reference in the sources, specific output 
========================================================

Firstly, character entity references are not specific to the HTML
doctypes.  Entities are also known in SGML and XML (no wonder).  In
all of the mentioned markup languages, they are referred to the same
way: &identifier;  They may also have a numeric form (e.g. &#248;),
related to Unicode characters (loosely speaking).
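
To make the named and numeric forms concrete, here is a small sketch
in Python (chosen only for illustration; doxygen itself is not written
in Python).  It relies on the standard html module, whose entity table
is the HTML one, so it says nothing about entities declared in an SGML
or XML DTD.

```python
import html
from html.entities import name2codepoint

# Three spellings of the same character, U+00F8 (o with stroke).
named = "&oslash;"
decimal = "&#248;"
hexadecimal = "&#xF8;"

# html.unescape() resolves all three forms to the same Unicode character.
resolved = [html.unescape(ref) for ref in (named, decimal, hexadecimal)]

# The named form is only an alias for the code point.
code_point = name2codepoint["oslash"]
```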

XML and HTML output
-------------------

For the future, I wish XML were the only form produced by doxygen
(please read to the end before flaming me).  In that case, a more
general approach should be taken.

In the future XML output case (and also in the HTML case now), the
easiest and most general solution would be to recognize the syntax
of any explicitly written entity (e.g. &oslash;) and leave it
untouched.  The reason is that the entity may be defined explicitly
(XML, SGML), predefined by the DTD (HTML), or defined in some other
way.  Some entities may be predefined only in some of these languages.

Doxygen should be transparent and pass character entities through
as if they were any other word (with & and ;).  The only characters
that must be converted for XML, SGML, and HTML output are & and <
(to &amp; and &lt;), and probably > (to &gt;) for symmetry.
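
A minimal sketch of such a pass-through, written in Python purely for
illustration.  The function name escape_markup and the regular
expression are my assumptions, not doxygen code; the pattern accepts
anything shaped like &name; or a numeric reference without checking
whether the name is actually defined anywhere.

```python
import re

# Anything shaped like an entity reference: &name; &#123; or &#x1F;
_ENTITY = re.compile(r"&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);")


def escape_markup(text: str) -> str:
    """Escape &, < and >, but leave entity-shaped words untouched."""
    out = []
    pos = 0
    while pos < len(text):
        ch = text[pos]
        if ch == "&":
            match = _ENTITY.match(text, pos)
            if match:
                # Looks like an entity reference: copy it through as is.
                out.append(match.group(0))
                pos = match.end()
                continue
            out.append("&amp;")
        elif ch == "<":
            out.append("&lt;")
        elif ch == ">":
            out.append("&gt;")
        else:
            out.append(ch)
        pos += 1
    return "".join(out)
```

The price of this transparency is that a literal "&copy;" typed in a
comment can no longer be distinguished from an intended entity.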

LaTeX, RTF, and other specific outputs
--------------------------------------

The problem is that character entity references -- and let's include
also binary characters from the 3rd case -- must be converted to the
very special form required by the output format.  
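
For LaTeX, such a conversion boils down to a lookup table.  The sketch
below (Python, illustration only) uses a deliberately tiny,
hypothetical table; the names LATEX_FOR_ENTITY and entities_to_latex
are mine, and a real table would need one entry per supported entity.
Unknown entities are left as they are, which LaTeX would not render
correctly.

```python
import re

# Hypothetical, incomplete mapping from entity names to LaTeX sequences.
LATEX_FOR_ENTITY = {
    "oslash": r"\o{}",
    "alpha": r"$\alpha$",
    "amp": r"\&",
    "lt": r"$<$",
    "gt": r"$>$",
}

_NAMED_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def entities_to_latex(text: str) -> str:
    """Replace known named entity references by their LaTeX sequence."""

    def replace(match):
        # Unknown names fall back to the original, unconverted reference.
        return LATEX_FOR_ENTITY.get(match.group(1), match.group(0))

    # A function is used as the replacement so that backslashes in the
    # table are inserted literally.
    return _NAMED_ENTITY.sub(replace, text)
```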

Moreover, if a special binary character is used in the source, the
conversion process should generally consider three different
character encodings at the same time:

 - source encoding
 - internal encoding of doxygen -- see TranslatorXxxx classes
 - output encoding

Also, one human language may use several possible encodings
(e.g. ISO, Windows, DOS, or some former national de facto standard).
This is a real nightmare.  If doxygen had to work with everything on
the input and everything on the output, it would take a long time to
solve the problems.
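
To show why the three encodings matter, here is a sketch in Python
(illustration only).  Python's own string type plays the role of the
internal encoding; the function name transcode is an assumption of
mine.  The same Czech word has different bytes in Windows-1250 and
ISO-8859-2, and characters the output encoding cannot represent are
written as numeric character references.

```python
def transcode(raw: bytes, source_encoding: str, output_encoding: str) -> bytes:
    """Convert bytes from the source encoding to the output encoding."""
    # Source bytes -> internal form (a Python str here).
    text = raw.decode(source_encoding)
    # Internal form -> output bytes; anything the output encoding
    # cannot represent becomes a numeric character reference.
    return text.encode(output_encoding, errors="xmlcharrefreplace")


# The Czech word "prilis" with its diacritics (r-caron, i-acute, s-caron).
word = "P\u0159\u00edli\u0161"
as_windows = word.encode("cp1250")
as_iso = transcode(as_windows, "cp1250", "iso8859-2")
as_ascii = transcode(as_windows, "cp1250", "ascii")
```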

Solution? XML!

Case 2: binary in sources, converted for the output
===================================================

If the source contains special binary characters, then the HTML
(XML, SGML) output should also be able to accept the unchanged
binary form of the characters.  In that case, the HTML (...) document
should explicitly say what character set is used.

   <meta http-equiv="Content-Type" content="text/html; charset=windows-1250">

For that purpose, each TranslatorXxxx class implements the method
idLanguageCharset(), which is used to produce the above-mentioned
meta tag in the HTML output.
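
Roughly, the generation of that tag could look like the following
Python sketch.  This is not doxygen's code: the function
charset_meta_tag is hypothetical, and its argument stands for whatever
idLanguageCharset() returns.  The sketch writes the
http-equiv="Content-Type" attribute, which is what HTML 4 uses to
declare the charset this way.

```python
import html


def charset_meta_tag(charset: str) -> str:
    """Build the HTML 4 style meta tag that declares the charset."""
    content = f"text/html; charset={charset}"
    # Escape the attribute value in case the charset name is odd.
    escaped = html.escape(content, quote=True)
    return f'<meta http-equiv="Content-Type" content="{escaped}">'


tag = charset_meta_tag("windows-1250")
```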

Again, doxygen should be transparent and pass character entities
through as if they were any other word (with & and ;).  The only
characters that must be converted for XML, SGML, and HTML output are
& and < (to &amp; and &lt;), and probably > (to &gt;) for symmetry.

Another question is how to produce a multi-language document.
I am afraid that this cannot be solved without XML output as the 
final or intermediate format.  

Conclusion: XML output is the answer!
=====================================

Trying to convert every special binary character to its character
entity reference or vice versa -- with respect to the output format
and input/output character encodings -- would require a lot of work.
It is not acceptable with respect to the future, nor with respect to
the work already done in XML and related fields.

The conversion into other output formats (like RTF, LaTeX, PDF,
etc.) should be done by XML tools.  Otherwise, doxygen would have
to implement all the character encoding conversions and character
entity reference conversions, and possibly solve other problems that
are already solved by the XML tools.

In my personal opinion, future doxygen should focus on producing
quality XML output ONLY -- possibly using more than one markup
(DocBook XML being the major one, possibly also some proprietary XML,
in a sense similar to producing RTF specific to some versions of
Microsoft Word).  Focusing on a single output form can make doxygen
lighter, faster, and less buggy.

In XML, one can say that a block of text uses a certain character
encoding -- no problem with multiple character encodings in one
document and even one file -- which would be great for doxygen:

 * The block of text from input can be passed without problems
   in binary form to the output.
 * The sentences generated by doxygen can use some internal
   encoding without any conversion.
 * If the character is written as a known entity reference, then it
   is always unique and can be passed without problems.
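
The last point can be illustrated with numeric character references.
In the Python sketch below (the helper name to_char_refs is mine), a
non-ASCII character is rewritten as a reference; the result is pure
ASCII, so its bytes are the same whichever encoding the surrounding
block of text declares.

```python
def to_char_refs(text: str) -> str:
    """Rewrite every non-ASCII character as a numeric character reference."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


# "Bjorn" spelled with o-with-stroke (U+00F8).
name = to_char_refs("Bj\u00f8rn")

# The reference form has the same bytes in both encodings.
same_bytes = name.encode("cp1250") == name.encode("iso8859-1")
```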

In the best case, doxygen could additionally call external XML tools
to produce other output forms (technically the same way as 'dot' is
called these days -- "the user need not know about it").

For backward compatibility, the other directly produced output
formats (i.e. the outputs produced these days) should stay supported
for a while, but their development should be frozen as soon as the
same or very similar results can be produced using the XML output and
XML tools.

See you,
  Petr

P.S. I may be wrong.  But I am always ready to listen to your arguments.

     Do not multiply [character] entities.
                                            Occam ;-)
-- 
Petr Prikryl, SKIL, spol. s r.o., pri...@sk...