Re: [Doxygen-develop] Adding of new (all) HTML entities?

Hello Petr and the others,

"Prikryl,Petr" wrote:

> Hi Xavier and others,
>
> Xavier wrote...
> > I would like to add new entities of HTML. I am currently interested
> > in Greek letters but all what is defined in HTML 4.0 would be
> > interesting.
>
> The character entities in HTML are interpreted (i.e. rendered and
> displayed) by HTML browsers.  The HTML document should not care what
> the character entity means.  If the character in an HTML document
> has binary form, then the browser always uses additional information
> to know how the character should be rendered -- the charset.
>
> In other words, it depends on HTML browser, whether the entity will
> be displayed as some special character, or not.
>
> So far, this is not related to doxygen at all.  However, HTML is not
> the only output format of doxygen.

It is related a little: Doxygen currently allows some of these entities
in the source input.

> There are three cases when doxygen is forced to work with character
> entities:
>
> 1. The entity is written in the sources as the entity reference
>    (i.e. &oslash;), and it is expected to be converted into other
>    sequence (like "\\o{}" for LaTeX or something else for RTF, ...),
>    or into binary form.

That is the case I am interested in. It is very convenient if you work
on an American keyboard and want to have French (or other)
special characters.
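
For illustration, the conversion in case 1 might be sketched as below.
This is only a minimal sketch and not Doxygen's code: the table
LATEX_FOR_ENTITY and the function entities_to_latex are hypothetical
names, and the table covers just a few entities.

```python
import re

# Hypothetical, deliberately tiny table: entity name -> LaTeX sequence.
LATEX_FOR_ENTITY = {
    "oslash": r"\o{}",
    "eacute": r"\'{e}",
    "Ccedil": r"\c{C}",
    "amp": r"\&",
}

# Matches a named entity reference such as &oslash;
ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def entities_to_latex(text):
    """Replace known entity references; leave unknown ones untouched."""
    def repl(match):
        # Fall back to the original text when the name is not in the table.
        return LATEX_FOR_ENTITY.get(match.group(1), match.group(0))
    return ENTITY_RE.sub(repl, text)
```

An entity that is missing from the table is left as written, so gaps in
the table stay visible in the output instead of being silently dropped.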

> 2. The entity is written in binary form in the sources and it should
>    be converted into a named character entity (like '&' to &amp;)

Again, as far as I am concerned, I prefer to use entities because they
will be interpreted the same way whatever charset is used. (Well, I hope.)

>
> 3. This case is not related to named character entities; but
>    basically it requires conversion of a binary character into some
>    special, output-form related text sequence or binary form, like
>    in the first case.
>
> Case 1: entity reference in the sources, specific output
> ========================================================
>
> Firstly, the character entity references are not specific to the
> HTML doctypes.  The entities are known also in SGML and XML (no
> wonder).  In all of the mentioned markup languages, they are
> referred to the same way: &identifier; Or they may have numeric form,
> related to Unicode characters (loosely speaking).
>
> XML and HTML output
> -------------------
>
> For the future, I wish XML were the only form produced by doxygen
> (please, read to the end before flaming me).  Then a more
> general approach should be taken.

It's certainly a good idea, provided Doxygen itself can still produce
the current multiple outputs: HTML, LaTeX and others.

> In the XML output case in the future (and also in the HTML case now),
> the easiest and the most general solution would be to recognize the
> syntax of any explicitly written entity (i.e. &oslash;) and leave it
> untouched.

I think this would be a good idea. Does a single set of such entities
exist? I am not speaking about numeric entities, which are not especially
convenient, even though they should also be available if this idea is
implemented. I mean a readable set like &eacute;. OK, the names will be
in English, but that is more readable than &#233; :)
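
As an aside on named versus numeric forms: given a table of names, the
two are interchangeable. The sketch below uses the HTML 4 name table
from Python's standard html.entities module purely as an illustration;
the helper to_numeric is a hypothetical name and has nothing to do
with Doxygen itself.

```python
import re
from html.entities import name2codepoint  # HTML 4 entity names -> code points


def to_numeric(text):
    """Rewrite known named references (&eacute;) as numeric ones (&#233;)."""
    def repl(match):
        name = match.group(1)
        if name in name2codepoint:
            return "&#%d;" % name2codepoint[name]
        return match.group(0)  # unknown name: leave it as written
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", repl, text)
```

The same table also contains the Greek letters, for example &alpha;.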

> The reason is that the entity may be either defined
> explicitly (XML, SGML), or predefined by DTD (HTML) or somehow else.
> Some entities may be predefined only in some language supports.
>
> Doxygen should be transparent and pass through the character
> entities as if it was any other word (with & and ;).

This would be nice if you had only one output, in XML or HTML,
but you would still need something to do the conversion for the
other outputs.

> The only
> characters that must be converted for XML, SGML, and HTML outputs
> should be &amp; and &lt; (and probably &gt; for the symmetry).
>
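
A minimal sketch of the pass-through rule quoted above, assuming that a
simple regular expression is enough to recognise an entity reference;
escape_keep_entities is a hypothetical name, not a Doxygen function.

```python
import re

# An '&' that does NOT start a named (&name;) or numeric
# (&#233; or &#xE9;) reference.
BARE_AMP = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
)


def escape_keep_entities(text):
    """Escape bare &, < and > but pass written entity references through."""
    text = BARE_AMP.sub("&amp;", text)
    return text.replace("<", "&lt;").replace(">", "&gt;")
```

With this rule "a < b && K&oslash;ge" becomes
"a &lt; b &amp;&amp; K&oslash;ge": the written entity survives and only
the bare characters are converted.
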
> LaTeX, RTF, and other specific outputs
> --------------------------------------
>
> The problem is that character entity references -- and let's include
> also binary characters from the 3rd case -- must be converted to the
> very special form required by the output format.
>
> Moreover, if a special binary character is used in the source, the
> conversion process should generally consider three different
> character encodings at the same time:
>
>  - source encoding
>  - internal encoding of doxygen -- see TranslatorXxxx classes
>  - output encoding
>
> And also, one human language may use several possible encodings
> (e.g. ISO, Windows, DOS, or some former national de facto standard).
> This is a real nightmare.  If doxygen should work with everything on
> the input and everything on the output, it would take a long time to
> solve the problems.
>
> Solution? XML!

So you mean to use character entities as the standard way of writing
special characters for Doxygen.
What about the \latexonly parts in the source documentation?
In my opinion they must continue to work. If not, some users
could be disappointed.

> Case 2: binary in sources, converted for the output
> ===================================================
>
> If the source contains the special binary characters, then the HTML
> (XML, SGML) output should be able to accept also the unchanged
> binary form of the characters. In that case, the HTML (...) document
> should explicitly say what character set is used.
>
>    <meta http-equiv="Content-Type" content="text/html; charset=windows-1250">
>
> For that purpose, the TranslatorXxxx implements the method
> idLanguageCharset() which is used to produce the above mentioned
> meta tag in the HTML output.

In this case the charset should be specified in the input document.
Otherwise there could be a mismatch.

> Again, Doxygen should be transparent and pass through the character
> entities as if it was any other word (with & and ;). The only
> characters that must be converted for XML, SGML, and HTML outputs
> should be &amp; and &lt; (and probably &gt; for the symmetry).
>
> Another question is how to produce a multi-language document.
> I am afraid that this cannot be solved without XML output as the
> final or intermediate format.

In this case there is one solution, though I don't know if it's the
only one: Unicode. With an 8-bit charset it is not possible to
support multiple languages completely.
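
A sketch of why Unicode helps here, using Python's standard codecs only
as an illustration. The function name transcode and the charsets
mentioned in the note (windows-1250 in, ISO-8859-1 out) are assumptions
made for the example.

```python
def transcode(raw, source_charset, output_charset):
    """Decode source bytes to Unicode, then encode them for the output.

    Characters that the output charset cannot represent become numeric
    character references instead of being lost.
    """
    text = raw.decode(source_charset)  # source encoding -> Unicode
    return text.encode(output_charset, errors="xmlcharrefreplace")
```

For instance, the name "Dvořák" stored as windows-1250 bytes and written
out as ISO-8859-1 keeps its "á" as a single byte, while the "ř", which
ISO-8859-1 lacks, comes out as &#345;.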

> Conclusion: XML output is the answer!
> =====================================
>
> Trying to convert every special binary character to its character
> entity reference or vice versa -- with respect to the output format
> and input/output character encodings -- would require much work and
> is not acceptable with respect to the future, and with respect to
> the work already done in XML and similar fields.
>
> The conversion into other output formats (like RTF, LaTeX, PDF,
> etc.) should be done by XML tools. Otherwise, doxygen would have
> to implement all the character encoding conversions, character
> entity reference conversions, and possibly other problems that are
> already implemented by the XML tools.

OK, but they must be integrated into Doxygen and transparent for
the user who prefers HTML to XML, even if the XML can be customized
to produce any output.

> In my personal opinion, future doxygen should focus on producing
> quality XML output ONLY -- possibly using more than one markup
> (DocBook XML being the major one, possibly also some proprietary XML in
> the sense similar to producing RTF specific for some versions of
> Microsoft Word).  Focusing on a single output form can make doxygen
> lighter, faster, containing fewer bugs.

Yes, but one interesting feature of Doxygen is precisely this
multiple input/output. Using an intermediate XML output could
be interesting, but Doxygen should still be able to provide the other
outputs, even if that is done using other existing tools, and to accept
the multiple inputs (currently only LaTeX and HTML, I believe).

> [...]
> See you,
>   Petr
>
> P.S. I may be wrong.  But I am always ready to listen to your arguments.
>
>      Do not multiply [character] entities.

I agree from the long-term perspective. I simply proposed this because
that's the way it is currently. I already asked for &Ccedil; to be added;
yes, it's because of me that it is in doxygen. :) I thought that I could
help to do the rest (what exists in HTML 4.0). I would not be able to do
what you describe.

But indeed your idea, from the _user_ point of view, is similar to
mine. It's a kind of generalisation of what exists already.
Outside \latexonly or \htmlonly sections, where LaTeX or HTML markup
is written directly, the only way to write special characters
in Doxygen documentation would be to use entities.

Now choices have to be made and precise specifications written in
a separate document. Mail is not enough for a serious discussion.
I may have time to review it, but not to write it or to develop it. :(

Greetings,

Xavier.
--
 Artificial Anthill Project
  http://www.aanthill.org/
  mailto:aan...@aa...

 D2SET Non Profit Association
  http://www.d2set.org/
  mailto:d2...@d2...