[Doxygen-develop] RE: Doxygen and XML discussion... (was: Adding of new (all) HTML entities?)
From: Prikryl,Petr <PRI...@sk...> - 2001-07-23 10:53:33
Hi Xavier,

(Notice: I have just received new mail from doxygen-develop by Dimitri.
The notes below do not reflect his view -- they were written earlier.)

Xavier wrote...
> "Prikryl,Petr" wrote:
> > Xavier wrote...
> > > I would like to add new entities of HTML. I am currently
> > > interested in Greek letters, but everything that is defined in
> > > HTML 4.0 would be interesting.
[...]
> > There are three cases when doxygen is forced to work with character
> > entities:
> >
> > 1. The entity is written in the sources as the entity reference
> >    (i.e. &oslash;), and it is expected to be converted into another
> >    sequence (like "\o{}" for LaTeX or something else for RTF, ...),
> >    or into binary form.
>
> That was the case I am interested in. Very convenient if you work
> on an American keyboard and want to have French (or other)
> special characters.

If the output were XML, you could always use the known references to
character entities.

> > 2. The entity is written in binary form in the sources and it
> >    should be converted into a named character entity (like '&' to
> >    &amp;).
>
> Again, as far as I am concerned, I prefer using entities because they
> will be interpreted the same whatever charset is used. (Well, I hope.)

The "predefined" character entities that we are talking about are
unambiguous. However, you cannot be sure that every browser will
render them.

> > For the future, I wish XML were the only form produced by doxygen
> > (please read to the end before flaming me). Then the more general
> > approach should be taken.
>
> It's certainly a good idea .. if there is still the possibility
> (integrated in Doxygen) to have the current multiple outputs: HTML,
> LaTeX, and others.

Doxygen could call the XML tool internally for the generated XML
output. Users want the result. They do not care how the things are
implemented.
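Case 1 above (an entity reference in the sources, converted for the
output) amounts to a per-output lookup table. A minimal sketch in
Python, with a hypothetical three-entry table (a real converter would
cover the whole HTML 4.0 entity set):

```python
import re

# Hypothetical mini-table mapping entity names to LaTeX sequences;
# only three entries here, for illustration.
ENTITY_TO_LATEX = {
    "oslash": r"\o{}",    # o with slash
    "eacute": r"\'{e}",   # e with acute accent
    "amp":    r"\&",      # ampersand
}

def entities_to_latex(text):
    """Replace known &name; references by LaTeX escapes;
    unknown references are passed through untouched."""
    def repl(match):
        return ENTITY_TO_LATEX.get(match.group(1), match.group(0))
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", repl, text)

print(entities_to_latex("caf&eacute; &amp; &oslash;"))  # caf\'{e} \& \o{}
```

Passing unknown references through untouched is exactly the
"transparent" behaviour argued for below: entities the table does not
know survive for a later tool to resolve.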
> > In the XML output case in the future (and also in the HTML case
> > now), the easiest and most general solution would be to recognize
> > the syntax of any explicitly written entity (i.e. &oslash;) and
> > leave it untouched.
>
> I think this would be a good idea. Is there a single set of such
> entities? I don't mean the numerical entities, which are not
> especially convenient, even if they should also be available in case
> this idea is implemented. I mean a readable set like &eacute; -- OK,
> it will be in English, but it's more readable than &#233; :)

There is not a single set but several sets of definitions of the
entities. As the characters are assigned a unique binary
identification, the entities are defined unambiguously. They have ISO
definitions. A handy place to look for their names, Unicode numbers,
glyphs, and ISO descriptions is, for example, the on-line version of
"DocBook: The Definitive Guide" by Norman Walsh and Leonard Muellner.
Look for "III. DocBook Character Entity Reference".

> > The reason is that the entity may be either defined explicitly
> > (XML, SGML), or predefined by the DTD (HTML), or somehow else. Some
> > entities may be predefined only in some language supports.
> >
> > Doxygen should be transparent and pass the character entities
> > through as if they were any other word (with & and ;).
>
> This would be nice if you have only one output in XML or HTML, but
> anyway, you would need something to make the conversion for the
> other outputs.

This conversion can be done using XML tools.

> > Moreover, if a special binary character is used in the source, the
> > conversion process should generally consider three different
> > character encodings at the same time:
> >
> >   - source encoding
> >   - internal encoding of doxygen -- see the TranslatorXxxx classes
> >   - output encoding
> >
> > And also, one human language may use several possible encodings
> > (e.g. ISO, Windows, DOS, or some former national de facto standard).
> > This is really a nightmare.
> > If doxygen should work with everything on the input and everything
> > on the output, it would take a long time to solve the problems.
> >
> > Solution? XML!
>
> So you mean to use special character entities as a standard for
> Doxygen.

Not at all. Here, by "special (binary) character" I mean (informally)
any character that is not ASCII. While I can occasionally type some
&oslash; in my Czech text, I will always prefer the normal way of
entering special Czech characters (using a Czech keyboard). They are
entered into the file as bytes greater than (say) 127. Or possibly, I
may use a Unicode editor later. The text should still be readable --
even in the editor window. Writing my name like "Petr P&rcaron;ikryl"
should be really rare.

What I could say is that it should not matter whether you decide to
type a special character in binary form (in some encoding) or using a
named character entity. During the processing of XML documents, the
named character entities are replaced by the binary encoded characters
(Unicode), so there is no difference whether you type in a named
character entity reference (with the correct entity definition) or the
binary form (in the correct encoding context).

Unlike the XML tools, doxygen does not work internally with Unicode.
This is the reason, in my opinion, why general encoding conversion
support would be very difficult to implement.

> What about the \latexonly part in the source documentation?
> It must continue to work, in my opinion. If not, some users could
> be disappointed.

Well, I was a big fan of LaTeX earlier. These days I think that DocBook
XML is better for the future. Anyway, I think that it is not a big
problem to store the LaTeX source reliably inside the intermediate XML
document and extract the real LaTeX source from that XML doc.
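The claimed equivalence -- entity reference versus directly encoded
character -- can be illustrated with Python's built-in HTML entity
tables (assuming the standard entity name &rcaron; for the Czech
r-with-caron):

```python
from html import unescape

# An entity-reference form and a directly encoded (Unicode) form of
# the same name collapse to identical text once the document is parsed.
entity_form = unescape("Petr P&rcaron;ikryl")
binary_form = "Petr P\u0159ikryl"  # U+0159, LATIN SMALL LETTER R WITH CARON
print(entity_form == binary_form)  # True
```

An XML tool performs the same replacement while parsing, which is why
the author can stay indifferent to how the character was typed.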
> > Case 2: binary in sources, converted for the output
> > ===================================================
> >
> > If the source contains the special binary characters, then the HTML
> > (XML, SGML) output should be able to accept also the unchanged
> > binary form of the characters. In that case, the HTML (...) document
> > should explicitly say what character set is used:
> >
> >   <meta content="text/html; charset=windows-1250">
> >
> > For that purpose, the TranslatorXxxx classes implement the method
> > idLanguageCharset(), which is used to produce the above mentioned
> > meta tag in the HTML output.
>
> The charset should be specified in the input document in this case.
> Otherwise there could be some mismatch.

Exactly. Now, doxygen expects a single implicit encoding of the input
sources, a single (implicit) encoding of the internal strings, and a
single implicit encoding of the output. If only one input encoding were
used in all sources, then the encoding could be marked in the Doxyfile
(configuration). However, there should also be a possibility to change
the encoding on-the-fly -- i.e. some new doxygen tags. This way some
internal tables could be used to unify the input, internal, and output
encodings. The better approach would be to produce an intermediate XML
document with blocks of text with explicitly marked encodings (less
work for doxygen, no problem for XML tools).

> > Another question is how to produce a multi-language document.
> > I am afraid that this cannot be solved without XML output as the
> > final or intermediate format.
>
> In this case, there is one solution, I don't know if it's the only
> one: Unicode. With other charsets on 8 bits it's not possible
> to have a complete multiple-language facility.

Yes, exactly. Unless you can pass the problem to the XML tools. In
other words, if sources were stored using Unicode (or a similar)
encoding, no problem would emerge. However, this is not realistic.
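The unification of input, internal, and output encodings discussed
above can be sketched in Python, whose codecs go through Unicode in
exactly this way (the two charsets are chosen just for illustration):

```python
# Source encoding -> internal (Unicode) -> output encoding.
# Decoding at the input boundary and encoding at the output boundary
# keeps the internal processing independent of both charsets.
text = "P\u0159ikryl"                          # internal form, Unicode
src_bytes = text.encode("windows-1250")        # as a Windows source file
internal = src_bytes.decode("windows-1250")    # back to Unicode
out_bytes = internal.encode("iso-8859-2")      # e.g. a Latin-2 HTML output
print(out_bytes.decode("iso-8859-2") == text)  # True
```

A tool that works internally with Unicode gets this conversion almost
for free; one that does not (doxygen, at the time of writing) has to
carry its own conversion tables.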
The majority of USERS (and also tools) is not adapted to Unicode yet.
On the other hand, an XML document can be produced using normal 8-bit
character editors, with explicit tags marking the used encoding around
the blocks of text. The XML tools convert the 8-bit blocks of text into
Unicode internally, during the processing. Doxygen would not need to
care, if it produced XML output postprocessed by XML tools.

> > The conversion into other output formats (like RTF, LaTeX, PDF,
> > etc.) should be done by XML tools. Otherwise, doxygen would have to
> > implement all the character encoding conversions, character entity
> > reference conversions, and possibly other things that are already
> > implemented by the XML tools.
>
> OK, but they must be integrated into Doxygen and transparent for the
> user who prefers HTML to XML, even if this can be customized to have
> any output.

This would be ideal. But the integration does not necessarily mean
putting everything into one binary executable.

> > In my personal opinion, future doxygen should focus on producing
> > quality XML output ONLY -- possibly using more than one markup
> > (DocBook XML being the major one, possibly also some proprietary
> > XML, in a sense similar to producing RTF specific to some versions
> > of Microsoft Word). Focusing on a single output form can make
> > doxygen lighter, faster, and less buggy.
>
> Yes, but one interesting feature of Doxygen is precisely this
> multiple input/output. If using an intermediate XML output could be
> interesting, Doxygen should be able to provide the other outputs,
> even if it is done using other existing tools. And to use the
> multiple inputs. Currently only LaTeX and HTML, I believe.

Yes, they are the strong and weak points at the same time. The strong
point is that having one binary executable, you have everything you
need. The problem is that it generally works correctly only for English
and possibly for a few other languages. To make it work for e.g.
Czech would require some non-trivial internal changes to doxygen.
Because of that, I guess that there will always be a group of users who
are satisfied with doxygen as-is, and who would be against big changes
(the English-language users). They may claim that there is nothing big
to improve.

Another weak point is that the set of generated outputs is restricted
by doxygen. As far as I know, DocBook XML output could be used to
produce all the outputs that one gets from doxygen today. Similarly,
the input formats are restricted by the internal parser (basically
C/C++/Java) and only slightly extended via input filters.

Someone named XML "the ASCII of the future", and there is strong
evidence that the DocBook DTD will dominate in the area of writing
technical documentation. As far as I can imagine, there are no
limitations with respect to what doxygen can produce now and also in
the future. DocBook is said to be the second most widespread DTD (HTML
is the first one). However, HTML is rather more presentation-oriented.
In other words, HTML is used rather as the final form of documents.
DocBook, on the other hand, is better for capturing the structure of
the document, with all the necessary redundancy for postprocessing into
more output forms. LaTeX is somewhere in between. While its main plus
is the ability to capture the structure of the document, it is still
very special and oriented to producing the final output, including some
visual details that are not reliably convertible into other forms of
output.

> Now choices have to be made and precise specifications written in a
> separate document. Mails are not enough to discuss seriously.

Exactly. In my opinion, we should focus on the back-end at the
beginning.

--
Petr Prikryl, SKIL, spol. s r.o., pri...@sk...