[Doxygen-develop] Doxygen's XML and encoding problems -- Translators Revisited (was : setEncoding ?)

(was "setEncoding ?") 
(The "Translator Revisited" working document is attached.
If you participate in doxygen XML development, please
read it and discuss.) 

Hi Haru and other doxygeners,

Haruyuki Ohtani asked...
> 
> I think the latest doxygen generates ISO-8859-1 encoding
> for XML output. But it does not seem to call
> QTextStream::setEncoding(QTextStream::Latin1) in
> genarateXML function. Is it correct ?

I am temporarily too busy to participate in the
development; however, I have thought about how the
encoding problems should be solved.  That was in the
pre-XML era.  The result of the analysis was that, if
encoding problems are to be avoided for every language,
either more than one encoding has to be supported at the
same time during processing, or some kind of
language-independent (i.e. encoding-independent)
solution has to be used.  See the explanation below.

In other words, if the problem is being touched now, it
would be nice to solve it carefully now.  The existence
of XML can make this easier, but it does not solve the
problem automatically.

Firstly, as far as I know, an XML file can declare the
encoding being used, but only on a per-file basis,
like:

 <?xml version="1.0" encoding="ISO-8859-2" ?>

Secondly, if the content of the file is a mixture of
texts taken from doxygen comments (in the processed
source files to be documented) and generated texts
(produced by the Translator classes), then the encoding
must be unified while the document is composed.  Either
the input sources have to be converted to the encoding
of the strings produced by the instantiated Translator
class, or the Translator object has to convert its
results.
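
To make the "convert to one encoding" step concrete,
here is a minimal sketch.  It assumes UTF-8 as the common
target encoding and ISO-8859-1 as the encoding of the
Translator strings; both choices, and the names
latin1ToUtf8 and composeParagraph, are only illustrative
assumptions and not part of doxygen.

```cpp
#include <string>

// Convert ISO-8859-1 bytes to UTF-8. Every Latin-1 byte value equals its
// Unicode code point, so bytes below 0x80 are copied unchanged and all
// other bytes become a two-byte UTF-8 sequence.
std::string latin1ToUtf8(const std::string &in)
{
    std::string out;
    out.reserve(in.size() * 2);
    for (unsigned char c : in) {
        if (c < 0x80) {
            out += static_cast<char>(c);
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}

// Hypothetical composition step: the generated text (assumed Latin-1) is
// converted before it is joined with comment text that is assumed to be
// in the target encoding (UTF-8) already.
std::string composeParagraph(const std::string &translatorTextLatin1,
                             const std::string &commentTextUtf8)
{
    return latin1ToUtf8(translatorTextLatin1) + " " + commentTextUtf8;
}
```

Other 8-bit encodings such as ISO-8859-2 would need a
lookup table instead of the simple bit shuffling above,
because their upper half does not map one-to-one to the
same Unicode code points.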

The instantiation of the Translator class depends on
the chosen (human) language, and the encoding of the
generated strings depends on the decision of the
language maintainer who prepared the translated text.

The encoding of the input sources depends not only on
the human language used in the comments, but possibly
also on the OS being used.  For example, the default
encoding in MS Windows may differ from the default
encoding used in Unix.  This is why another Doxyfile
option should be introduced -- the input source
encoding.

It is likely that some environments will start to use
Unicode instead of the older 8-bit encodings (which is
a similar problem).  Because of this, some sources may
use a different encoding than others.  This is also why
another doxygen command should be introduced to define
the encoding of a source file explicitly, used like
this:

/*! \file xyz.cpp \encoding iso-8859-2 \brief Xxxxx... */

This could be used directly to set the encoding in the
declaration of the generated XML file, as shown above.
Note that this follows the approach used by XML tools
and files.
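
A rough sketch of how the proposed command could feed
the XML declaration follows.  The \encoding command is
only the proposal from this mail, and the function names
encodingFromComment and xmlDeclaration are made up for
the illustration; nothing here is existing doxygen code.

```cpp
#include <cctype>
#include <string>

// True for blanks, tabs, and newlines (the cast avoids undefined
// behaviour for negative char values).
static bool isBlank(char c)
{
    return std::isspace(static_cast<unsigned char>(c)) != 0;
}

// Return the argument of the proposed (hypothetical) \encoding command
// found in a comment block, or the fallback when the command is absent.
std::string encodingFromComment(const std::string &comment,
                                const std::string &fallback)
{
    const std::string command = "\\encoding";
    std::string::size_type pos = comment.find(command);
    if (pos == std::string::npos) return fallback;
    pos += command.size();
    // The command must be followed by whitespace and then its argument.
    if (pos >= comment.size() || !isBlank(comment[pos])) return fallback;
    while (pos < comment.size() && isBlank(comment[pos])) ++pos;
    std::string::size_type end = pos;
    while (end < comment.size() && !isBlank(comment[end])) ++end;
    if (end == pos) return fallback;
    return comment.substr(pos, end - pos);
}

// Build the XML declaration for the generated file. The version
// attribute has to come before the encoding attribute.
std::string xmlDeclaration(const std::string &encoding)
{
    return "<?xml version=\"1.0\" encoding=\"" + encoding + "\"?>";
}
```

The fallback argument would be the place where a
project-wide default (for example the Doxyfile option
suggested above) could be passed in.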

An even more general approach is to generate the XML
output so that it is almost language independent.  I
mean that the only human-language-related texts would be
those extracted from the documented source files.  The
generated text can be inserted via XML processing
instructions with attributes that allow the output
language and encoding of the generated texts to be
decided later.  The processing instructions could look
like this:

 <?doxtpl tpl="&trReimplementedFromList;" a1="&list001;" ?>

Some post-processor could be used to expand the doxygen
templates later.
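
Such a post-processor could be sketched roughly as
below.  Everything in it is an assumption made for
illustration: the function name expandTemplates, the two
lookup tables, and the "$1" placeholder syntax are not
defined by doxygen or by the attached document, and only
the single-argument form of the instruction shown above
is handled.

```cpp
#include <map>
#include <regex>
#include <string>

using Table = std::map<std::string, std::string>;

// Replace each <?doxtpl tpl="&NAME;" a1="&ARG;" ?> instruction by the
// language-specific template text, with "$1" (a placeholder syntax
// assumed here) substituted by the value registered for ARG.
// Instructions with unknown names are copied through unchanged.
// No XML escaping is done; the table values are taken as valid content.
std::string expandTemplates(const std::string &xml,
                            const Table &templates,
                            const Table &entities)
{
    static const std::regex pi(
        R"rx(<\?doxtpl\s+tpl="&(\w+);"\s+a1="&(\w+);"\s*\?>)rx");
    std::string out;
    std::string::size_type last = 0;
    for (std::sregex_iterator it(xml.begin(), xml.end(), pi), end;
         it != end; ++it) {
        const std::smatch &m = *it;
        const auto start = static_cast<std::string::size_type>(m.position(0));
        out.append(xml, last, start - last);   // text before the match
        last = start + static_cast<std::string::size_type>(m.length(0));
        const auto tpl = templates.find(m[1].str());
        const auto arg = entities.find(m[2].str());
        if (tpl == templates.end() || arg == entities.end()) {
            out += m.str(0);
            continue;
        }
        std::string text = tpl->second;
        const std::string::size_type slot = text.find("$1");
        if (slot != std::string::npos) text.replace(slot, 2, arg->second);
        out += text;
    }
    out.append(xml, last, std::string::npos);  // text after the last match
    return out;
}
```

With a templates table holding trReimplementedFromList
as "Reimplemented from $1." and an entities table
holding list001, the instruction shown above would be
replaced by one English sentence; swapping in another
templates table would switch the language without
regenerating the XML.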

However, there are some problems with the last mentioned
approach.  The main problem, pointed out by Dimitri, is
that doxygen uses the Qt XML parser, which is not
capable of processing entity definitions stored in
external files.  Without this capability, it does not
make sense to use references to the entities.  This is
probably the main reason why Dimitri is using an
approach that resembles the current way of using the
translator classes.

If you are a developer who participates in the XML part
of doxygen or in a generator part of doxygen, or if you
are a language maintainer familiar with XML and doxygen
internals, please read the attached document "Translator
Revisited" (just run doxygen to get the HTML form).
It is still a bit rough, but it contains many details 
that are directly related to the human languages known
to doxygen.  If the Qt XML parser became more capable,
or if some XSLT library were chosen, then the approach
described above could be used.

 <<TranslatorsRevisited.zip>> 

I will appreciate comments on the "Translator Revisited"
document, and I will update it to include the results of
the discussion; however, I am currently too busy to
produce any code for doxygen.

Regards,
  Petr
-- 
Petr Prikryl, Skil, spol. s r.o., (pri...@sk...)
