[Doxygen-develop] Doxygen's XML and encoding problems -- Translators Revisited (was : setEncoding ?
From: Prikryl,Petr <PRI...@sk...> - 2002-01-15 11:14:18
(was "setEncoding ?")

(The "Translator Revisited" working document is attached. If you participate in doxygen XML development, please read it and discuss.)

Hi Haru and other doxygeners,

Haruyuki Ohtani asked...
> I think the latest doxygen generates ISO-8859-1 encoding
> for XML output. But it does not seem to call
> QTextStream::setEncoding(QTextStream::Latin1) in
> generateXML function. Is it correct ?

I am temporarily too busy to participate in the development; however, I have been thinking about how the encoding problems should be solved. That was in the pre-XML era. The result of the analysis was that, if encoding problems are to be avoided for every language, either more than one encoding has to be supported at the same time during processing, or some kind of language-independent (i.e. encoding-independent) solution has to be used. See the explanation below. In other words, if the problem is being touched now, it would be nice to solve it carefully now. The existence of XML can make this easier, but it does not solve it automatically.

Firstly, as far as I know, an XML file can define the encoding being used, but only "per file", like:

    <?xml version="1.0" encoding="ISO-8859-2" ?>

Secondly, if the content of the file is a mixture of texts taken from doxygen comments (in the processed source files to be documented) and texts generated by the Translator classes, then the encoding must be unified during composition of the document. Either the input sources have to be converted to the encoding of the strings produced by the Translator instance, or the Translator object has to convert its results. The instantiation of the Translator class depends on the chosen (human) language, and the encoding of the generated strings depends on the decision of the language maintainer who prepared the translated text.

The encoding of the input sources depends not only on the human language used in the comments, but can also depend on the OS being used.
For example, the default encoding in MS Windows may differ from the default encoding used in Unix. This is the reason why another Doxyfile option should be introduced: the input source encoding. It is likely that some environments will start to use Unicode instead of the older 8-bit encodings (a similar problem). Because of this, some sources may use a different encoding than others. This is also the reason why another doxygen command should be introduced to define the encoding of a source file explicitly, used like this:

    /*! \file xyz.cpp
        \encoding iso-8859-2
        \brief Xxxxx...
     */

This could be used directly to define the heading of the generated XML file as shown above. Notice that this follows the approach used with XML tools and files.

An even more general approach is to generate the XML output so that it is almost language independent. I mean that the only human-language-related texts would be those extracted from the documented source files. The generated text can be inserted via XML processing instructions with attributes that allow the output language and the encoding of the generated texts to be decided later. The processing instructions could look like this:

    <?doxtpl tpl="&trReimplementedFromList;" a1="&list001;" ?>

Some post-processor could be used to expand the doxygen templates later.

However, there are some problems with the last-mentioned approach. The main problem, pointed out by Dimitri, is that doxygen uses the Qt XML parser, which is not capable of processing entity definitions stored in external files. Without this capability, it does not make sense to use references to the entities. This is probably the main reason why Dimitri is using the approach that resembles the current way of using translator classes.
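
To make the post-processing idea a bit more concrete, here is a minimal sketch in plain C++ (standard library only) of how such an expansion step might look. Everything in it is an assumption for illustration: the function name expandDoxtpl, the two lookup tables, and the "$1" placeholder convention are hypothetical and are not part of doxygen. The sketch also sidesteps the external-entity problem by treating the names between "&" and ";" as plain lookup keys instead of real XML entities.

```cpp
#include <cstddef>
#include <map>
#include <regex>
#include <string>

// Hypothetical per-language table: template name -> text, where "$1"
// marks the place for the argument passed in the a1 attribute.
using TemplateTable = std::map<std::string, std::string>;

// Hypothetical table of argument values: name -> already formatted text.
using ArgumentTable = std::map<std::string, std::string>;

// Replace each <?doxtpl tpl="&NAME;" a1="&ARG;" ?> instruction by the
// template text for NAME with ARG substituted for "$1". Instructions that
// refer to an unknown template or argument are copied through unchanged.
std::string expandDoxtpl(const std::string& xml,
                         const TemplateTable& templates,
                         const ArgumentTable& args)
{
    static const std::regex pi(
        R"re(<\?doxtpl\s+tpl="&(\w+);"\s+a1="&(\w+);"\s*\?>)re");

    std::string out;
    std::size_t last = 0;  // end of the previously copied part of xml

    for (std::sregex_iterator it(xml.begin(), xml.end(), pi), end;
         it != end; ++it) {
        const std::smatch& m = *it;
        const std::size_t start = static_cast<std::size_t>(m.position());

        // Copy the text between the previous match and this one.
        out.append(xml, last, start - last);

        const auto tpl = templates.find(m[1].str());
        const auto arg = args.find(m[2].str());
        if (tpl == templates.end() || arg == args.end()) {
            out += m.str();  // unknown name: leave the instruction as is
        } else {
            std::string text = tpl->second;
            const std::size_t pos = text.find("$1");
            if (pos != std::string::npos) {
                text.replace(pos, 2, arg->second);
            }
            out += text;
        }
        last = start + static_cast<std::size_t>(m.length());
    }

    // Copy whatever follows the last match.
    out.append(xml, last, std::string::npos);
    return out;
}
```

With a table that maps trReimplementedFromList to a string such as "Reimplemented from $1." (a made-up English text), the instruction shown above would be replaced by that sentence with the value of list001 filled in. Choosing a different table at post-processing time is what would allow the output language, and with it the encoding of the generated texts, to be decided late.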
If you are a developer who participates in the XML part of doxygen or in a generator part of doxygen, or if you are a language maintainer familiar with XML and doxygen internals, please read the attached document "Translator Revisited" (just run doxygen to get the HTML form). It is still a bit rough, but it contains many details that are directly related to the human languages known to doxygen. If the Qt XML parser became more capable, or if some XSLT library were chosen, then the processing-instruction approach described above could be used.

<<TranslatorsRevisited.zip>>

I will appreciate comments on "Translator Revisited" and I will update it to include the discussion results; however, I am currently too busy to produce any code for doxygen.

Regards,
  Petr

-- 
Petr Prikryl, Skil, spol. s r.o., (pri...@sk...)