RE: [Doxygen-develop] Adding of new (all) HTML entities?
From: Prikryl,Petr <PRI...@sk...> - 2001-07-19 14:20:05
Hi Xavier and others,

Xavier wrote...
> I would like to add new entities of HTML. I am currently interested
> in Greek letters but all what is defined in HTML 4.0 would be
> interesting.

The character entities in HTML are interpreted (i.e. rendered and
displayed) by HTML browsers. The HTML document itself should not care
what a character entity means. If a character in an HTML document is
in binary form, the browser always needs additional information -- the
charset -- to know how the character should be rendered. In other
words, it depends on the HTML browser whether the entity will be
displayed as some special character or not. So far, this is not
related to doxygen at all.

However, HTML is not the only output format of doxygen. There are
three cases when doxygen is forced to work with character entities:

1. The entity is written in the sources as an entity reference
   (e.g. &oslash;), and it is expected to be converted into another
   sequence (like "\\o{}" for LaTeX or something else for RTF, ...),
   or into binary form.

2. The entity is written in binary form in the sources and it should
   be converted into a named character entity (like '&' to &amp;).

3. This case is not related to named character entities, but
   basically it requires conversion of a binary character into some
   special, output-format-related text sequence or binary form, like
   in the first case.

Case 1: entity reference in the sources, specific output
========================================================

Firstly, character entity references are not specific to the HTML
doctypes. Entities are also known in SGML and XML (no wonder). In all
of the mentioned markup languages, they are referred to the same way:

    &identifier;

Or they may have a numeric form, related to Unicode characters
(loosely speaking).

XML and HTML output
-------------------

For the future, I wish XML would be the only form produced by doxygen
(please, read to the end before flaming me). Then a more general
approach should be taken.
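To make the idea of a general approach a bit more concrete, here is a
minimal sketch (not doxygen code) that treats anything shaped like
&identifier; or a numeric reference as opaque and copies it through,
while escaping bare &, < and >. The names ENTITY_REF and escape_markup
are hypothetical, and the accepted syntax is deliberately simplified:
ASCII names, decimal and hexadecimal numbers only.

```python
import re

# Hypothetical helper, not part of doxygen.
# Matches &name;, &#123; and &#x1F; style references (simplified syntax).
ENTITY_REF = re.compile(r"&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);")


def escape_markup(text: str) -> str:
    """Escape &, < and > but copy well-formed entity references unchanged."""
    out = []
    pos = 0
    while pos < len(text):
        ch = text[pos]
        if ch == "&":
            match = ENTITY_REF.match(text, pos)
            if match:
                # Looks like an entity reference: pass it through untouched,
                # without checking whether the name is actually defined.
                out.append(match.group(0))
                pos = match.end()
                continue
            out.append("&amp;")  # a bare ampersand
        elif ch == "<":
            out.append("&lt;")
        elif ch == ">":
            out.append("&gt;")
        else:
            out.append(ch)
        pos += 1
    return "".join(out)


if __name__ == "__main__":
    print(escape_markup("if (a < b && c) uses &oslash; and &#248;"))
```

Note that the sketch needs no table of known entity names, so an
entity defined by some DTD or declared explicitly would pass through
just like a predefined HTML one.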
In the future XML output case (and also in the HTML case now), the
easiest and the most general solution would be to recognize the syntax
of any explicitly written entity (e.g. &oslash;) and leave it
untouched. The reason is that the entity may be either defined
explicitly (XML, SGML), or predefined by the DTD (HTML), or defined in
some other way. Some entities may be predefined only for some
languages. Doxygen should be transparent and pass character entities
through as if they were any other word (with & and ;). The only
characters that must be converted for XML, SGML, and HTML outputs
should be & and < (and probably > for symmetry).

LaTeX, RTF, and other specific outputs
--------------------------------------

The problem is that character entity references -- and let's also
include binary characters from the 3rd case -- must be converted to
the very special form required by the output format. Moreover, if a
special binary character is used in the source, the conversion process
should generally consider three different character encodings at the
same time:

- source encoding
- internal encoding of doxygen -- see TranslatorXxxx classes
- output encoding

Also, one human language may use several possible encodings (e.g.
ISO, Windows, DOS, or some former national de facto standard).

This is really a nightmare. If doxygen should work with everything on
the input and everything on the output, it would take a long time to
solve the problems. Solution? XML!

Case 2: binary in sources, converted for the output
===================================================

If the source contains special binary characters, then the HTML (XML,
SGML) output should also be able to accept the unchanged binary form
of the characters. In that case, the HTML (...) document should
explicitly say what character set is used:
    <meta http-equiv="Content-Type" content="text/html; charset=windows-1250">

For that purpose, the TranslatorXxxx implements the method
idLanguageCharset(), which is used to produce the above-mentioned meta
tag in the HTML output.

Again, doxygen should be transparent and pass character entities
through as if they were any other word (with & and ;). The only
characters that must be converted for XML, SGML, and HTML outputs
should be & and < (and probably > for symmetry).

Another question is how to produce a multi-language document. I am
afraid that this cannot be solved without XML output as the final or
intermediate format.

Conclusion: XML output is the answer!
=====================================

Trying to convert every special binary character to its character
entity reference or vice versa -- with respect to the output format
and the input/output character encodings -- would require much work.
It is not acceptable with respect to the future, nor with respect to
the work already done in XML and similar fields.

The conversion into other output formats (like RTF, LaTeX, PDF, etc.)
should be done by XML tools. Otherwise, doxygen would have to
implement all the character encoding conversions, character entity
reference conversions, and possibly other things that are already
implemented by the XML tools.

In my personal opinion, future doxygen should focus on producing
quality XML output ONLY -- possibly using more than one markup
(DocBook XML being the major one, possibly also some proprietary XML,
in a sense similar to producing RTF specific to some versions of
Microsoft Word). Focusing on a single output form can make doxygen
lighter and faster, with fewer bugs.

In XML, one can say that a block of text uses a certain character
encoding -- no problem with multiple character encodings in one
document and even in one file -- which would be great for doxygen:

* The block of text from the input can be passed without problems in
  binary form to the output.
* The sentences generated by doxygen can use some internal encoding
  without any conversion.
* If a character is written as a known entity reference, then it is
  always unique and can be passed through without problems.

In the best case, doxygen could (in addition) call external XML tools
to produce other output forms (technically the same way as 'dot' is
called these days -- "user need not know that").

For backward compatibility, the other directly produced output formats
(i.e. the outputs produced these days) should stay supported for a
while, but their development should be frozen as soon as the same or
very similar output can be produced using the XML output and XML
tools.

See you,
   Petr

P.S. I may be wrong. But I am always ready to listen to your
arguments.

Do not multiply [character] entities.
                                 Occam ;-)

--
Petr Prikryl, SKIL, spol. s r.o., pri...@sk...