[Doxygen-develop] RE: Doxygen and XML discussion... (was: Adding of new (all) HTML entities?)
From: Prikryl,Petr <PRI...@sk...> - 2001-07-23 10:53:33
Hi Xavier,

(Notice: I have just received new mail from doxygen-develop by Dimitri.
The notes below do not reflect his view -- they were written earlier.)

Xavier wrote...
> "Prikryl,Petr" wrote:
> > Xavier wrote...
> > > I would like to add new entities of HTML. I am currently
> > > interested in Greek letters, but everything that is defined in
> > > HTML 4.0 would be interesting.
[...]
> > There are three cases when doxygen is forced to work with character
> > entities:
> >
> > 1. The entity is written in the sources as the entity reference
> >    (i.e. &oslash;), and it is expected to be converted into another
> >    sequence (like "\o{}" for LaTeX or something else for RTF, ...),
> >    or into binary form.
>
> That was the case I am interested in. Very convenient if you work
> on an American keyboard and want to have French (or other)
> special characters.

If the output were XML, you could always use the known references to
character entities.

> > 2. The entity is written in binary form in the sources and it
> >    should be converted into a named character entity (like '&' to
> >    &amp;).
>
> Again, as far as I am concerned, I prefer using entities because they
> will be interpreted the same whatever charset is used. (Well, I hope.)

The "predefined" character entities that we are talking about are
unambiguous. However, you cannot be sure that every browser will
render them.

> > For the future, I wish XML were the only form produced by doxygen
> > (please read to the end before flaming me). Then the more general
> > approach should be taken.
>
> It's certainly a good idea .. if there is still the possibility
> (integrated in Doxygen) to have the current multiple outputs: HTML,
> LaTeX, and others.

Doxygen could call the XML tool internally for the generated XML
output. Users want the result. They do not care how the things are
implemented.
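Case 1 above (an entity reference in the sources, converted for the
output) amounts to a per-output lookup table. A minimal sketch in
Python, with a hypothetical three-entry table (a real converter would
cover the whole HTML 4.0 entity set):

```python
import re

# Hypothetical mini-table mapping entity names to LaTeX sequences;
# only three entries here, for illustration.
ENTITY_TO_LATEX = {
    "oslash": r"\o{}",    # o with slash
    "eacute": r"\'{e}",   # e with acute accent
    "amp":    r"\&",      # ampersand
}

def entities_to_latex(text):
    """Replace known &name; references by LaTeX escapes;
    unknown references are passed through untouched."""
    def repl(match):
        return ENTITY_TO_LATEX.get(match.group(1), match.group(0))
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", repl, text)

print(entities_to_latex("caf&eacute; &amp; &oslash;"))  # caf\'{e} \& \o{}
```

Passing unknown references through untouched is exactly the
"transparent" behaviour argued for below: entities the table does not
know survive for a later tool to resolve.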
> > In the XML output case in the future (and also in the HTML case
> > now), the easiest and most general solution would be to recognize
> > the syntax of any explicitly written entity (i.e. &oslash;) and
> > leave it untouched.
>
> I think this would be a good idea. Is there a single set of such
> entities? I don't mean the numerical entities, which are not
> especially convenient, even if they should also be available in case
> this idea is implemented. I mean a readable set like &eacute; -- OK,
> it will be in English, but it's more readable than &#233; :)

There is not a single set but several sets of definitions of the
entities. As the characters are assigned a unique binary
identification, the entities are defined unambiguously. They have ISO
definitions. A handy place to look for their names, Unicode numbers,
glyphs, and ISO descriptions is, for example, the on-line version of
"DocBook: The Definitive Guide" by Norman Walsh and Leonard Muellner.
Look for "III. DocBook Character Entity Reference".

> > The reason is that the entity may be either defined explicitly
> > (XML, SGML), or predefined by the DTD (HTML), or somehow else. Some
> > entities may be predefined only in some language supports.
> >
> > Doxygen should be transparent and pass the character entities
> > through as if they were any other word (with & and ;).
>
> This would be nice if you have only one output in XML or HTML, but
> anyway, you would need something to make the conversion for the
> other outputs.

This conversion can be done using XML tools.

> > Moreover, if a special binary character is used in the source, the
> > conversion process should generally consider three different
> > character encodings at the same time:
> >
> >   - source encoding
> >   - internal encoding of doxygen -- see the TranslatorXxxx classes
> >   - output encoding
> >
> > And also, one human language may use several possible encodings
> > (e.g. ISO, Windows, DOS, or some former national de facto standard).
> > This is really a nightmare.
> > If doxygen should work with everything on the input and everything
> > on the output, it would take a long time to solve the problems.
> >
> > Solution? XML!
>
> So you mean to use special character entities as a standard for
> Doxygen.

Not at all. Here, by "special (binary) character" I mean (informally)
any character that is not ASCII. While I can occasionally type some
&oslash; in my Czech text, I will always prefer the normal way of
entering special Czech characters (using a Czech keyboard). They are
entered into the file as bytes greater than (say) 127. Or possibly, I
may use a Unicode editor later. The text should still be readable --
even in the editor window. Writing my name like "Petr P&rcaron;ikryl"
should be really rare.

What I could say is that it should not matter whether you decide to
type a special character in binary form (in some encoding) or using a
named character entity. During the processing of XML documents, the
named character entities are replaced by the binary encoded characters
(Unicode), so there is no difference whether you type in a named
character entity reference (with the correct entity definition) or the
binary form (in the correct encoding context).

Unlike the XML tools, doxygen does not work internally with Unicode.
This is the reason, in my opinion, why general encoding conversion
support would be very difficult to implement.

> What about the \latexonly part in the source documentation?
> It must continue to work, in my opinion. If not, some users could
> be disappointed.

Well, I was a big fan of LaTeX earlier. These days I think that DocBook
XML is better for the future. Anyway, I think that it is not a big
problem to store the LaTeX source reliably inside the intermediate XML
document and extract the real LaTeX source from that XML doc.
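The claimed equivalence -- entity reference versus directly encoded
character -- can be illustrated with Python's built-in HTML entity
tables (assuming the standard entity name &rcaron; for the Czech
r-with-caron):

```python
from html import unescape

# An entity-reference form and a directly encoded (Unicode) form of
# the same name collapse to identical text once the document is parsed.
entity_form = unescape("Petr P&rcaron;ikryl")
binary_form = "Petr P\u0159ikryl"  # U+0159, LATIN SMALL LETTER R WITH CARON
print(entity_form == binary_form)  # True
```

An XML tool performs the same replacement while parsing, which is why
the author can stay indifferent to how the character was typed.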
> > Case 2: binary in sources, converted for the output
> > ===================================================
> >
> > If the source contains the special binary characters, then the HTML
> > (XML, SGML) output should be able to accept also the unchanged
> > binary form of the characters. In that case, the HTML (...) document
> > should explicitly say what character set is used:
> >
> >   <meta content="text/html; charset=windows-1250">
> >
> > For that purpose, the TranslatorXxxx classes implement the method
> > idLanguageCharset(), which is used to produce the above mentioned
> > meta tag in the HTML output.
>
> The charset should be specified in the input document in this case.
> Otherwise there could be some mismatch.

Exactly. Now, doxygen expects a single implicit encoding of the input
sources, a single (implicit) encoding of the internal strings, and a
single implicit encoding of the output. If only one input encoding were
used in all sources, then the encoding could be marked in the Doxyfile
(configuration). However, there should also be a possibility to change
the encoding on-the-fly -- i.e. some new doxygen tags. This way some
internal tables could be used to unify the input, internal, and output
encodings. The better approach would be to produce an intermediate XML
document with blocks of text with explicitly marked encodings (less
work for doxygen, no problem for XML tools).

> > Another question is how to produce a multi-language document.
> > I am afraid that this cannot be solved without XML output as the
> > final or intermediate format.
>
> In this case, there is one solution, I don't know if it's the only
> one: Unicode. With other charsets on 8 bits it's not possible
> to have a complete multiple-language facility.

Yes, exactly. Unless you can pass the problem to the XML tools. In
other words, if sources were stored using Unicode (or a similar)
encoding, no problem would emerge. However, this is not realistic.
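The unification of input, internal, and output encodings discussed
above can be sketched in Python, whose codecs go through Unicode in
exactly this way (the two charsets are chosen just for illustration):

```python
# Source encoding -> internal (Unicode) -> output encoding.
# Decoding at the input boundary and encoding at the output boundary
# keeps the internal processing independent of both charsets.
text = "P\u0159ikryl"                          # internal form, Unicode
src_bytes = text.encode("windows-1250")        # as a Windows source file
internal = src_bytes.decode("windows-1250")    # back to Unicode
out_bytes = internal.encode("iso-8859-2")      # e.g. a Latin-2 HTML output
print(out_bytes.decode("iso-8859-2") == text)  # True
```

A tool that works internally with Unicode gets this conversion almost
for free; one that does not (doxygen, at the time of writing) has to
carry its own conversion tables.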
The majority of USERS (and also tools) is not adapted to Unicode yet.
On the other hand, an XML document can be produced using normal 8-bit
character editors, with explicit tags marking the used encoding around
the blocks of text. The XML tools convert the 8-bit blocks of text into
Unicode internally, during the processing. Doxygen would not need to
care, if it produced XML output postprocessed by XML tools.

> > The conversion into other output formats (like RTF, LaTeX, PDF,
> > etc.) should be done by XML tools. Otherwise, doxygen would have to
> > implement all the character encoding conversions, character entity
> > reference conversions, and possibly other things that are already
> > implemented by the XML tools.
>
> OK, but they must be integrated into Doxygen and transparent for the
> user who prefers HTML to XML, even if this can be customized to have
> any output.

This would be ideal. But the integration does not necessarily mean
putting everything into one binary executable.

> > In my personal opinion, future doxygen should focus on producing
> > quality XML output ONLY -- possibly using more than one markup
> > (DocBook XML being the major one, possibly also some proprietary
> > XML, in a sense similar to producing RTF specific to some versions
> > of Microsoft Word). Focusing on a single output form can make
> > doxygen lighter, faster, and less buggy.
>
> Yes, but one interesting feature of Doxygen is precisely this
> multiple input/output. If using an intermediate XML output could be
> interesting, Doxygen should be able to provide the other outputs,
> even if it is done using other existing tools. And to use the
> multiple inputs. Currently only LaTeX and HTML, I believe.

Yes, they are the strong and weak points at the same time. The strong
point is that having one binary executable, you have everything you
need. The problem is that it generally works correctly only for English
and possibly for a few other languages. To make it work for e.g.
Czech would require some non-trivial internal changes to doxygen.
Because of that, I guess that there will always be a group of users who
are satisfied with doxygen as-is, and who would be against big changes
(the English-language users). They may claim that there is nothing big
to improve.

Another weak point is that the set of generated outputs is restricted
by doxygen. As far as I know, DocBook XML output could be used to
produce all the outputs that one gets from doxygen today. Similarly,
the input formats are restricted by the internal parser (basically
C/C++/Java) and only slightly extended via input filters.

Someone named XML "the ASCII of the future", and there is strong
evidence that the DocBook DTD will dominate in the area of writing
technical documentation. As far as I can imagine, there are no
limitations with respect to what doxygen can produce now and also in
the future. DocBook is said to be the second most widespread DTD (HTML
is the first one). However, HTML is rather more presentation-oriented.
In other words, HTML is used rather as the final form of documents.
DocBook, on the other hand, is better for capturing the structure of
the document, with all the necessary redundancy for postprocessing into
more output forms. LaTeX is somewhere in between. While its main plus
is the ability to capture the structure of the document, it is still
very special and oriented to producing the final output, including some
visual details that are not reliably convertible into other forms of
output.

> Now choices have to be made and precise specifications written in a
> separate document. Mails are not enough to discuss seriously.

Exactly. In my opinion, we should focus on the back-end at the
beginning.

--
Petr Prikryl, SKIL, spol. s r.o., pri...@sk...