[Doxygen-develop] Doxygen and XML... (was: Status of XML development? (was Adding o f..))

Hi,

Dimitri wrote...
> 
> I see XML output as an intermediate interface, that would allow
> several front-ends to produce specific output (e.g. html output or
> something very different like code metrics). The XML output would
> contain all information, the front-ends will then pick the appropriate
> information and transform that into the actual output.

The more I think about it, the more I also lean towards an internal
XML DTD (say, doxygen internal XML).

> In theory there are plenty of XML tools that can transform XML output
> into something else. In practice these tools are just not there (at
> least I haven't seen them). All that there really is, is an easy way to
> parse XML and build up the structure contained in an XML file into
> structures in memory. So the plan is to provide a C++/Qt based XML 
> parser that understands doxygen's XML output. People that wish to
> add support for another output format can do so by using the structures
> built up by this parser.

I am very new to XML, but there are tools used with DocBook XML that
are more general and not limited to supporting DocBook.  This
requires better analysis.

> With respect to DocBook format: I have looked at it, but I think it
> covers only 20% of what doxygen will produce. So any docbook tool
> (which are currently all SGML based by the way), wouldn't be very
> useful.

I am not sure (I am just starting with DocBook), but I think that
DocBook is much richer than, say, HTML or LaTeX, and it is very
suitable for producing the end documents.  It may not be suitable as
the internal XML format, but I would see it as the main final output
format.  Let's think about the following approach:

  input sources
   |
   +--> doxygen internal XML (by doxygen parsers)
           |
           +--> DocBook XML
           |      |
           |      +--> HTML
           |      +--> RTF
           |      +--> jadetex --> DVI, PDF, PS
           |      +--> etc.
           |
           +--> some other postprocessing of the internal doxygen XML
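
To make the first two arrows of the diagram concrete, here is a rough
sketch in Python of a front-end that maps an internal XML tree onto
DocBook elements.  The internal element names (compound, member,
brief) are invented for this illustration and are not doxygen's
actual format; only <section>, <title> and <para> are real DocBook
tags.

```python
import xml.etree.ElementTree as ET

# Hypothetical internal XML; the element names (compound, member, brief)
# are assumptions made for this sketch, not doxygen's real output.
INTERNAL_XML = """\
<compound kind="class" name="Parser">
  <brief>Reads the internal XML.</brief>
  <member kind="function" name="parse">
    <brief>Parses one file.</brief>
  </member>
</compound>
"""


def add_titled_section(parent, source):
    """Append a DocBook <section> for one internal element and return it."""
    section = ET.SubElement(parent, "section") if parent is not None else ET.Element("section")
    ET.SubElement(section, "title").text = source.get("name")
    brief = source.find("brief")
    if brief is not None:
        ET.SubElement(section, "para").text = brief.text
    return section


def internal_to_docbook(internal_root):
    """Map the hypothetical internal tree onto nested DocBook sections."""
    top = add_titled_section(None, internal_root)
    for member in internal_root.findall("member"):
        add_titled_section(top, member)
    return top


docbook = internal_to_docbook(ET.fromstring(INTERNAL_XML))
docbook_text = ET.tostring(docbook, encoding="unicode")
print(docbook_text)
```

The point of the sketch is only that the front-end never sees the
input sources; it works purely on the intermediate tree, so another
front-end (e.g. for code metrics) could reuse the same input.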

The important thing to note is that DocBook is not exclusively SGML
based.  While this may have been true in the past, the majority of
DocBook users probably use DocBook XML these days.  Norman Walsh,
one of the DocBook leaders, also considers XML to be the future of
DocBook.  I suggest focusing on DocBook XML exclusively (instead of
thinking about DocBook SGML).

What needs clarifying is the claim above that DocBook covers only
20% of what doxygen produces.

> I do not know how these ideas match/conflict with the character 
> encoding problems mentioned by Petr. Would using XML like this still
> solve all those problems?

I guess so -- XML will always help to solve the problems.  At
least the first parsing phase can be done without encoding
problems.  Once the internal XML is correctly marked, all problems
with languages and encodings are covered by the XML standard.

What I see as extremely important here is the correct use of the
encoding attribute and the xml:lang attribute.  If the standard
approach is chosen, this implies that the XML output must be split
into separate files, at least by encoding.  Here are the reasons:

 a) If an XML document consists of more than one file, one of the
    files is the main one (it contains the DTD identification), and
    the other files are read as so-called "external entities"
    (basically, &myfile1; is expanded to the content of the file it
    names; each such entity refers to one separate file).

 b) Each XML file is assumed to be encoded in UTF-8 (or UTF-16) by
    default.  If another encoding is used, the first line should
    contain:

      <?xml version="1.0" encoding="windows-1250"?>

    for the main file or

      <?xml encoding="windows-1250"?>

    for the other files (i.e. the external entities).

    Then, the rest of the file contains the text encoded in the
    mentioned encoding.

    This also means that a new Doxyfile option should be introduced
    for the implicit encoding of the input sources, and a new doxygen
    tag for explicitly marking the encoding of a file's content.
    This way, it would be possible to process project files with
    different encodings (for legacy and OS-dependent reasons).

 c) Language-specific text (i.e. not encoding-specific, but really
    things like English, French, Portuguese) can be marked as such
    in any element using the xml:lang attribute.  Example (here in
    <chapter> and <para>, but this can be inside "any" element):

      <chapter xml:lang="en">

      <para>Some text in English</para>

      <para xml:lang="fr">Bonjour (i.e. some exceptional text in
      French -- excuse mois; I have close to zero knowledge of
      French ;-).</para>

      <para xml:lang="pt-BR">Brazilian Portuguese</para>
      ...
      </chapter>

    This also means that doxygen could define new tags for marking
    text in a language other than the base (human) language of the
    sources.  The Doxyfile should define a new option that states
    the implicit language of the input sources -- possibly
    INPUT_LANGUAGE.  This can (of course) differ from the existing
    OUTPUT_LANGUAGE.
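
Points b) and c) can be tried out with a few lines of Python and its
standard xml.etree.ElementTree module.  This is only a sketch under
assumptions: the Czech sample text is mine, and the helper name
paragraphs_with_language is invented.  It shows that the parser picks
the encoding up from the XML declaration, and that an element without
its own xml:lang inherits the language of its nearest ancestor.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

document_text = (
    '<?xml version="1.0" encoding="windows-1250"?>\n'
    '<chapter xml:lang="en">\n'
    '  <para>Some text in English</para>\n'
    '  <para xml:lang="cs">Žluťoučký kůň</para>\n'
    '</chapter>\n'
)
# Store the document as windows-1250 bytes, the way a legacy file would be.
document_bytes = document_text.encode("windows-1250")


def paragraphs_with_language(element, inherited_lang=None):
    """Yield (language, text) for each <para>, inheriting xml:lang downwards."""
    lang = element.get(XML_LANG, inherited_lang)
    if element.tag == "para":
        yield lang, element.text
    for child in element:
        yield from paragraphs_with_language(child, lang)


# The parser reads the encoding from the declaration in the bytes themselves.
root = ET.fromstring(document_bytes)
result = list(paragraphs_with_language(root))
for lang, text in result:
    print(lang, text)
```

The caller never names the encoding; it travels with the file, which
is what makes per-file encodings workable.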

The sentences generated by doxygen translators can be produced as
named entity definitions in one file -- this would require
further analysis.

IMPORTANT: The output could (possibly) even be generated
independently of the language, and the translators could possibly
collapse into one general class.  Internationalization could
possibly be done via language-dependent entity rendering in DSSSL
or XSL files (I am not very good here yet, but at least for DSSSL
it is done this way in DocBook).
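
A small Python sketch of the named-entity idea, with several
assumptions: the entity name trCalledFrom and both sentences are
invented for illustration (they are not taken from doxygen's
translators), and the entity definitions sit in the internal DTD
subset rather than in a separate file, because Python's standard
parser does not load external entities by default.

```python
import xml.etree.ElementTree as ET

# Hypothetical translator sentences, one entity set per output language.
ENTITY_SETS = {
    "en": '<!ENTITY trCalledFrom "This method is called from:">',
    "cs": '<!ENTITY trCalledFrom "Tato metoda je volána z:">',
}

# Language-neutral body: it refers to the sentence only by entity name.
BODY = "<section><para>&trCalledFrom;</para></section>"


def render(output_language):
    """Parse the body with the entity definitions of one language."""
    document = (
        "<!DOCTYPE section [\n"
        + ENTITY_SETS[output_language]
        + "\n]>\n"
        + BODY
    )
    # The parser expands &trCalledFrom; while building the tree.
    return ET.fromstring(document).find("para").text


print(render("en"))
print(render("cs"))
```

The body is generated once; only the entity set changes per language,
which is the sense in which the translators could collapse into one
general class.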

I still think that DocBook XML should be the main output to files.
The internal XML could be so intermediate that it exists only in
memory, in the form supported by some standard XML library.

For that reason, the internal XML should use DocBook tags wherever
no more specialized tag is needed (in the sense of preferring <para>
to <p>).

> The nice thing about having an intermediate
> file is that the parser and front-end could also be written in another 
> language such as Python. Furthermore, other input parsers could produce
> the same XML output and benefit from the available front-ends.
> 
> In summary doxygen would consist of the following:
> 
> - the main engine as a library
> - the xml parser as a library
> - an extendable configuration parser as a library (contains the
>   config options for the engine, but can be dynamically extended by the
>   front-ends to support more options).
> - a number of front-ends, either as libraries or as standalone tools
> - some glue to make a user friendly tool out of these.

As far as I understand, the internal XML format will not contain any
sentences generated by doxygen translators.
Things like the text surrounding, say, the list of places from which
a method is called are not generated into the internal XML.
Am I right?  I would like to be ;-)

Regards,

Petr
-- 
Petr Prikryl, SKIL, spol. s r.o., pri...@sk...