Thread: [Doxygen-develop] strategies for XHTML support
From: Francesco M. <f18...@ya...> - 2008-03-02 09:47:20
Attachments:
html_to_xhtml.tar.bz2
|
Hi,

before making a kilometric patch I'd like to discuss the changes I think are necessary for XHTML support:

1) split BaseOutputDocInterface::writeListItem into two functions, startItemListItem and endItemListItem, so that the necessary </li> can be emitted everywhere it is required.
2) split BaseOutputDocInterface::writeDescItem into two functions, startDescForItem and endDescForItem, so that the necessary </dd> can be emitted everywhere it is required.
3) remove deprecated HTML4 attributes (compact, nowrap, some align, etc.) from htmlgen.cpp and htmldocvisitor.cpp; turn empty tags from e.g. <br> into <br/>; rename the 'name' attribute to 'id'.
4) change MemberDef::setAnchor to generate an anchor ID which always starts with a non-numeric character (as required by the 'id' attribute); I've done this by just prefixing the MD5 with 'a'.
5) the most tedious one: pair _every_ startParagraph/newParagraph call with an endParagraph call.

I attach the _preview_ version of a patch doing these 5 things to show you better what I mean...

Please tell me if you think I'm doing something wrong: I hate doing useless work :)

Francesco
|
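For illustration only, point 4 above might be sketched as follows. This is not the code from the attached patch: makeAnchorId and startsLikeId are hypothetical names, and the sketch only assumes that the anchor is built from an MD5 hex digest, which may begin with a digit while an 'id' value may not.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper (not taken from the Doxygen sources): turn an MD5 hex
// digest into a string that is usable as the value of an 'id' attribute.
std::string makeAnchorId(const std::string &md5Hex)
{
    assert(!md5Hex.empty());  // an empty digest would give a useless anchor
    // A hex digest can start with 0-9, which an 'id' value must not do.
    // Always prefixing 'a' keeps every anchor in the same shape.
    return "a" + md5Hex;
}

// Hypothetical check: true if the first character is a letter or '_',
// i.e. the string does not start with a digit or punctuation.
bool startsLikeId(const std::string &s)
{
    return !s.empty() &&
           (std::isalpha(static_cast<unsigned char>(s[0])) || s[0] == '_');
}
```

Prefixing unconditionally, rather than only when the digest starts with a digit, avoids the unlikely case where two different digests map to the same anchor.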
From: Francesco M. <f18...@ya...> - 2008-03-02 13:01:08
|
Francesco Montorsi wrote:
> I attach the _preview_ version of a patch doing these 5 things to show
> you better what I mean...

I'm a bit blocked by the table generation code -- I really have trouble understanding what it does and why. I must say that all the output generation code is quite messy.

I've attached a patch with the results I obtained so far here:

http://bugzilla.gnome.org/show_bug.cgi?id=519886

I hope to be able to complete it in the future, but I've realized that in the near term I must seek another solution to my problems (*).

btw if you are not interested in reaching 100% well-formedness in a single patch, then the one I've attached seems to work quite well in terms of output rendering (i.e. there are no big differences from the standard doxygen HTML4 output, just some spacing differences). I'm not sure, however, that it behaves well with respect to the other output formats...

Francesco

(*) = I wanted to enable XHTML output in order to run XSLT stylesheets over it, instead of running them over the doxygen XML output.
|
From: <do...@ke...> - 2008-03-02 14:23:14
|
On Sun, Mar 02, 2008 at 02:00:35PM +0100, Francesco Montorsi wrote:

> btw if you are not interested in reaching 100% well-formedness in a single
> patch, then the one I've attached seems to work quite well in terms of
> output rendering (i.e. there are no big differences from the standard doxygen
> HTML4 output, just some spacing differences). I'm not sure, however, that it
> behaves well with respect to the other output formats...

It might also break the post-processing that some places use. I know several companies which post-process Doxygen output to create their own documentation, but I don't know how robust their processing is.

> (*) = I wanted to enable XHTML output in order to run XSLT stylesheets
> over it, instead of running them over the doxygen XML output.

Have you tried processing it with the W3C 'tidy' program? That usually does a pretty good job of producing XHTML from HTML with close tags missing (what lynx calls "tag soup"), and will produce XML as well as XHTML output. (Running it on the number of files Doxygen creates is a pain and slow, though, and you need to disable its comments about how 'bad' the original is.)

Chris C
|
From: Francesco M. <f18...@ya...> - 2008-03-05 14:15:14
|
Hi,

Francesco Montorsi wrote:
> In conclusion: I need a pause and some help to complete this patch :)
>
> What's your (doxygen team) interest in XHTML?
> Isn't it one of your priorities?

so, there's no interest in XHTML development?
No one willing to help me?

That's a pity...

Francesco
|
From: Francesco M. <f18...@ya...> - 2008-03-02 15:01:12
|
do...@ke... wrote:
> On Sun, Mar 02, 2008 at 02:00:35PM +0100, Francesco Montorsi wrote:
>
>> btw if you are not interested in reaching 100% well-formedness in a single
>> patch, then the one I've attached seems to work quite well in terms of
>> output rendering (i.e. there are no big differences from the standard doxygen
>> HTML4 output, just some spacing differences). I'm not sure, however, that it
>> behaves well with respect to the other output formats...
>
> It might also break the post-processing that some places use. I know
> several companies which post-process Doxygen output to create their own
> documentation, but I don't know how robust their processing is.

I think that post-processing the HTML output will be much simpler if doxygen starts outputting XHTML instead of HTML4, which is not valid XML.

Companies doing this kind of post-processing will eventually need to change their scripts, but that is probably true after every doxygen release, since the structure of the generated HTML is not guaranteed to remain the same and in fact most times it changes from one release to the next.

>> (*) = I wanted to enable XHTML output in order to run XSLT stylesheets
>> over it, instead of running them over the doxygen XML output.
>
> Have you tried processing it with the W3C 'tidy' program? That usually
> does a pretty good job of producing XHTML from HTML with close tags
> missing (what lynx calls "tag soup"), and will produce XML as well as
> XHTML output. (Running it on the number of files Doxygen creates is a
> pain and slow, though, and you need to disable its comments about how
> 'bad' the original is.)

tidy does a good job but I think it's a "dirty" solution: its output is not guaranteed to be the "right" one (it repairs the HTML as best it can, but it's still a machine and can't look at the context to understand what the right fix is) and may generate rendering artefacts (caused by syntactically correct but semantically wrong markup).

It's true that cleaning the generated XHTML of the doxygen samples with 'tidy' (I'm testing it with my patch applied) shrinks the validation errors from about 700 to about 30 (great!!), but those 30 still need human review. In the bigger project which I'm trying to convert to Doxygen (FYI it's wxWidgets), there would still be hundreds of errors to handle by hand. Not feasible.

It's the doxygen output which should be correct without any further processing. Doxygen cannot continue to produce HTML4 forever (*)! Technologies are evolving and I think the switch from HTML4 to XHTML is worth some trouble/regressions.

It's just that sometimes I think that all the doxygen sources should be entirely rewritten and reorganized (with more comments!!) in order to fix all of these errors.

In conclusion: I need a pause and some help to complete this patch :)

What's your (doxygen team) interest in XHTML?
Isn't it one of your priorities?

Francesco

(*) = I also strongly doubt it produces VALID HTML4 now; testing that is not easy, since an HTML4 validation test is much more difficult to run than an XHTML one and requires me to upload the generated output file by file to the W3C validator.
|
From: <do...@ke...> - 2008-03-02 15:34:57
|
On Sun, Mar 02, 2008 at 04:00:10PM +0100, Francesco Montorsi wrote:

> I think that post-processing the HTML output will be much
> simpler if doxygen starts outputting XHTML instead of HTML4, which is
> not valid XML.

Certainly it should be easier to parse if it is valid XML, if they were starting from scratch. The problem is with existing translators expecting the non-valid format.

> Companies doing this kind of post-processing will eventually need to
> change their scripts, but that is probably true after every doxygen
> release, since the structure of the generated HTML is not guaranteed to
> remain the same and in fact most times it changes from one release to
> the next.

I wonder how much it does. I don't know, I'm not directly in touch with those places which do that sort of transformation.

> Doxygen cannot continue to produce HTML4 forever (*)!
> Technologies are evolving and I think the switch from HTML4 to XHTML is
> worth some trouble/regressions.
>
> It's just that sometimes I think that all the doxygen sources should be
> entirely rewritten and reorganized (with more comments!!) in order to
> fix all of these errors.

I have had the feeling that what it should be producing is just XML, and then have back-ends which produce whatever other formats people want (XSLT could do most of them).

"Rewrite from scratch" is my mantra with almost everything (especially my own code), but the time and effort to do that tends to be prohibitive. Especially when what's there is 'almost' right.

> (*) = I also strongly doubt it produces VALID HTML4 now; testing that is
> not easy, since an HTML4 validation test is much more difficult to run than
> an XHTML one and requires me to upload the generated output file by
> file to the W3C validator.

Don't they do a stand-alone validator? Most people do want to validate entire sites or at least sets of pages.

Chris C
|
From: Francesco M. <f18...@ya...> - 2008-03-02 16:40:20
|
do...@ke... wrote:
> On Sun, Mar 02, 2008 at 04:00:10PM +0100, Francesco Montorsi wrote:
>
>> I think that post-processing the HTML output will be much
>> simpler if doxygen starts outputting XHTML instead of HTML4, which is
>> not valid XML.
>
> Certainly it should be easier to parse if it is valid XML, if they were
> starting from scratch. The problem is with existing translators
> expecting the non-valid format.

this is not IMO a good reason to continue generating HTML4 instead of XHTML...

>> Companies doing this kind of post-processing will eventually need to
>> change their scripts, but that is probably true after every doxygen
>> release, since the structure of the generated HTML is not guaranteed to
>> remain the same and in fact most times it changes from one release to
>> the next.
>
> I wonder how much it does. I don't know, I'm not directly in touch with
> those places which do that sort of transformation.

I'm not either, so I don't know for sure...

>> Doxygen cannot continue to produce HTML4 forever (*)!
>> Technologies are evolving and I think the switch from HTML4 to XHTML is
>> worth some trouble/regressions.
>>
>> It's just that sometimes I think that all the doxygen sources should be
>> entirely rewritten and reorganized (with more comments!!) in order to
>> fix all of these errors.
>
> I have had the feeling that what it should be producing is just XML, and
> then have back-ends which produce whatever other formats people want
> (XSLT could do most of them).

does any backend based on the doxygen XML output exist? I fear that with XSLT alone it's going to be very difficult to generate something which resembles the current doxygen HTML output.

> "Rewrite from scratch" is my mantra with almost everything (especially
> my own code), but the time and effort to do that tends to be
> prohibitive. Especially when what's there is 'almost' right.

however I think that with the current startXXX()/endXXX() paradigm it's too easy to make errors and forget e.g. a closing tag somewhere. If doxygen used a more object-oriented approach:

    OutputNode *n = outputList->appendRootNode();
    OutputNode *p = n->appendParagraph();
    p->writeClassMemberList();
    ...
    outputList->dumpOutputTree();

it would be impossible to forget closing tags (for HTML, each output node would write <myself>[children nodes]</myself>) or to generate invalid trees.

Obviously this approach works well only for tree-structured docs like (X)HTML. I fear that Doxygen (at least when it was started) placed too much emphasis on the generation of other formats like man or latex. Now HTML is by far the most important format it generates, and if the *def.cpp files were coded in a way like the one mentioned above, the generated HTML would be of higher quality with less programming effort (shorter and more readable code).

>> (*) = I also strongly doubt it produces VALID HTML4 now; testing that is
>> not easy, since an HTML4 validation test is much more difficult to run than
>> an XHTML one and requires me to upload the generated output file by
>> file to the W3C validator.
>
> Don't they do a stand-alone validator? Most people do want to validate
> entire sites or at least sets of pages.

there's no free command-line validator for HTML4 AFAIK. The W3C publishes the sources of its validator, but you can install it locally only by setting up an apache installation. And anyway you still have to validate each file by hand. Not feasible for projects with big documentation file sets.

XHTML is way easier to validate. The patch I proposed comes with an archive containing a simple script which validates an arbitrary number of HTML files from the command line and nicely reports all errors in a log file.

Francesco
|
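To make the tree-based idea above a bit more concrete, here is a minimal C++ sketch. It is only an illustration under assumptions: this OutputNode class, with its appendChild, appendText and toXhtml members, is hypothetical and exists neither in Doxygen nor in the patch. The point it tries to show is that the closing tag is written by the same function that writes the opening one, so callers cannot forget it.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical class, not part of Doxygen: one node of an output tree.
// A node with an empty tag name is a text node.
class OutputNode
{
public:
    explicit OutputNode(const std::string &tag) : m_tag(tag) {}

    // Append a child element and return it so callers can keep nesting.
    OutputNode *appendChild(const std::string &tag)
    {
        assert(!tag.empty());  // element nodes need a name
        m_children.push_back(std::unique_ptr<OutputNode>(new OutputNode(tag)));
        return m_children.back().get();
    }

    // Append character data; markup characters are escaped here, once.
    void appendText(const std::string &text)
    {
        std::unique_ptr<OutputNode> node(new OutputNode(""));
        node->m_text = escape(text);
        m_children.push_back(std::move(node));
    }

    // Serialize the subtree. Opening and closing tags are emitted together,
    // so the result is well-formed by construction.
    std::string toXhtml() const
    {
        if (m_tag.empty()) return m_text;                     // text node
        if (m_children.empty()) return "<" + m_tag + " />";   // empty element
        std::string out = "<" + m_tag + ">";
        for (const auto &child : m_children) out += child->toXhtml();
        return out + "</" + m_tag + ">";
    }

private:
    static std::string escape(const std::string &s)
    {
        std::string out;
        for (char c : s)
        {
            if (c == '&') out += "&amp;";
            else if (c == '<') out += "&lt;";
            else if (c == '>') out += "&gt;";
            else out += c;
        }
        return out;
    }

    std::string m_tag;
    std::string m_text;
    std::vector<std::unique_ptr<OutputNode>> m_children;
};
```

A caller would build e.g. a table with root.appendChild("tr")->appendChild("td") and serialize it once at the end with root.toXhtml(). A real design would also need attributes and per-format serializers (LaTeX, man, RTF), which this sketch leaves out.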
From: James D. <jam...@gm...> - 2008-03-02 17:50:41
|
On Sun, Mar 2, 2008 at 1:47 AM, Francesco Montorsi <f18...@ya...> wrote:

(lots of good stuff)

> 3) remove deprecated HTML4 attributes (compact, nowrap, some align, etc.)
> from htmlgen.cpp and htmldocvisitor.cpp; turn empty tags from e.g.
> <br> into <br/>; rename the 'name' attribute to 'id'

One trivial point: while "<br/>" is correct, so is "<br />" (with an extra space before the closing "/>"), and the last time I checked (when converting a small website from HTML to XHTML) the latter was supported by a wider range of HTML processors.

--
James
|
From: <do...@ke...> - 2008-03-06 08:37:16
|
On Wed, Mar 05, 2008 at 03:12:49PM +0100, Francesco Montorsi wrote:

> Francesco Montorsi wrote:
> > In conclusion: I need a pause and some help to complete this patch :)
> >
> > What's your (doxygen team) interest in XHTML?
> > Isn't it one of your priorities?
> so, there's no interest in XHTML development?
> No one willing to help me?

I'm willing to help. I'm very behind on the internals of Doxygen, though, so I don't know how much help I can be.

What do you need other people to help with?

Chris C
|
From: Francesco M. <f18...@ya...> - 2008-03-13 15:09:08
|
Hi,

sorry for the big delay but somehow I missed this reply until now.

do...@ke... wrote:
> On Wed, Mar 05, 2008 at 03:12:49PM +0100, Francesco Montorsi wrote:
>
>> Francesco Montorsi wrote:
>>> In conclusion: I need a pause and some help to complete this patch :)
>>>
>>> What's your (doxygen team) interest in XHTML?
>>> Isn't it one of your priorities?
>> so, there's no interest in XHTML development?
>> No one willing to help me?
>
> I'm willing to help.

Great!

> I'm very behind on the internals of Doxygen,
> though, so I don't know how much help I can be.
>
> What do you need other people to help with?

basically with testing the patch and with further fixes to the doxygen sources; in particular, forcing it to generate all the needed </tr> and </td> tags.

Steps to help:

1) apply my patch to doxygen trunk
2) unzip the validator stuff in the examples folder
3) change the #define DBG_HTML(x) macro in htmlgen.cpp so that it expands to "x"
4) compile doxygen; run it on the examples and then run the "./validate_xhtml" script
5) look at the validation file called "log": there are a bunch of errors there which need to be fixed in the corresponding *def.cpp source files.

E.g. the first error I get is:

    define/html/define_8h.html:68: parser error : Opening and ending tag mismatch: tr line 56 and table
    </table>
    ^

You can see in that html file that there is a missing </tr> tag:

    <!-- startMemberDocName -->
    <table class="memname">
    <tr>
    <td class="memname">#define MAX<!-- endMemberDocName -->
    </td>
    <!-- startParameterList -->
    <td>(</td>
    <!-- startFirstParameterType -->
    <td class="paramtype">x, <!-- endParameterType -->
    </td>
    <!-- startParameterType -->
    <tr>
    <td class="paramkey"></td>
    <td></td>
    <td class="paramtype">y<!-- endParameterType -->
    </td>
    <!-- startParameterName -->
    <td class="paramname"><!-- endParameterName -->
    </td>
    <td> ) </td>
    <td> ((x)>(y)?(x):(y))<!-- endParameterList -->
    </td>
    </tr>
    <!-- endMemberDoc -->
    </table>

this is due to the fact that no </tr> is added after startFirstParameterType() and then endParameterType() are called. This is because (AFAIUI) the parameter name should always be emitted after the parameter type, and that is what writes the </tr>. In this case it wasn't called... why?

etc.... to be continued... :)

Thanks for any help!!

Francesco
|
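As a rough illustration of what the parser error above is complaining about, the following C++ sketch keeps a stack of open tags and reports the first closing tag that does not match the innermost open one. It is not the validate_xhtml script from the archive, and firstTagMismatch is a hypothetical name. It assumes simplified markup: no '>' inside attribute values, no CDATA sections, and every '<' in the text starts a tag or a comment.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: returns an empty string if all tags are balanced,
// otherwise a short description of the first problem found.
std::string firstTagMismatch(const std::string &markup)
{
    std::vector<std::string> open;  // names of currently open elements
    std::size_t pos = 0;
    while ((pos = markup.find('<', pos)) != std::string::npos)
    {
        if (markup.compare(pos, 4, "<!--") == 0)  // skip comments entirely
        {
            std::size_t end = markup.find("-->", pos + 4);
            if (end == std::string::npos) return "unterminated comment";
            pos = end + 3;
            continue;
        }
        std::size_t close = markup.find('>', pos);
        if (close == std::string::npos) return "unterminated tag";
        assert(close > pos);  // markup[pos] is '<', so '>' comes later
        std::string body = markup.substr(pos + 1, close - pos - 1);
        pos = close + 1;
        if (body.empty()) return "empty tag";
        if (body[0] == '!' || body[0] == '?') continue;  // doctype, PI
        if (body.back() == '/') continue;                // <br /> and friends

        bool closing = (body[0] == '/');
        std::size_t nameStart = closing ? 1 : 0;
        std::size_t nameEnd = body.find_first_of(" \t\r\n", nameStart);
        std::string name = (nameEnd == std::string::npos)
                               ? body.substr(nameStart)
                               : body.substr(nameStart, nameEnd - nameStart);
        if (!closing)
        {
            open.push_back(name);
            continue;
        }
        if (open.empty()) return "unexpected </" + name + ">";
        if (open.back() != name)
            return "tag mismatch: " + open.back() + " and " + name;
        open.pop_back();
    }
    return open.empty() ? std::string() : "unclosed <" + open.back() + ">";
}
```

Fed a table in which a second <tr> is opened before the first one is closed, as in the define_8h.html excerpt, it reports the same pair of names as the parser does: the innermost open element is still tr when </table> arrives.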