Thread: [Doxygen-develop] Adding of new (all) HTML entities?
From: Xavier O. <xav...@an...> - 2001-07-19 10:37:44
Hi,

Proposal: I would like to add new HTML entities. I am currently interested in Greek letters, but everything defined in HTML 4.0 would be interesting. Three files contain all these entities:

  HTMLlat1.ent.htm     (Doxygen already has some of them)
  HTMLspecial.ent.htm  (Doxygen already has some of them)
  HTMLsymbol.ent.htm

They are available at:
http://www.w3.org/TR/1998/REC-html40-19980424/struct/global.html#idx-entity_sets

Is someone working on that? I don't know exactly what has to be changed. I searched for SharpS, an entity that is already implemented, and found:

  Searching for 'SharpS'...
  C:\doxygen-1.2.8.1\src\doc.cpp(17976):{ outDoc->writeSharpS(); }
  C:\doxygen-1.2.8.1\src\htmlgen.h(177): void writeSharpS() { t << "&szlig;"; }
  C:\doxygen-1.2.8.1\src\latexgen.h(182): void writeSharpS() { t << "\"s"; }
  C:\doxygen-1.2.8.1\src\mangen.h(166): void writeSharpS() { t << "s\\*:"; /* just a wild guess,
  C:\doxygen-1.2.8.1\src\outputgen.h(207): virtual void writeSharpS() = 0;
  C:\doxygen-1.2.8.1\src\outputlist.h(301): void writeSharpS()
  C:\doxygen-1.2.8.1\src\outputlist.h(302): { forall(&OutputGenerator::writeSharpS); }
  C:\doxygen-1.2.8.1\src\rtfgen.h(164): void writeSharpS() { t << "\337"; }
  8 occurrence(s) have been found.

As far as I understand, what is to be done for each new entity is the following:

-1 Add a new pure virtual method named writeEntity() in
   class BaseOutputDocInterface (outputgen.h)
-2 Add a new method
     void writeEntity()
     { forall(&OutputGenerator::writeEntity); }
   in class OutputList (outputlist.h)
-3 Add a non-pure virtual method in the generators:
     void writeEntity() {}
   in classes
     class ManGenerator (mangen.h)
     class LatexGenerator (latexgen.h)
     class HtmlGenerator (htmlgen.h)

Remarks:
I do not really understand what doc.cpp does. I suppose it also has to be changed, but I don't know how. Besides, I know what to do for HTML and LaTeX (maybe not for all symbols, I have to check in my documentation), but I don't know what to do for man.

Questions:
Is the process (with the 3 steps) correct? Is something missing?
As my free time is so precious (as for everybody), is it possible that I only change the files and send them back for merging, compiling and so on?

Greetings,
Xavier.

-- 
Artificial Anthill Project http://www.aanthill.org/ mailto:aan...@aa...
D2SET Non Profit Association http://www.d2set.org/ mailto:d2...@d2...
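
A minimal sketch of the three steps listed in this message, under stated assumptions: the classes below are simplified, hypothetical stand-ins (the real interface class is BaseOutputDocInterface, collapsed here into OutputGenerator), a std::ostringstream replaces Doxygen's own stream `t`, and writeEntity() takes the entity name as a parameter, which the message itself does not specify.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for Doxygen's classes; only the shape
// of the three steps is illustrated.
class OutputGenerator {
public:
    virtual ~OutputGenerator() = default;
    // Step 1: pure virtual method in the common interface.
    virtual void writeEntity(const char *name) = 0;
    std::string output() const { return t.str(); }
protected:
    std::ostringstream t;  // assumption: stands in for Doxygen's text stream
};

// Step 3: each generator supplies its own implementation.
class HtmlGenerator : public OutputGenerator {
public:
    void writeEntity(const char *name) override { t << "&" << name << ";"; }
};

class ManGenerator : public OutputGenerator {
public:
    void writeEntity(const char *) override {}  // left empty for now
};

// Step 2: the list forwards the call to every registered generator.
class OutputList {
public:
    void add(OutputGenerator *g) {
        assert(g != nullptr);
        gens.push_back(g);
    }
    void writeEntity(const char *name) {
        for (OutputGenerator *g : gens) g->writeEntity(name);
    }
private:
    std::vector<OutputGenerator *> gens;
};
```

With an HtmlGenerator and a ManGenerator registered, calling writeEntity("alpha") on the list leaves "&alpha;" in the HTML output and nothing in the man output.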
From: Dimitri v. H. <di...@st...> - 2001-07-23 09:23:33
On Thu, Jul 19, 2001 at 12:37:10PM +0200, Xavier Outhier wrote:
> Hi,
>
> As far as I understand, what is to be done for each new entity
> is the following:
>
> -1 Add a new pure virtual method named writeEntity() in
>    class BaseOutputDocInterface (outputgen.h)
> -2 Add a new method
>      void writeEntity()
>      { forall(&OutputGenerator::writeEntity); }
>    in class OutputList (outputlist.h)
> -3 Add non pure virtual method in generators:
>      void writeEntity() {}
>    in classes
>      class ManGenerator (mangen.h)
>      class LatexGenerator (latexgen.h)
>      class HtmlGenerator (htmlgen.h)
>
> Remarks:
> I do not understand what doc.cpp does really. I suppose it has also
> to be changed but I don't know how.

doc.cpp is a generated file. Look at the flex file doc.l. You'll see lines like these:

    <DocScan,Text>"&"[cC]"cedil;" { outDoc->writeCCedil(yytext[1]); }

These consist of a set of states (between <>), a regular expression, and some code.

The code is executed when the input matches the regular expression. yytext is a char* containing the actual text that was matched. So the above matches &ccedil; and &Ccedil;.

> Besides, I know what to do with HTML and LaTeX (maybe not for
> all symbols, I have to check in my documentation) but I don't know
> what to do with man.

This has always been a mystery to me too.

> Questions:
> Is the process (with the 3 steps) correct? Is something missing?

Seems OK to me.

> As my free time is so precious (as for everybody), is it possible
> that I only change the files and send them back for merging,
> compiling and so on?

Yes, that is fine. You can choose to do only those entities that you need as well.

Regards,
Dimitri
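
To make the role of yytext[1] in the rule quoted in this message concrete, here is a small sketch that mimics the pattern "&"[cC]"cedil;" by hand on an already isolated token. The function names matchesCCedil and cedilCase are hypothetical; in the real scanner flex does the matching and the action only sees yytext.

```cpp
#include <cassert>
#include <cstring>

// Mimics the pattern "&"[cC]"cedil;" on an already isolated token.
bool matchesCCedil(const char *text) {
    assert(text != nullptr);
    return text[0] == '&' &&
           (text[1] == 'c' || text[1] == 'C') &&
           std::strcmp(text + 2, "cedil;") == 0;
}

// What the rule's action passes on: yytext[1] is the second matched
// character, so it tells the generator whether to emit the lowercase
// or the uppercase cedilla.
char cedilCase(const char *yytext) {
    assert(matchesCCedil(yytext));
    return yytext[1];
}
```

Index 0 is the ampersand, so index 1 is the only character that differs between the two entities the rule accepts.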
From: Xavier O. <xav...@an...> - 2001-07-23 13:37:04
Dimitri van Heesch wrote:
> On Thu, Jul 19, 2001 at 12:37:10PM +0200, Xavier Outhier wrote:
> > Hi,
> >
> > As far as I understand, what is to be done for each new entity
> > is the following:
> >
> > -1 Add a new pure virtual method named writeEntity() in
> >    class BaseOutputDocInterface (outputgen.h)
> > -2 Add a new method
> >      void writeEntity()
> >      { forall(&OutputGenerator::writeEntity); }
> >    in class OutputList (outputlist.h)
> > -3 Add non pure virtual method in generators:
> >      void writeEntity() {}
> >    in classes
> >      class ManGenerator (mangen.h)
> >      class LatexGenerator (latexgen.h)
> >      class HtmlGenerator (htmlgen.h)
> >
> > Remarks:
> > I do not understand what doc.cpp does really. I suppose it has also
> > to be changed but I don't know how.
>
> doc.cpp is a generated file. Look at the flex file doc.l: You'll see lines like
> these:
>
> <DocScan,Text>"&"[cC]"cedil;" { outDoc->writeCCedil(yytext[1]); }
>
> these consist of a set of states (between <>), a regular expression,
> and some code.
>
> The code is executed when the input matches the regular expression. yytext
> is a char* containing the actual text that was matched. So the
> above matches &ccedil; and &Ccedil;
>
> > Besides, I know what to do with HTML and LaTeX (maybe not for
> > all symbols, I have to check in my documentation) but I don't know
> > what to do with man.
>
> This has always been a mystery to me too.

The mystery will be thicker now.

> > Questions:
> > Is the process (with the 3 steps) correct? Is something missing?
>
> Seems ok to me.
>
> > As my free time is so precious (as for everybody), is it possible
> > that I only change the files and send them back for merging,
> > compiling and so on?
>
> Yes, that is fine. You can choose to do only those entities that you
> need as well.
[...]

Until the discussion with Petr is over, I will do it. I wonder whether the work of making all entities transparent for Doxygen could not be done in a straightforward way ... at least for HTML output.

I propose the following:

-1 Add a new method in BaseOutputDocInterface
     virtual void writeEntity(const char* const text) = 0;
-2 Add a new method
     void writeEntity()
     { forall(&OutputGenerator::writeEntity); }
   in class OutputList (outputlist.h)
-3 Add a non-pure virtual method in the generators:
     void writeEntity() {}
   in classes
     class ManGenerator (mangen.h)
     class LatexGenerator (latexgen.h)
     class HtmlGenerator (htmlgen.h)

The straightforward part is for HTML only (and later the intermediate XML):

    void HtmlGenerator::writeEntity(const char* const text)
    { t << "&" << text << ";" ;}

doc.l would be modified and these lines added:

    <DocScan,Text>"&"[a-zA-Z0-9]{1}";" { outDoc->writeEntity(yytext[1]); }
    <DocScan,Text>"&"[a-zA-Z0-9]{2}";" { outDoc->writeEntity(yytext[2]); }
    <DocScan,Text>"&"[a-zA-Z0-9]{3}";" { outDoc->writeEntity(yytext[3]); }
    <DocScan,Text>"&"[a-zA-Z0-9]{4}";" { outDoc->writeEntity(yytext[4]); }

I have several questions...:

-Q1: Could the following single line replace the 4 lines?
       <DocScan,Text>"&"[a-zA-Z0-9]{1,4}";" { outDoc->writeEntity(yytext[4]); }
     (I'm very lazy! :-)
-Q2: There is a theoretical problem: what is the order of scanning? To be
     backward compatible, the added regex should be checked only if nothing
     else has already matched.
-Q3: Would that be all there is to do if Doxygen later uses XML (DocBook or
     other)? Maybe we would have to remove all the single entities already
     added?

... and some remarks:

- R1: If the matching order is configurable, then this would solve all HTML
      entities (and SGML, sorry, I'm really not a specialist in this domain),
      wouldn't it?
- R2: For LaTeX, RTF and man (but not XML), more work has to be done. Is a
      big switch reasonable in this case? There would be 255 cases if all
      entities defined in HTML 4.0 are used. But if the current discussion on
      XML leads to a runnable version in less than 2 or 3 months, it may not
      be useful to do that work. If it is for the longer term (I think it
      will be), then I could begin to do that at least for LaTeX.
- R3: If the "one line" solution is not possible, I will only do the changes
      for Greek letters (HTML and LaTeX), which could also be useful for
      others.

While waiting for the answers to all my questions, I will work a little for myself. :-)

Xavier.
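
As an alternative to the "big switch" raised in R2, the LaTeX side could be driven by a lookup table. The sketch below is only an illustration: the function name latexForEntity is hypothetical, the table holds three Greek letters rather than the full HTML 4.0 set, and it assumes the output may switch to math mode with `$...$`.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical helper: maps an HTML entity name (without '&' and ';') to
// LaTeX markup. Only a few Greek letters are listed; unknown names return
// an empty string so the caller can decide on a fallback.
std::string latexForEntity(const std::string &name) {
    static const std::map<std::string, std::string> table = {
        {"alpha", "$\\alpha$"},
        {"beta",  "$\\beta$"},
        {"gamma", "$\\gamma$"},
    };
    assert(!name.empty());
    const auto it = table.find(name);
    return it != table.end() ? it->second : std::string();
}
```

Adding an entity then means adding one table row per output format instead of one new virtual method per entity.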
From: John O. <jo...@ly...> - 2001-07-23 15:16:48
> doc.l would be modify and this line added:
> <DocScan,Text>"&"[a-zA-Z0-9]{1}";" { outDoc->writeEntity(yytext[1]); }
> <DocScan,Text>"&"[a-zA-Z0-9]{2}";" { outDoc->writeEntity(yytext[2]); }
> <DocScan,Text>"&"[a-zA-Z0-9]{3}";" { outDoc->writeEntity(yytext[3]); }
> <DocScan,Text>"&"[a-zA-Z0-9]{4}";" { outDoc->writeEntity(yytext[4]); }
>
> I have several questions...:
>
> -Q1: Are the following single line could replace the 4 lines?
> <DocScan,Text>"&"[a-zA-Z0-9]{1,4}";" { outDoc->writeEntity(yytext[4]); }
> (I'm very lazy! :-)
>

Why do you have yytext[1] etc. in the above example and only yytext[4] in your suggested replacement?

> -Q2: There is a theorical problem: what is the order of scanning? to be
> backward compatible, the added regex should be check if nothing
> else has already matched.
>

If you read the flex manual page everything will be explained. ;)

The parser generated by flex picks the longest matching rule it can find among all active rules. One way for a rule to become active is to call for it explicitly, i.e. BEGIN(DocScan) will activate all rules with DocScan in them.

BTW, I have now successfully implemented my tags so that I can write

    @run foo @filename bar gazonk @endrun

*and*

    @file @run foo @filename bar gazonk @endrun

I had to modify (among other things) the function findFileDef found in util.cpp so that it recognizes ClearCase version numbers. The command

    @run cleartool ls @filename @endrun

would result in something like this

    /view/r10_epkjols_garp/vobs/bscrp/r10_new_trh/foo.c@@/main/r10_predev/CHECKEDOUT from /main/r10_predev/1 Rule: CHECKEDOUT

being returned from cleartool (in the case when a file is checked out). My changes make the findFileDef function aware of the @-signs in the string, and when comparing against already found files it ignores everything after the first @-sign (including the sign itself).

I have also added a new method to the FileDef class called setVersion(), which stores a QCString containing version number information about the file, i.e. in my case everything after the @-signs, including the signs themselves. Then in the writeDocumentation() method in FileDef I modified the title etc. to use my getVersionName() method, which returns the filename combined with the version information.

I have also modified the layout of the generated HTML pages so the result looks more like what Java2 JavaDoc outputs. The reason for doing so was simply that when you have code containing several pages of #define statements and (in the same file) several pages of functions, it is very difficult to see where the different sections start and end. IMHO the Java2 way of solving this, using a light blue background for the headings, solves this problem nicely.

When doing these changes to the layout I found out that the pages containing the lists of todo:s, bug:s and test:s use a different method to generate the headings. They use startSection() instead of startTitle(). Is this a bug, and if not, what is the design reason for doing it that way?

As a result of changing the layout, I quickly came to the conclusion that the layout of the generated files should not be hardcoded. Instead I think that a user should be able to configure this via external file(s). One possibility would be to have the document layout of the different kinds of pages defined using XML. That is (for example), instead of having configuration options for which of the different dot-diagrams I want to have, I modify an XML file.

Just to give you an idea of what I am after, I'll give you an example of what such a file could look like:

    <sourcecode>
      <tagDefinitions>
        <define heading>
          <before data="<table border=1 cellpadding=3 cellspacing=0 width=100%>
                        <tr bgcolor=#CCCCFF>
                        <td colspan=1><font size=+2><b>"/>
          <after data=" </b></font></td>
                       </tr>
                       </table><br>"/>
        </define>
        <define include>
          <before data="<code>"/>
          <after data="</code><br>"/>
        </define>
        etc.
      </tagDefinitions>
      <layout>
        <tag name="heading">
        <tag name="include">
        etc.
      </layout>
    </sourcecode>

If a user wants the include section repeated in the detailed section, there is no need to modify the source code, just the layout section in the XML file. The same goes if you want your headings in some other way than the default provided, etc.

Of course you have to have different layouts for the different targets (HTML, LaTeX, man, RTF, ...), but you can never avoid that. The question is whether that information should be hardcoded or not...

/John
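
The ClearCase handling described in this message (ignore everything from the first @-sign when comparing file names, keep it as version information) could look roughly like the sketch below. It is not the actual findFileDef change: VersionedName and splitVersion are hypothetical names, and std::string is used instead of QCString.

```cpp
#include <cassert>
#include <string>

// Hypothetical result type: the plain file name plus the ClearCase suffix.
struct VersionedName {
    std::string name;     // part used when comparing against known files
    std::string version;  // everything from the first '@' on, signs included
};

// Splits a ClearCase extended path name at the first '@'. A path without
// any '@' is returned unchanged with an empty version.
VersionedName splitVersion(const std::string &path) {
    assert(!path.empty());
    const std::string::size_type at = path.find('@');
    if (at == std::string::npos) return {path, std::string()};
    return {path.substr(0, at), path.substr(at)};
}
```

For "foo.c@@/main/r10_predev/1" this yields the name "foo.c" and the version "@@/main/r10_predev/1".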
From: Xavier O. <xav...@an...> - 2001-07-24 08:29:02
John Olsson wrote:
> > doc.l would be modify and this line added:
> > <DocScan,Text>"&"[a-zA-Z0-9]{1}";" { outDoc->writeEntity(yytext[1]); }
> > <DocScan,Text>"&"[a-zA-Z0-9]{2}";" { outDoc->writeEntity(yytext[2]); }
> > <DocScan,Text>"&"[a-zA-Z0-9]{3}";" { outDoc->writeEntity(yytext[3]); }
> > <DocScan,Text>"&"[a-zA-Z0-9]{4}";" { outDoc->writeEntity(yytext[4]); }
> >
> > I have several questions...:
> >
> > -Q1: Are the following single line could replace the 4 lines?
> > <DocScan,Text>"&"[a-zA-Z0-9]{1,4}";" { outDoc->writeEntity(yytext[4]); }
> > (I'm very lazy! :-)
> >
> Why do you have yytext[1] etc. in the above example and only yytext[4] in
> your suggested replacement?

-1 In the regex, there is [a-zA-Z0-9]{$} where $ is 1, 2, 3 or 4.
   I think this means "repeat an alphanumeric character $ times".
   (I'm using the regular expression syntax of Python 1.5; I hope
   those are the same rules Doxygen uses.)
-2 I had supposed that yytext[1] means one character, yytext[2] two, ...
   But it means the second character, the third and so on.

It was also not correct because &alpha; has more than 4 characters! I had in mind numeric entities like &#1245;. Well ...

So now, I think that it would be more correct to have:

    <DocScan,Text>"&"[a-zA-Z0-9]{1,}";" { outDoc->writeEntity(yytext,strlen(yytext)); }

    // text is advanced in the loop, so the pointer itself must not be const
    void HtmlGenerator::writeEntity(const char* text, int length)
    {
      if (text)
      {
        char c;
        int textlength = length;
        while (textlength)
        {
          c=*text++;
          t << c;
          textlength--;
        }
      }
    }

> > -Q2: There is a theorical problem: what is the order of scanning? to be
> > backward compatible, the added regex should be check if nothing
> > else has already matched.
> >
> If you read the flex manual page everything will be explained. ;)
>
> The parser generated by flex picks the longest matching rule it can find
> of all active rules. One way for a rule to become active is to call for it
> explicitly, i.e. BEGIN(DocScan) will activate all rules with DocScan in
> them.

By longest matching, you certainly mean the most restrictive, the most precise. So I think this will work. I will provide changes soon.

Sorry for the newbie questions. I just wanted to help a little (for my needs first); I'm simply the maintainer of the French localization of Doxygen. :-)

[...]

Thanks,
Xavier.

-- 
Artificial Anthill Project http://www.aanthill.org/ mailto:aan...@aa...
D2SET Non Profit Association http://www.d2set.org/ mailto:d2...@d2...
From: John O. <jo...@ly...> - 2001-07-24 09:01:25
> -1 In the regex, there is [a-zA-Z0-9]{$} where $ is 1,2,3 or 4.
> I think this means repeat $ times an alphanumeric value.
> (I'm using Regular Expression syntax of Python 1.5, I hope
> that's the sames rules Doxygen uses).
>

It is the regexp rules of flex you are using. Read the man page for flex for the full documentation of the available regexp syntax. Here is a short snippet:

PATTERNS
    The patterns in the input are written using an extended set of
    regular expressions. These are:

    x            match the character 'x'
    .            any character (byte) except newline
    [xyz]        a "character class"; in this case, the pattern
                 matches either an 'x', a 'y', or a 'z'
    [abj-oZ]     a "character class" with a range in it; matches
                 an 'a', a 'b', any letter from 'j' through 'o',
                 or a 'Z'
    [^A-Z]       a "negated character class", i.e., any character
                 but those in the class. In this case, any
                 character EXCEPT an uppercase letter.
    [^A-Z\n]     any character EXCEPT an uppercase letter or
                 a newline
    r*           zero or more r's, where r is any regular expression
    r+           one or more r's
    r?           zero or one r's (that is, "an optional r")
    r{2,5}       anywhere from two to five r's
    r{2,}        two or more r's
    r{4}         exactly 4 r's
    {name}       the expansion of the "name" definition
                 (see above)
    "[xyz]\"foo" the literal string: [xyz]"foo
    \X           if X is an 'a', 'b', 'f', 'n', 'r', 't', or 'v',
                 then the ANSI-C interpretation of \x.
                 Otherwise, a literal 'X' (used to escape
                 operators such as '*')
    \0           a NUL character (ASCII code 0)
    \123         the character with octal value 123
    \x2a         the character with hexadecimal value 2a
    (r)          match an r; parentheses are used to override
                 precedence (see below)

(this is only a part of the regular expressions available)

But I think that you can use the {$} syntax, if I understand the above text correctly. :)

> By longest matching, you certainly means the more restrictive, the more
> precise. So I think this will work.I will provide changes soon.
>

Yes. :)

> Sorry for newbie questions. I just wanted to help a little (for my needs
> first), I'm simply maintainer of the French localization of Doxygen. :-)
> [...]
>

Aren't we all newbies? ;)

If you add the '-d' switch to flex when it generates the C source code, it will insert debug printouts, giving verbose information about what happens when it parses the text.

Also, the default behaviour of flex is to print everything it can't match to stdout (this is the default rule, which is always active). However, the flex code in doc.l (and scanner.l) defines a rule which overrides the default rule and silently ignores text it can't match. If you modify the "match-all" rule so it looks like this

    <*>. { ECHO; }

you will "restore" the default behaviour (I think).

/John
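
On the question of which rule wins: flex takes the rule that matches the most input text, and when two rules match text of the same length, the one listed first in the .l file is chosen. A generic entity rule therefore matches "&ccedil;" with exactly the same length as the existing specific rule, and only its position in the file decides the outcome. The sketch below simulates that selection with hand-written matchers; Rule, pickRule and the matcher functions are hypothetical and not part of flex or Doxygen.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// A rule reports how many characters it matches at the start of the input
// (0 means no match).
struct Rule {
    std::string name;
    std::size_t (*match)(const char *);
};

// Specific rule: "&"[cC]"cedil;"
std::size_t matchCCedil(const char *s) {
    if (s[0] == '&' && (s[1] == 'c' || s[1] == 'C') &&
        std::strncmp(s + 2, "cedil;", 6) == 0) {
        return 8;
    }
    return 0;
}

// Generic rule: "&"[a-zA-Z0-9]+";"
std::size_t matchEntity(const char *s) {
    if (s[0] != '&') return 0;
    std::size_t i = 1;
    while (std::isalnum(static_cast<unsigned char>(s[i]))) ++i;
    return (i > 1 && s[i] == ';') ? i + 1 : 0;
}

// Catch-all rule: any single character
std::size_t matchAny(const char *s) { return s[0] != '\0' ? 1 : 0; }

// Longest match wins; the strict '>' keeps the earlier rule on a tie.
std::string pickRule(const std::vector<Rule> &rules, const char *input) {
    assert(input != nullptr);
    std::size_t best = 0;
    std::string winner;
    for (const Rule &r : rules) {
        const std::size_t n = r.match(input);
        if (n > best) {
            best = n;
            winner = r.name;
        }
    }
    return winner;
}
```

With the specific rule listed first, "&ccedil;" still reaches writeCCedil() and everything else falls through to the generic rule; with the order swapped, the generic rule would shadow the specific one.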
From: Xavier O. <xav...@an...> - 2001-07-24 14:02:32
John Olsson wrote:
> [...]
> > Sorry for newbie questions. I just wanted to help a little (for my needs
> > first), I'm simply maintainer of the French localization of Doxygen. :-)
> > [...]
> >
> Aren't we all newbies? ;)

I think I'm the newest one. ;-)

> [...]
> Also, the default behaviour of flex is to print everything it can't match
> to stdout (this is the default rule which is always active). However, the
> flex code in doc.l (and scanner.l) has defined a rule which overrides the
> default rule which silently ignores text it can't match. If you modify the
> "match-all" rule so it looks like this
>
> <*>. { ECHO; }
>
> you will "restore" the default behaviour (I think).

I don't know the reason why it is done that way. I'm not sure 'restoring' would be a good idea. I prefer to restrict the change to entities of the form &entity;.

So here are my changes to the sources. The modified files are:

  outputgen.h   // new method
                //   writeEntity(const char* const entity) = 0;
                // or
                //   writeEntity(const char* const entity, int length) = 0;
                // depending on whether char* yytext has a trailing \0 or
                // not, cf. my PS.
  outputlist.h
  mangen.h      // empty implementation
  latexgen.h    // empty implementation
  rtfgen.h      // empty implementation
  htmlgen.h     // transparent entities for HTML
  doc.l         // new match for all entities

In all files writeEntity() comes just after writeCCedil().

Sorry, I have no time to compile and so on (I'm not under Linux :( ). Tell me if there is any bug.

I will try to implement the method for LatexGenerator if I have free time. In any case, I will not implement anything for ManGenerator and RTFGenerator.

Greetings,
Xavier.

PS: Because I don't know if yytext ends with '\0', I include 2 zip files:

-1 transparent-SGML-entities-no length parameter.zip
   To be used if yytext does _not_ include a trailing '\0'.
   strlen(yytext) cannot be used.
-2 transparent-SGML-entities-no length parameter.zip
   To be used if yytext does include a trailing '\0'.
   strlen(yytext) can be used.

OK, I could read the flex doc, but this is faster. I hope you will not quarrel with me about that. :)
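
On the open question in the PS: the flex manual documents yytext as a NUL-terminated string and provides its length in yyleng, so strlen(yytext) is usable and the matched text can be streamed in one go. The sketch below assumes a generic std::ostream in place of Doxygen's stream `t`; the free function writeEntity and its parameters merely mirror the flex names and are not the code from the attached zip files.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <ostream>
#include <sstream>

// Writes a matched entity such as "&alpha;" through unchanged.
// The parameter names mirror flex's yytext and yyleng.
void writeEntity(std::ostream &t, const char *yytext, std::size_t yyleng) {
    assert(yytext != nullptr);
    // Holds because flex NUL-terminates yytext and the matched entity
    // text contains no embedded NUL characters.
    assert(std::strlen(yytext) == yyleng);
    t << yytext;
}
```

Given that, the variant without the explicit length parameter is sufficient, and the character-by-character loop is not needed.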