Re: [Doxygen-develop] [BUG] Tag File Parsing Issue
From: Dimitri V. H. <do...@gm...> - 2012-03-20 19:28:45
Hi Robert,

On Mar 18, 2012, at 2:10, Robert Abel wrote:

> Hi,
>
> I'm currently using tag files to stitch multiple documentation files (for multiple programming languages) together.
>
> There seem to be two bugs related to tag/xml files:
>
> 1. Tag files are statically produced with encoding='ISO-8859-1' (doxygen.cpp ll.10604).
>    Yet there is not one instance of a conversion function used that I could find that would actually convert any tag file output from the source input file encoding given using INPUT_ENCODING. That doesn't seem right!

Indeed, all internal strings inside doxygen are UTF-8 encoded, so the tag files are as well. I'll correct this.

> 2. There is a bug in tagreader.cpp. Basically, QXmlSimpleReader (qxml.cpp) will read any XML input file according to its encoding stated in each file. However, the TagFileParser handler in tagreader.cpp will store all incoming QString (16bit) strings inside m_curString which is a QCString (8bit) inside bool characters (tagreader.cpp ll. 789).
>    This effectively annihilates the correctly parsed XML source encoding when curString is assigned to different information entities, e.g. when assigning group titles in ll. 664. While I'm not 100% sure what happens at this implicit conversion, I reckon the QString will be using the thread locale to convert the QCString back to 16bit, thus resulting in gibberish when thread locale and XML encoding mismatch.
>
> As a quick fix for 2.), I changed the declaration of m_curString to QString so no conversions take place (but there may be memory overhead wrt explicit/implicit sharing I read?). I didn't notice any immediate problems with this hack.

I plan to remove the implicit conversion from QString to QCString and use an explicit .utf8() everywhere.

> As a fix for 1.) I propose to either actually convert to ISO-8859-1 from the INPUT_ENCODING, or just declare the XML file to be encoded using INPUT_ENCODING. The latter would be simplest and cleanest, IMHO.
>
> Also, please notice that 2.) cannot just be fixed by fixing 1.), since tag files might be produced by "3rd party" software using any encoding they wish.

No, I would state that doxygen requires tag files to be UTF-8 encoded, rather than supporting arbitrary encodings or depending on INPUT_ENCODING. Tag files are files to link different projects, which could have different INPUT_ENCODING settings.

> The quick and dirty fix for 2.) I did would probably need some revisitation by someone who knows about the memory overhead/sharing capabilities involved and can decide on a proper course of action. (Which is why I'm hesitant to post a [one-line] patch...)

Will do. I am trying to get rid of QString as much as possible.

Regards,
  Dimitri
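
The encoding mismatch discussed in this thread can be illustrated without Qt. The following is a minimal sketch, not doxygen code: it assumes std::u16string as a stand-in for a 16-bit QString and std::string as a stand-in for an 8-bit QCString, and the helper names toUtf8, toLatin1 and fromLatin1 are hypothetical. It shows that the same string narrows to different bytes under UTF-8 and ISO-8859-1, and that widening UTF-8 bytes as if they were ISO-8859-1 does not give back the original text.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encode a UTF-16 string as UTF-8.
// Sketch only: handles the Basic Multilingual Plane, no surrogate pairs.
std::string toUtf8(const std::u16string &in)
{
    std::string out;
    for (char16_t c : in)
    {
        assert(c < 0xD800 || c > 0xDFFF); // surrogates are not handled here
        if (c < 0x80)
        {
            out += static_cast<char>(c);                        // 1 byte
        }
        else if (c < 0x800)
        {
            out += static_cast<char>(0xC0 | (c >> 6));          // 2 bytes
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
        else
        {
            out += static_cast<char>(0xE0 | (c >> 12));         // 3 bytes
            out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}

// Hypothetical helper: narrow to ISO-8859-1, replacing anything
// outside U+0000..U+00FF with '?'.
std::string toLatin1(const std::u16string &in)
{
    std::string out;
    for (char16_t c : in)
        out += (c < 0x100) ? static_cast<char>(c) : '?';
    return out;
}

// Hypothetical helper: widen 8-bit data as if it were ISO-8859-1,
// i.e. every byte becomes exactly one 16-bit code unit.
std::u16string fromLatin1(const std::string &in)
{
    std::u16string out;
    for (unsigned char b : in)
        out += static_cast<char16_t>(b);
    return out;
}
```

For a title such as u"Gr\u00FC\u00DFe" ("Grüße"), toUtf8 produces 7 bytes and toLatin1 produces 5. Passing the UTF-8 bytes through fromLatin1 yields 7 code units that no longer equal the original string, which is the kind of gibberish described above when the 8-bit encoding and the assumed encoding disagree.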