Extended ASCII characters (illegal characters) in doc comments

Forum: doxygen-users

Creator: Jonathan Sachs

Created: 2016-05-03

Updated: 2016-05-03

Jonathan Sachs - 2016-05-03

With GENERATE_XML=YES, doxygen (v1.8.7) generates XML files that begin with this XML declaration:

<?xml version='1.0' encoding='UTF-8' standalone='no'?>

However, it passes extended ASCII characters, such as the copyright symbol (byte 0xA9), through unchanged. A lone 0xA9 byte is not a valid UTF-8 sequence, so the resulting file is not well-formed XML.

Many XML validators and other XML consumers do not catch this error, but some do. One example is Notepad++, which gives an error like this when one tries to edit and save the XML file:

XML Parsing error at line 11:
Input is not proper UTF-8, indicate encoding !
Bytes: 0xA9 0x3C 0x73 0x70

Another example is Python's XML API, xml.etree.ElementTree, which raises a ParseError exception when it tries to parse the file, with a message like this:

not well-formed (invalid token): line 11, column 73
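
For anyone who wants to reproduce the Python failure without running doxygen, here is a minimal sketch. The byte string is a hypothetical stand-in for a generated file, not actual doxygen output: it declares UTF-8 but contains a raw 0xA9 byte. The helper name try_parse is also just for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a doxygen XML file: the declaration says UTF-8,
# but the body contains a raw 0xA9 byte (copyright sign in ISO-8859-1).
bad_xml = (
    b"<?xml version='1.0' encoding='UTF-8' standalone='no'?>\n"
    b"<para>\xa9<sp/>2016</para>"
)

def try_parse(data):
    """Return None if data parses, otherwise the ParseError message."""
    try:
        ET.fromstring(data)
        return None
    except ET.ParseError as exc:
        return str(exc)

error = try_parse(bad_xml)
print(error)
```

The same input with the copyright sign encoded as the two bytes 0xC2 0xA9 parses without error, which suggests the problem is the encoding of the byte rather than the character itself.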

Is there a fix or workaround for this problem? I can't control what the consumer application does with the invalid characters, so I must either prevent doxygen from emitting them or write a separate application to filter them out.
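
To make the "separate filter" option concrete, here is a rough sketch of what such a post-processing step might look like in Python. It assumes that every byte that is not valid UTF-8 was meant as ISO-8859-1 (Latin-1), which may not hold for every source file. The names repair_to_utf8 and latin1_fallback are made up for this example and are not part of doxygen or the standard library.

```python
import codecs
import xml.etree.ElementTree as ET

def _latin1_fallback(err):
    """Decode-error handler: reinterpret the offending bytes as Latin-1."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    # Return the replacement text and the position at which to resume decoding.
    return bad.decode("latin-1"), err.end

codecs.register_error("latin1_fallback", _latin1_fallback)

def repair_to_utf8(raw):
    """Return raw re-encoded as valid UTF-8.

    Valid UTF-8 sequences pass through unchanged; stray bytes are
    assumed to be Latin-1 (an assumption, not something doxygen states).
    """
    return raw.decode("utf-8", errors="latin1_fallback").encode("utf-8")

# Hypothetical sample resembling the failing input.
sample = b"<?xml version='1.0' encoding='UTF-8'?>\n<para>\xa9 2016</para>"
root = ET.fromstring(repair_to_utf8(sample))
print(root.text)
```

Handling only the invalid bytes, rather than decoding the whole file as Latin-1, keeps any text that is already correct UTF-8 intact. In practice the function would be applied to the bytes of each generated .xml file before handing it to the consumer.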

Last edit: Jonathan Sachs 2016-05-03
