When PMD's XML report quotes a string literal from the source, it can sometimes emit character references that are illegal in XML, e.g.
<violation beginline="177" endline="177" begincolumn="44" endcolumn="71" rule="AvoidDuplicateLiterals" ruleset="String and StringBuffer" package="org.apache.lucene.analysis.core" class="TestAnalyzers" method="testLowerCaseTokenizer" externalInfoUrl="https://pmd.github.io/pmd-5.3.3/pmd-java/rules/java/strings.html#AvoidDuplicateLiterals" priority="3">
The String literal "Tokenizer ��test" appears 4 times in this file; the first occurrence is on line 177
</violation>
The issue in the above block is character references like &#xD801; (see http://stackoverflow.com/a/5110103/574815). When an XML parser (e.g. Java's javax.xml.stream) reaches this token, it throws an exception like
javax.xml.stream.XMLStreamException: ParseError at [row,col]:[230,44]
Message: Character reference "&#
at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(Unknown Source) ~[na:1.8.0_20]
at com.sun.xml.internal.stream.XMLEventReaderImpl.nextEvent(Unknown Source) ~[na:1.8.0_20]
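It seems the issue is not invalid XML syntax, but an invalid XML character. U+D801 is a surrogate code point, and XML 1.0 excludes the whole surrogate range (#xD800 to #xDFFF) from its legal characters, so a character reference to it is rejected even though the reference itself is syntactically well formed.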
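To make the point above concrete, here is a minimal sketch that checks a code point against the Char production of XML 1.0 and then confirms the behaviour with the JDK's StAX parser. The class name XmlCharCheck and the methods isValidXmlChar and parses are hypothetical names chosen for illustration; they are not part of PMD.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration only; not PMD code.
final class XmlCharCheck {

    // XML 1.0 Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isValidXmlChar(int codePoint) {
        return codePoint == 0x9
                || codePoint == 0xA
                || codePoint == 0xD
                || (codePoint >= 0x20 && codePoint <= 0xD7FF)
                || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
                || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    // Returns true if the StAX reader consumes the whole document without error.
    static boolean parses(String xml) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                reader.next();
            }
            reader.close();
            return true;
        } catch (XMLStreamException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidXmlChar(0xD801));          // surrogate code point
        System.out.println(parses("<v>&#xD801;</v>"));       // reference to a surrogate
        System.out.println(parses("<v>&#x1041C;</v>"));      // reference to a supplementary character
    }
}
```

The second parse call is there because a supplementary character can still be written legally as a single reference to its full code point; only references to the individual surrogate halves are rejected.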
I encountered this while analyzing the Apache Lucene codebase; the violation reported above came from this file: https://github.com/apache/lucene-solr/blob/trunk/lucene/analysis/common/src/test/org/apache/lucene/analysis/core/TestAnalyzers.java
I'll also attach a copy of that file, and a copy of the XML output from analyzing just that file, for convenience.
I'm not exactly sure what should be done about this. Maybe the offending characters should be redacted from the output. Possibly better would be to represent characters that would be invalid in XML as their Java escape sequences, i.e. &#xD801; would instead be written as \uD801
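The second suggestion could look roughly like the sketch below. This is only an illustration of the idea, not the change PMD actually made; the class XmlSafeText and its method escapeInvalidXmlChars are made-up names. It assumes the usual escaping of &, < and > is handled separately by whatever writes the XML.

```java
// Hypothetical sketch of the proposal; not the actual PMD fix.
final class XmlSafeText {

    // XML 1.0 Char production, repeated here so the example is self-contained.
    private static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Replaces every code point that XML 1.0 forbids with a Java-style
    // backslash-u escape, leaving all legal characters untouched.
    static String escapeInvalidXmlChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            // codePointAt joins a well-formed surrogate pair into one code
            // point and returns a lone surrogate unchanged.
            int cp = text.codePointAt(i);
            if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            } else {
                // Every invalid code point fits in four hex digits.
                out.append(String.format("\\u%04X", cp));
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String literal = "Tokenizer " + (char) 0xD801 + "test";
        System.out.println(escapeInvalidXmlChars(literal));
    }
}
```

Working on code points rather than chars means a correctly paired surrogate is kept as the single supplementary character it encodes, and only unpaired surrogates and other forbidden characters are rewritten.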
This will be fixed with PMD 5.3.6 and later.
Commit: https://github.com/pmd/pmd/commit/3393507082938c28f62d1e08cc2e39092ff277df