#1425 [core] XMLRenderer: Invalid XML Characters in Output

PMD-5.3.6
closed
core (10)
PMD
4-Minor
Bug
5.3.3
AvoidDuplicateLiterals
2016-06-25
2015-10-08
No

When PMD's XML output includes a reference to a string in the source, it can sometimes produce values that are illegal in XML, e.g.

<violation beginline="177" endline="177" begincolumn="44" endcolumn="71" rule="AvoidDuplicateLiterals" ruleset="String and StringBuffer" package="org.apache.lucene.analysis.core" class="TestAnalyzers" method="testLowerCaseTokenizer" externalInfoUrl="https://pmd.github.io/pmd-5.3.3/pmd-java/rules/java/strings.html#AvoidDuplicateLiterals" priority="3">
The String literal &quot;Tokenizer &#xd801;&#xdc1c;test&quot; appears 4 times in this file; the first occurrence is on line 177
</violation>

The problem in the block above is the character references such as &#xd801; (see http://stackoverflow.com/a/5110103/574815). When an XML parser (e.g. Java's javax.xml.stream) reaches this token, it throws an exception like:

javax.xml.stream.XMLStreamException: ParseError at [row,col]:[230,44]
Message: Character reference "&#
        at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(Unknown Source) ~[na:1.8.0_20]
        at com.sun.xml.internal.stream.XMLEventReaderImpl.nextEvent(Unknown Source) ~[na:1.8.0_20]
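
For anyone who wants to reproduce this without the full report, a minimal sketch along these lines should trigger the same kind of failure with the JDK's StAX parser. The class and method names (SurrogateReferenceRepro, parses) are made up for illustration, and the element content is shortened from the violation above.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of PMD: feeds a document to a StAX reader
// and reports whether it could be read to the end.
final class SurrogateReferenceRepro {

    /** Returns true if the StAX parser reads the whole document without error. */
    static boolean parses(String xml) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                reader.next(); // throws XMLStreamException on a fatal parse error
            }
            reader.close();
            return true;
        } catch (XMLStreamException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Plain text content: expected to parse.
        System.out.println(parses("<violation>Tokenizer test</violation>"));
        // Same content with the two surrogate character references from the report.
        System.out.println(parses("<violation>Tokenizer &#xd801;&#xdc1c;test</violation>"));
    }
}
```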

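It seems the issue is not malformed XML syntax but an invalid XML character: 0xD801 is a UTF-16 surrogate, and XML 1.0 excludes the surrogate range 0xD800-0xDFFF from its allowed characters, so a character reference to it is rejected.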
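
To make the range concrete, here is a small sketch of a validity check based on the Char production of XML 1.0. The class and method names (XmlChars, isValidXml10Char) are hypothetical and not taken from PMD. It also shows that the two references from the report are the halves of one surrogate pair, which combine into a single code point that XML does allow.

```java
// Hypothetical helper, not part of PMD.
final class XmlChars {

    /** True if codePoint is allowed by the XML 1.0 "Char" production. */
    static boolean isValidXml10Char(int codePoint) {
        return codePoint == 0x9
            || codePoint == 0xA
            || codePoint == 0xD
            || (codePoint >= 0x20 && codePoint <= 0xD7FF)
            || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
            || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    public static void main(String[] args) {
        char high = (char) 0xD801;
        char low = (char) 0xDC1C;

        // Each half on its own is a surrogate and therefore not a legal XML character.
        System.out.println(isValidXml10Char(high));
        System.out.println(isValidXml10Char(low));

        // Combined, the pair denotes one supplementary code point.
        int combined = Character.toCodePoint(high, low);
        System.out.println(Integer.toHexString(combined) + " " + isValidXml10Char(combined));
    }
}
```

If that reading is right, a single reference to the combined code point would be legal where the two separate surrogate references are not.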

I encountered this while analyzing the Apache Lucene codebase; the violation reported above came from this file: https://github.com/apache/lucene-solr/blob/trunk/lucene/analysis/common/src/test/org/apache/lucene/analysis/core/TestAnalyzers.java

I'll also attach a copy of that file, and a copy of the XML output from analyzing just that file, for convenience.

I'm not exactly sure what should be done about this. Maybe the characters should be redacted from the output. Possibly better would be to represent characters that would be invalid in XML as their Java escape sequences, i.e. &#xD801; would instead be written as \uD801.
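
A rough sketch of the second idea, purely as an illustration of the suggestion and not a description of how PMD's renderer works or was fixed. The names (XmlTextSanitizer, escapeInvalidXmlChars) are hypothetical. It walks the text by code point, so a well-formed surrogate pair is kept as one character and only code points that XML 1.0 disallows are rewritten as a Java-style escape.

```java
// Hypothetical helper, not part of PMD.
final class XmlTextSanitizer {

    /** True if codePoint is allowed by the XML 1.0 "Char" production. */
    static boolean isValidXml10Char(int codePoint) {
        return codePoint == 0x9
            || codePoint == 0xA
            || codePoint == 0xD
            || (codePoint >= 0x20 && codePoint <= 0xD7FF)
            || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
            || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    /** Replaces every code point that is illegal in XML 1.0 with a Java-style escape. */
    static String escapeInvalidXmlChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            if (isValidXml10Char(codePoint)) {
                out.appendCodePoint(codePoint);
            } else {
                // Illegal code points all lie below 0x10000, so four hex digits suffice.
                out.append(String.format("\\u%04X", codePoint));
            }
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // A lone high surrogate, built with a cast to keep the source ASCII-only.
        String literal = "Tokenizer " + (char) 0xD801 + "test";
        System.out.println(escapeInvalidXmlChars(literal));
    }
}
```

The result would still need the usual escaping of &, < and quotes before being written into the report.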

2 Attachments

Discussion

  • Andreas Dangel - 2015-10-16
    • status: open --> closed
    • assigned_to: Andreas Dangel
    • Milestone: New Tickets --> PMD-5.3.6
     
  • Andreas Dangel - 2016-06-25
    • labels: --> core
    • summary: Invalid XML Characters in Output --> [core] XMLRenderer: Invalid XML Characters in Output
     
