#1425 [core] XMLRenderer: Invalid XML Characters in Output

Milestone: PMD-5.3.6

Status: closed

Owner: Andreas Dangel

Labels: core

Module: PMD

Priority: 4-Minor

Type: Bug

Affects version: 5.3.3

Ruleset / Rule: AvoidDuplicateLiterals

Updated: 2016-06-25

Created: 2015-10-08

Creator: Dylan Halperin

Private: No

When PMD's XML output includes a reference to a string in the source, it can sometimes produce values that are illegal in XML, e.g.

<violation beginline="177" endline="177" begincolumn="44" endcolumn="71" rule="AvoidDuplicateLiterals" ruleset="String and StringBuffer" package="org.apache.lucene.analysis.core" class="TestAnalyzers" method="testLowerCaseTokenizer" externalInfoUrl="https://pmd.github.io/pmd-5.3.3/pmd-java/rules/java/strings.html#AvoidDuplicateLiterals" priority="3">
The String literal &quot;Tokenizer &#xd801;&#xdc1c;test&quot; appears 4 times in this file; the first occurrence is on line 177
</violation>

The problem in the block above is numeric character references such as &#xd801; (see http://stackoverflow.com/a/5110103/574815). When an XML parser (e.g. Java's javax.xml.stream) reaches this token, it throws an exception like

javax.xml.stream.XMLStreamException: ParseError at [row,col]:[230,44]
Message: Character reference "&#
        at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(Unknown Source) ~[na:1.8.0_20]
        at com.sun.xml.internal.stream.XMLEventReaderImpl.nextEvent(Unknown Source) ~[na:1.8.0_20]
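For anyone who wants to reproduce the failure without running PMD, here is a minimal sketch. It assumes the JDK's default StAX implementation; the class name SurrogateRefRepro and the helper parses are made up for illustration, and the XML snippet is a cut-down stand-in for the real report.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical reproduction helper, not part of PMD.
class SurrogateRefRepro {

    // Returns true if the StAX reader can walk the whole document without an error.
    static boolean parses(String xml) {
        try {
            XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                reader.next();
            }
            return true;
        } catch (XMLStreamException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Each half of the surrogate pair written as its own character reference.
        System.out.println(parses("<violation>&#xd801;&#xdc1c;</violation>"));
        // The same character written as a single supplementary code point reference.
        System.out.println(parses("<violation>&#x1041c;</violation>"));
    }
}
```

The first document should be rejected and the second accepted, which suggests the parser objects to the individual surrogate references rather than to the character itself.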

It seems the issue is not invalid XML syntax but an invalid XML character: U+D801 is a UTF-16 surrogate, and XML 1.0 excludes the surrogate range U+D800 to U+DFFF from its legal characters.
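To make the character range concrete, here is a small sketch of a legality check based on the Char production of XML 1.0. The class and method names are illustrative only and are not taken from PMD.

```java
// Hypothetical helper, not PMD code.
class XmlChars {

    // XML 1.0 Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isLegalXmlChar(int codePoint) {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
            || (codePoint >= 0x20 && codePoint <= 0xD7FF)
            || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
            || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    public static void main(String[] args) {
        // A surrogate on its own is not a legal XML character.
        System.out.println(isLegalXmlChar(0xD801)); // prints false

        // The pair D801 DC1C from the report combines into one supplementary code point.
        int combined = Character.toCodePoint((char) 0xD801, (char) 0xDC1C);
        System.out.println(Integer.toHexString(combined)); // prints 1041c
        System.out.println(isLegalXmlChar(combined));      // prints true
    }
}
```

So the string literal in the source is fine as Java; the trouble only appears when each UTF-16 code unit is written out as a separate character reference.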

I encountered this while analyzing the Apache Lucene codebase; the violation reported above came from this file: https://github.com/apache/lucene-solr/blob/trunk/lucene/analysis/common/src/test/org/apache/lucene/analysis/core/TestAnalyzers.java

I'll also attach a copy of that file, and a copy of the XML output from analyzing just that file, for convenience.

I'm not exactly sure what should be done about this. Maybe the characters should be redacted from the output. A possibly better option would be to represent characters that are invalid in XML as their Java escape sequences, i.e. &#xD801; would instead be written as \uD801.
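A rough sketch of that second idea follows. It is only an illustration under assumptions, not the change that was eventually committed to PMD: the class XmlEscapeSketch and its methods are hypothetical, and the choice to write a valid surrogate pair as one numeric reference (instead of two Java escapes) is mine.

```java
// Hypothetical escaping helper, not PMD's implementation.
class XmlEscapeSketch {

    // Legal characters per the XML 1.0 Char production.
    static boolean isLegalXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    static String escape(String text) {
        StringBuilder out = new StringBuilder();
        // Walk by code point so a valid surrogate pair is seen as one character.
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!isLegalXmlChar(cp)) {
                // Illegal in XML (for example a lone surrogate): emit a Java-style escape as plain text.
                out.append(String.format("\\u%04X", cp));
            } else if (cp == '&') {
                out.append("&amp;");
            } else if (cp == '<') {
                out.append("&lt;");
            } else if (cp == '>') {
                out.append("&gt;");
            } else if (cp == '"') {
                out.append("&quot;");
            } else if (cp > 0x7F) {
                // Legal non-ASCII character: one numeric reference for the whole code point.
                out.append("&#x").append(Integer.toHexString(cp)).append(';');
            } else {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String pair = new String(Character.toChars(0x1041C));
        System.out.println(escape("Tokenizer " + pair + "test"));
        System.out.println(escape("lone " + (char) 0xD801 + " surrogate"));
    }
}
```

With this approach the literal from the report would come out as a single reference to code point 1041c, and only a genuinely unpaired surrogate would fall back to the Java escape form.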

2 Attachments

TestAnalyzers.java

pmd output for TestAnalyzers.xml

Discussion

Andreas Dangel - 2015-10-16

status: open --> closed

assigned_to: Andreas Dangel

Milestone: New Tickets --> PMD-5.3.6

Andreas Dangel - 2015-10-16

This will be fixed with PMD 5.3.6 and later.

Commit: https://github.com/pmd/pmd/commit/3393507082938c28f62d1e08cc2e39092ff277df

Andreas Dangel - 2016-06-25

labels: --> core

summary: Invalid XML Characters in Output --> [core] XMLRenderer: Invalid XML Characters in Output
