If an error occurs while a source file is being tokenized, the following exception is thrown, but nothing in the message or stack trace identifies which source file is at fault.
Caused by: net.sourceforge.pmd.ast.TokenMgrError: Lexical error at line 1, column 1. Encountered: "#" (35), after : ""
at net.sourceforge.pmd.ast.JavaParserTokenManager.getNextToken(JavaParserTokenManager.java:2040)
at net.sourceforge.pmd.cpd.JavaTokenizer.tokenize(JavaTokenizer.java:38)
at net.sourceforge.pmd.cpd.CPD.add(CPD.java:106)
at net.sourceforge.pmd.cpd.CPD.add(CPD.java:72)
at org.sonar.plugins.cpd.CpdSensor.configureCPD(CpdSensor.java:93)
at org.sonar.plugins.cpd.CpdSensor.executeCPD(CpdSensor.java:73)
... 30 more
This is fixed on trunk with the unification of all the JavaCC parsers in PMD. I will not be fixing this on the 4.2.x line. IIRC, the C++ parser on 4.2.x provides file names, but the Java/JSP parsers do not.
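For anyone who has to stay on 4.2.x, one possible caller-side workaround is to wrap each per-file tokenize call and rethrow the failure with the file name attached. The following is only a rough sketch of that idea, not PMD or Sonar code: `SourceTokenizer`, `LexicalError`, and `tokenizeWithFileName` are hypothetical stand-ins invented for illustration, and the real call site (for example the `CPD.add` call seen in the trace) would have to be adapted by hand.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class FileAwareTokenizing {

    /** Hypothetical stand-in for whatever tokenizes a single source file. */
    interface SourceTokenizer {
        void tokenize(Path file);
    }

    /** Hypothetical stand-in for a lexer failure that carries no file name. */
    static class LexicalError extends Error {
        LexicalError(String message) {
            super(message);
        }
    }

    /**
     * Runs the tokenizer on one file and, on failure, rethrows with the
     * file name added to the message. The original failure is kept as the cause.
     */
    static void tokenizeWithFileName(SourceTokenizer tokenizer, Path file) {
        try {
            tokenizer.tokenize(file);
        } catch (RuntimeException | Error e) {
            // Assumption: the lexer failure may be an Error or a RuntimeException,
            // so both are caught here. This is deliberately broad and would also
            // wrap unrelated errors; narrow it once the concrete type is known.
            throw new IllegalStateException(
                    "Failed to tokenize " + file + ": " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        // Simulate a tokenizer that fails the same way as in the report above.
        SourceTokenizer failing = f -> {
            throw new LexicalError(
                    "Lexical error at line 1, column 1. Encountered: \"#\" (35), after : \"\"");
        };
        try {
            tokenizeWithFileName(failing, Paths.get("Broken.java"));
        } catch (IllegalStateException e) {
            // The message now starts with the offending file's path.
            System.out.println(e.getMessage());
        }
    }
}
```

Calling the wrapper once per file, rather than around the whole batch, is what makes the failing file identifiable.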