#351 Issue with UTF-8 char on Windows 7

Milestone: 5.6.1

Status: closed

Owner: Lars Koedderitzsch

Labels: None

Priority: 1

Updated: 2015-02-02

Created: 2013-03-20

Creator: Jasper Rosenberg

Private: No

On Windows, UTF-8 characters in Java files are causing Checkstyle errors:
"Got an exception - expecting ''', found '¡'"

Here is an example function which exhibits the issue (the error is reported on the lastChar == 'á' line):

~~~~~
/*
 * Given a string and a locale, attempts to pluralize the string.
 */
public static String pluralize(String term, ApplicationLocale locale) {

    if (StringUtils.isEmpty(term)) {
        return term;
    }

    char lastChar = term.charAt(term.length() - 1);

    String pluralized;
    if (ApplicationLocale.SPANISH == locale) {
        if (lastChar == 'a' || lastChar == 'á'
            || lastChar == 'e' || lastChar == 'é'
            || lastChar == 'i' || lastChar == 'í'
            || lastChar == 'o' || lastChar == 'ó'
            || lastChar == 'u' || lastChar == 'ú') {

            pluralized = term + "s";

        } else {

            pluralized = term + "es";

        }

    } else {

        if (lastChar == 's' || lastChar == 'x') {
            pluralized = term + "es";
        } else {
            pluralized = term + "s";
        }
    }

    return pluralized;
}

~~~~~

Discussion

Lars Koedderitzsch - 2013-03-21

Hi, did you set the "charset" property of the root Checker module in your configuration file to the appropriate encoding?
Otherwise Checkstyle will read and parse files using the platform encoding, which can lead to problems like the one you are seeing.

Jasper Rosenberg - 2013-03-22

Hi Lars. Yes, unfortunately we did set the charset property (and I made sure to reload the cached file as well). It does seem to be specific to Windows 7, as the developers on Ubuntu and OS X didn't see it. It also appears to be some kind of parsing issue, because this fails:

lastChar == 'á'

but this passes fine:

lastChar == "á".charAt(0)

Lars Koedderitzsch - 2013-04-04

If you suspect a parsing issue, as opposed to an encoding-related issue, then it would be a Checkstyle core issue, which you should report to the Checkstyle project directly.

However, I still suspect this to be an encoding/settings issue. The error indicates that the char literal is presented to Checkstyle as 2 characters, which it doesn't expect in a char literal ("expecting ''', found '¡'").
This in turn should only happen if the bytes of the file are decoded to text using the wrong encoding, e.g. reading a UTF-8 encoded file as ISO-8859-1/CP-1252, which effectively turns a 2-byte UTF-8 sequence into 2 characters instead of the 1 character you get when reading with UTF-8.
This would also explain why you only see this on Windows, where CP-1252 is the default platform encoding, as opposed to Ubuntu/OS X, where it's UTF-8.

As the 'charset' property mentioned above is the only means to instruct Checkstyle to use an encoding other than the platform encoding, and the plugin contains no other facilities in this area, please double-check the setting and that the correct file is actually used.
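
To make the mis-decoding concrete, here is a minimal Java sketch. It is only an illustration under stated assumptions: ISO-8859-1 stands in for the Windows platform encoding (CP-1252 maps the two bytes involved here to the same characters), and the class and method names are made up for this example and are not part of Checkstyle or the plug-in. The source text 'á' is written with a Unicode escape so the sketch itself does not depend on how the file is saved.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {

    /** Encodes the text as UTF-8, then decodes those bytes with ISO-8859-1 (the wrong charset). */
    static String misread(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    /** Encodes and decodes with UTF-8, so the text round-trips unchanged. */
    static String readCorrectly(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.UTF_8);
    }

    /** Formats each char of the text as a hex code unit, e.g. "27 e1 27". */
    static String codeUnits(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            if (i > 0) {
                out.append(' ');
            }
            out.append(Integer.toHexString(text.charAt(i)));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String charLiteral = "'\u00E1'"; // the three characters: quote, a-acute, quote
        System.out.println("read as UTF-8:      " + codeUnits(readCorrectly(charLiteral)));
        System.out.println("read as ISO-8859-1: " + codeUnits(misread(charLiteral)));
    }
}
```

Decoded correctly, the literal is three characters long. Decoded with the single-byte charset it is four, and the character sitting where the closing quote should be is U+00A1 (¡), which matches the character named in the error message.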

Last edit: Lars Koedderitzsch 2013-04-04

Jasper Rosenberg - 2013-04-04

Hi Lars,

I will double check the charset property, but could that explain why "á" is fine but 'á' fails?

Thanks,
Jasper

Lars Koedderitzsch - 2013-04-04

Hi Jasper, I believe it could explain it, at least according to my mental model ;-)

The á is a 2-byte character in UTF-8, so it's represented as 2 bytes in the binary (UTF-8 encoded) file.
When read with the wrong encoding, e.g. ISO-8859-1 or CP-1252, the 2 bytes become 2 characters (since these encodings only have 1-byte character representations).
This way the text processed by Checkstyle contains a char literal that appears to hold 2 characters, and that is what throws off Checkstyle's internal parsing, likely because the internal ANTLR grammar (logically) expects only one character in a char literal. Hence Checkstyle's error message "Got an exception - expecting ''', found '¡'": it expects the terminating quote of the char literal (') but instead gets the second character of the misread 2-byte sequence.

If all of that is true then the problem should go away when Checkstyle reads the file with the proper encoding (UTF-8), and the charset property is the only way to do so (short of changing the "file.encoding" system property upon JVM startup).
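
The difference between "á" and 'á' under this model can be sketched with a toy check in Java. This is not Checkstyle's ANTLR grammar: the two rules below are deliberately simplified assumptions (escape sequences are ignored), the names are hypothetical, and ISO-8859-1 again stands in for the Windows platform encoding.

```java
import java.nio.charset.StandardCharsets;

public class LiteralCheckDemo {

    /** Toy rule: a char literal is a quote, exactly one character, and a closing quote. */
    static boolean isValidCharLiteral(String token) {
        return token.length() == 3
                && token.charAt(0) == '\''
                && token.charAt(2) == '\'';
    }

    /** Toy rule: a string literal is any run of characters between double quotes. */
    static boolean isValidStringLiteral(String token) {
        return token.length() >= 2
                && token.charAt(0) == '"'
                && token.charAt(token.length() - 1) == '"';
    }

    /** Simulates reading UTF-8 bytes with a single-byte platform charset. */
    static String decodeAsLatin1(String sourceText) {
        byte[] utf8Bytes = sourceText.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String charToken = decodeAsLatin1("'\u00E1'");     // misread char literal
        String stringToken = decodeAsLatin1("\"\u00E1\""); // misread string literal
        System.out.println("char literal accepted:   " + isValidCharLiteral(charToken));
        System.out.println("string literal accepted: " + isValidStringLiteral(stringToken));
    }
}
```

A string literal may hold any number of characters, so the extra character produced by the wrong decoding goes unnoticed there (the string content is still wrong, just not a syntax error). A char literal has room for only one, so the same mis-decoding surfaces as a parse error.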

Jasper Rosenberg - 2013-04-09

Hi Lars. Thanks for the explanation. I'm embarrassed to say that I went back to the file today, reverted it to the version that was blowing up, and Checkstyle didn't make a peep. I'm guessing that there must have been some kind of caching going on that hadn't picked up my <property name="charset" value="UTF-8"/> declaration before. (I had done a refresh and a clean rebuild, as well as disabling and re-enabling Checkstyle!) Thanks again for your help!

Jasper Rosenberg - 2013-04-09

(last comment lost the xml tag)

... picked up my <property name="charset" value="UTF-8"/> declaration before.

Lars Koedderitzsch - 2013-06-27

status: open --> closed

assigned_to: Lars Koedderitzsch