
#351 Issue with UTF-8 char on Windows 7

5.6.1
closed
None
1
2015-02-02
2013-03-20
No

On Windows, UTF-8 characters in Java files are causing Checkstyle errors:
"Got an exception - expecting ''', found '¡'"

Here is an example function that exhibits the issue (the error is reported on the lastChar == 'á' line):

~~~~~
/*
 * Given a string and a locale, attempts to pluralize the string.
 */
public static String pluralize(String term, ApplicationLocale locale) {

    if (StringUtils.isEmpty(term)) {
        return term;
    }

    char lastChar = term.charAt(term.length() - 1);

    String pluralized;
    if (ApplicationLocale.SPANISH == locale) {
        if (lastChar == 'a' || lastChar == 'á'
            || lastChar == 'e' || lastChar == 'é'
            || lastChar == 'i' || lastChar == 'í'
            || lastChar == 'o' || lastChar == 'ó'
            || lastChar == 'u' || lastChar == 'ú') {

            pluralized = term + "s";

        } else {

            pluralized = term + "es";

        }

    } else {

        if (lastChar == 's' || lastChar == 'x') {
            pluralized = term + "es";
        } else {
            pluralized = term + "s";
        }
    }

    return pluralized;
}

~~~~~

Discussion

  • Lars Koedderitzsch

    Hi, did you set the "charset" property of the root Checker module in your configuration file to the appropriate encoding?
    Otherwise Checkstyle will read and parse files using the platform encoding, which could lead to the kind of problem you are seeing.

     
  • Jasper Rosenberg

    Hi Lars. Yes unfortunately we did set the charset property (and I made sure to reload the cached file as well). It does seem to be Windows 7 specific as the developers on Ubuntu and OS X didn't see it. Also it does appear to be some kind of parsing issue because this fails:

    lastChar == 'á'

    but this passes fine:

    lastChar == "á".charAt(0)

     
  • Lars Koedderitzsch

    If you suspect a parsing issue, as opposed to an encoding-related issue, then it would be a Checkstyle core issue, which you should report to the Checkstyle project directly.

    However, I still suspect this to be an encoding/settings issue. The error indicates that the char definition is presented to Checkstyle as 2 characters, which it doesn't expect for a char definition ("expecting ''', found '¡'").
    And this in turn should only happen if the binary file is read into a textual representation using the wrong encoding, e.g. reading a UTF-8 encoded file with ISO-8859-1/CP-1252 encoding, effectively turning a 2-byte UTF-8 representation into 2 characters instead of the 1 character you get when reading with UTF-8 encoding.
    This would also explain why you only see this on Windows, where CP-1252 is the default platform encoding, as opposed to Ubuntu/OS X, where it's UTF-8.

    The 'charset' property mentioned above is the only means to instruct Checkstyle to use an encoding other than the platform encoding, and the plugin contains no other facilities in this area. So please double check the setting, and that the correct configuration file is actually being used.
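
The mis-decoding described above can be reproduced outside Checkstyle. The following is a minimal sketch in plain Java, not Checkstyle code; the class name `MisdecodeDemo` and the helper `decodeUtf8BytesAs` are made up for illustration. It encodes 'á' as UTF-8 and decodes the resulting bytes once with UTF-8 and once with ISO-8859-1 (CP-1252 maps these two particular bytes to the same characters).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MisdecodeDemo {

    // Hypothetical helper: encode the text as UTF-8, then decode the bytes
    // with the given charset, as a reader with a wrong setting would do.
    static String decodeUtf8BytesAs(String text, Charset decodeCharset) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, decodeCharset);
    }

    public static void main(String[] args) {
        String accented = "\u00E1"; // 'á', written as an escape so this file's own encoding does not matter

        String decodedCorrectly = decodeUtf8BytesAs(accented, StandardCharsets.UTF_8);
        String decodedWrongly = decodeUtf8BytesAs(accented, StandardCharsets.ISO_8859_1);

        System.out.println(decodedCorrectly.length()); // 1
        System.out.println(decodedWrongly.length());   // 2

        // Prints U+00C3 and U+00A1; the second one is the '¡' from the error message.
        for (char c : decodedWrongly.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

The second mis-decoded character is U+00A1 ('¡'), which matches the character named in the reported error message.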

     

    Last edit: Lars Koedderitzsch 2013-04-04
  • Jasper Rosenberg

    Hi Lars,

    I will double check the charset property, but could that explain why "á" is fine but 'á' fails?

    Thanks,
    Jasper

     
  • Lars Koedderitzsch

    Hi Jasper, I believe it could explain it, at least according to my mental model ;-)

    The á is a 2-byte character in UTF-8, so it's represented as 2 bytes in the binary file (UTF-8 encoded).
    When read with the wrong encoding, e.g. ISO-8859-1 or CP-1252, the 2 bytes become 2 characters (since these encodings only have 1-byte character representations).
    This way the string processed in Checkstyle contains a char declaration that appears to contain 2 characters, and that is what throws off Checkstyle's internal parsing, likely because the internal ANTLR syntax model only expects one character in a char literal (which is logical). Hence Checkstyle's error message "Got an exception - expecting ''', found '¡'": it expects the termination of the char literal (') but instead gets the second character of the mis-represented 2-byte character.

    If all of that is true then the problem should go away when Checkstyle reads the file with the proper encoding (UTF-8), and the charset property is the only way to do so (short of changing the "file.encoding" system property upon JVM startup).
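
To illustrate why reading the file with an explicit encoding makes the problem go away, here is a small sketch in plain Java. It does not use any Checkstyle API; the class `CharLiteralWidth` and its methods are hypothetical names for this illustration. It writes the problematic expression to a temporary file as UTF-8, reads it back with two different charsets, and counts how many characters end up between the single quotes.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class CharLiteralWidth {

    // Writes the expression from the ticket to a temp file, encoded as UTF-8.
    static Path writeSample() {
        try {
            Path file = Files.createTempFile("sample", ".java");
            Files.write(file, "lastChar == '\u00E1'".getBytes(StandardCharsets.UTF_8));
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Decodes the file with the given charset and returns the number of
    // characters between the first pair of single quotes.
    static int charsBetweenQuotes(Path file, Charset charset) {
        try {
            String text = new String(Files.readAllBytes(file), charset);
            int open = text.indexOf('\'');
            int close = text.indexOf('\'', open + 1);
            return close - open - 1;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = writeSample();
        // What a reader falls back to when no encoding is specified.
        System.out.println("platform default: " + Charset.defaultCharset());
        System.out.println(charsBetweenQuotes(file, StandardCharsets.UTF_8));      // 1
        System.out.println(charsBetweenQuotes(file, StandardCharsets.ISO_8859_1)); // 2
        Files.delete(file);
    }
}
```

With the wrong charset the char literal appears to hold two characters, which is not valid Java. A string literal such as "á" may hold any number of characters, so it still parses after mis-decoding, which is consistent with the "á".charAt(0) variant passing.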

     
  • Jasper Rosenberg

    Hi Lars. Thanks for the explanation. I'm embarrassed to say that I went back to the file today and reverted it to the way it was blowing up, and checkstyle didn't make a peep. I'm guessing that there must have been some kind of caching going on that hadn't picked up my <property name="charset" value="UTF-8"/> declaration before. (I had done a refresh and clean rebuild, as well as disable and enable checkstyle!) Thanks again for your help!

     
  • Jasper Rosenberg

    (last comment lost the xml tag)

    ... picked up my <property name="charset" value="UTF-8"/> declaration before.

     
  • Lars Koedderitzsch

    • status: open --> closed
    • assigned_to: Lars Koedderitzsch
     
