#120 Input errors when using special characters in code.

release_3.2
closed
5
2012-10-10
2003-06-17
Jorge Murcia
No

When trying to 'checkstyle' source files with special
characters in identifiers, such as 'á' and 'é', I got input
errors and the check process failed. The tested files were
ISO-8859-1 encoded.

Discussion

  • Logged In: YES
    user_id=746148

    Could you please provide a sample of the input file?

     
  • Tim Tyler
    2003-06-17

    Logged In: YES
    user_id=796025

    int oXo; // The X represents hex 0xC0

    ...seems to produce the problem for me.

    Line 0: "Got an Exception - unexpected char".

    CS 3.1 via Eclipse.

    If I use 0x80 or 0xA0 instead things go wrong during
    compilation - but the line number is correct - not the
    normal stigmata of a Checkstyle exception.

     
  • Jorge Murcia
    2003-06-17

    A sample source code with configuration xml.

     
  • Logged In: YES
    user_id=746148

    The problem is in our lexer's rule for IDENT: at present we accept only Latin-1 letters and digits.
    Here is a link to the correct description of identifiers:
    http://java.sun.com/docs/books/jls/second_edition/html/lexical.doc.html#40625

    We need to modify our lexer to use Character.isJavaIdentifierStart() and Character.isJavaIdentifierPart() to resolve the problem completely.

     
  • Logged In: YES
    user_id=746148

    One possible way to fix this bug is to use the following rule for identifiers:
    IDENT
    options {testLiterals=true;}
    : ('a'..'z'|'A'..'Z'|'_'|'$'|'\u0080'..'\ufffe') ('a'..'z'|'A'..'Z'|'_'|'0'..'9'|'$'|'\u0080'..'\ufffe')*
    ;

    However, this is not exactly what the JLS describes.

     
  • Oliver Burn
    2003-06-19

    Logged In: YES
    user_id=218824

    I asked my friend who developed Clover what he thought about
    this. Here is his response.

    Oliver,

    We tried to rewrite the IDENT definition in the lexer without
    success. We are still using Antlr 2.7.1, and the bitsets
    generated to support the i18ned IDENT def were huge, and from
    memory the generated lexer broke. If you are using Antlr 2.7.2,
    your mileage may vary, because bitset generation was supposed
    to have been optimised.

    I ended up with a very yucky solution: generate the Lexer, and
    then hand-edit the generated code to use
    Character.isJavaIdentifierStart/Part() in the ident matching
    method.

    -Brendan

     
  • Logged In: YES
    user_id=746148

    I've committed to CVS, for 3.2, the changes suggested by Joe Comuzzi, and they fix the problem.

    IDENT
    options {testLiterals=true;}
    : ('a'..'z'|'A'..'Z'|'_'|'$'| {Character.isJavaIdentifierStart(LA(1))}? '\u0080'..'\ufffe')
    ('a'..'z'|'A'..'Z'|'_'|'0'..'9'|'$'| {Character.isJavaIdentifierPart(LA(1))}? '\u0080'..'\ufffe')*
    ;
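
For anyone who wants to see how the two Character methods treat particular characters, here is a minimal standalone sketch. It is not Checkstyle's lexer code: the class name IdentCheck and the method isIdentifier are made up for illustration. It only tests whether a string is lexically shaped like a Java identifier, does not exclude keywords, and works char by char, so supplementary characters are not handled.

```java
public class IdentCheck {

    /**
     * Hypothetical helper: returns true if s has the lexical shape of a
     * Java identifier. Keywords are NOT excluded.
     */
    static boolean isIdentifier(String s) {
        // The first character must be allowed to start an identifier.
        if (s.isEmpty() || !Character.isJavaIdentifierStart(s.charAt(0))) {
            return false;
        }
        // Every following character must be allowed inside an identifier.
        for (int i = 1; i < s.length(); i++) {
            if (!Character.isJavaIdentifierPart(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // 'o', then U+00C0 (capital A with grave, 0xC0 in ISO-8859-1), then 'o'
        System.out.println(isIdentifier("o\u00C0o")); // true
        // Underscore and dollar sign are both allowed
        System.out.println(isIdentifier("_x$"));      // true
        // A digit may not start an identifier
        System.out.println(isIdentifier("1abc"));     // false
        // U+00A0 (no-break space, 0xA0 in ISO-8859-1) is not an identifier character
        System.out.println(isIdentifier("a\u00A0b")); // false
    }
}
```

The escapes are written as \uXXXX so the file compiles the same way whatever encoding the compiler assumes for the source.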