#1454 R" stops syntax highlighting for C, C++

Milestone: Bug
Status: closed-fixed
Owner: Neil Hodgson
Priority: 5
Updated: 2014-08-25
Created: 2013-03-15
Creator: froschbrocken
Private: No

The combination of characters R" (i.e. a capital R followed by a quote) prevents the rest of the file from being syntax highlighted in C and C++ language modes, as demonstrated in this short C program:

#include <stdio.h>

#define R "buggy"

int main()
{
    printf("Hello, "R" world!\n");
    // this comment is no longer syntax highlighted!

    return 0;
}

This has been observed on SciTE v3.2.5.

Discussion

  • Neil Hodgson
    2013-03-15

    • labels: C, syntax --> C, syntax, lexer
    • assigned_to: Neil Hodgson
     
  • froschbrocken
    2013-03-15

    Ah, I see. Didn't know this construct, thanks for pointing it out. In this case, the parsing rule should definitely be adapted so that this works correctly. In a regular expression, this might be something like this:

    [^"\w](u|u8|U|L)?R"\(.*?\)"
    

    However, it only matches the first example on the Wikipedia page, not the one with an additional delimiter string, but that's not the point here.
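For readers unfamiliar with the construct, here is a minimal sketch of the two C++11 raw string forms mentioned above: the plain form, which the regular expression covers, and the form with an additional delimiter string, which it does not. The function names and the delimiter `xy` are illustrative assumptions only.

```cpp
#include <cassert>
#include <string>

// Plain raw string: the text between R"( and )" is taken literally,
// so the backslashes below are ordinary characters, not escapes.
std::string plainRaw() {
    return R"(C:\temp\new)";
}

// Raw string with a delimiter ("xy" here, chosen arbitrarily): the literal
// only ends at )xy", so the text may itself contain the sequence )".
std::string delimitedRaw() {
    return R"xy(say "(hi)" now)xy";
}

// Small self-check comparing both forms with ordinary escaped literals.
void checkRawStringExamples() {
    assert(plainRaw() == "C:\\temp\\new");
    assert(delimitedRaw() == "say \"(hi)\" now");
}
```

A lexer therefore has to remember the delimiter it saw after the opening quote in order to find the end of the literal.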

     
  • Neil Hodgson
    2013-03-18

    Fixed by rejecting raw strings when the character after the " is in " )\\\t\v\f\n".
    The C++11 standard documents this in the section "String literals".

    Committed as [71d931].
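The check described in this comment can be sketched roughly as follows. This is not the code from the commit; `canStartRawString` and its parameters are hypothetical names, and only the character set is taken from the comment above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper, not the actual Scintilla code.
// quotePos is the index of the " that follows the R prefix.
// Returns false when the character right after the quote cannot begin
// a raw string delimiter or the opening parenthesis.
bool canStartRawString(const std::string &text, std::size_t quotePos) {
    assert(quotePos < text.size() && text[quotePos] == '"');
    const std::size_t next = quotePos + 1;
    if (next >= text.size())
        return false; // nothing follows the quote
    const char ch = text[next];
    if (ch == '\0')
        return false; // strchr would match the terminating NUL
    // Same character set as quoted in the comment above.
    return std::strchr(" )\\\t\v\f\n", ch) == nullptr;
}
```

With this check, `R" world!` is rejected because a space follows the quote, while `R"(text)"` and `R"world!\n"` are both still accepted as the start of a raw string.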

     


  • froschbrocken
    2013-03-18

    Are you sure that this fixes the problem consistently? If I modify the above example by moving the whitespace into the define, I would say it does not work either:

    #include <stdio.h>
    
    #define R "buggy "
    
    int main()
    {
        printf("Hello, "R"world!\n");
        // this comment is no longer syntax highlighted!

        return 0;
    }
    
     
    • Neil Hodgson
      2013-03-18

      The fix was only for the case where the character immediately following the " is invalid. I do not know how the standard would interpret R"world!\n" since it looks like a raw string up until the \.

       
  • Neil Hodgson
    2013-03-18

    clang thinks it is a quite bogus raw string:

    >clang++ --std=c++11 raw.cpp
    raw.cpp:7:17: error: invalid suffix on literal; C++11 requires a space between literal and identifier [-Wreserved-user-defined-literal]
    printf("Hello, "R"world!\n");
                    ^
    raw.cpp:7:25: error: invalid character '\' character in raw string delimiter; use PREFIX( )PREFIX to delimit raw string
    printf("Hello, "R"world!\n");
                            ^
    raw.cpp:7:17: error: expected ')'
    printf("Hello, "R"world!\n");
                    ^
    raw.cpp:7:7: note: to match this '('
    printf("Hello, "R"world!\n");
          ^
    3 errors generated.
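The second clang error reflects the C++11 rule that a raw string delimiter may hold at most 16 characters and may not contain spaces, parentheses, backslashes, or the tab, vertical tab, form feed, and newline control characters. A minimal sketch of validating the whole delimiter, rather than only its first character, could look like this; `hasValidRawDelimiter` is a hypothetical name and this is not what the lexer does.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: quotePos is the index of the " after the R prefix.
// Returns true when the quote is followed by a valid delimiter (possibly
// empty) and then an opening parenthesis.
bool hasValidRawDelimiter(const std::string &text, std::size_t quotePos) {
    assert(quotePos < text.size() && text[quotePos] == '"');
    const std::string invalid = " )\\\t\v\f\n";
    std::size_t length = 0; // delimiter characters seen so far
    for (std::size_t i = quotePos + 1; i < text.size(); ++i, ++length) {
        const char ch = text[i];
        if (ch == '(')
            return true; // delimiter ended properly
        if (length == 16)
            return false; // delimiter would exceed 16 characters
        if (invalid.find(ch) != std::string::npos)
            return false; // character not allowed in a delimiter
    }
    return false; // ran out of text before finding '('
}
```

Applied to `R"world!\n"`, the scan stops at the backslash and reports failure, which corresponds to the column clang points at.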
    
     
  • froschbrocken
    2013-03-18

    You are totally right. I just tried compiling this example with g++ using the -std=c++0x parameter in order to compile it with raw string support, and it reports an error:

    cpp11test.cpp:7:25: error: invalid character '\' in raw string delimiter
    cpp11test.cpp:7:1: error: stray ‘R’ in program
    

    So, what we could have expected is that you cannot define a macro called "R" in C++11, since the symbol already exists. However, it is a valid example in pure C, so is there a way to separate the C parsing from the C++ parsing?

    Furthermore, if we extend the macro name to any string ending in an "R", it compiles in C as well as in C++11:

    #include <stdio.h>
    
    #define ABCR "buggy "
    
    int main()
    {
        printf("Hello, "ABCR"world!\n");
        // this comment is no longer syntax highlighted!

        return 0;
    }
    

    So, the parser should evaluate the characters BEFORE the R, that's what I tried to express above with the regexp ^"\w?R"(.*?)".

     
  • froschbrocken
    2013-03-18

    Oh, forgot to indent the regexp:

    [^"\w](u|u8|U|L)?R"\(.*?\)"
    
     
  • froschbrocken
    2013-03-18

    The quote character has to be removed from the beginning, and I removed the end for clarity:

    [^\w](u|u8|U|L)?R"
    

    In plain words: The only allowed character sequences before the "R" are "u", "u8", "U" or "L". Everything else before it must not be an alphanumeric character, otherwise it may not be treated as a raw string.
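The rule stated in plain words above can be sketched without a regular expression by looking at the whole identifier-like word that ends at the quote. `isRawStringPrefix` and `isWordChar` are hypothetical names; unlike the regular expression, this sketch also accepts a prefix at the very start of the text.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Word characters in the sense of the regex class \w.
static bool isWordChar(char ch) {
    return std::isalnum(static_cast<unsigned char>(ch)) != 0 || ch == '_';
}

// Hypothetical helper: quotePos is the index of a " in text.
// Returns true only when the word immediately before the quote is exactly
// R, uR, u8R, UR or LR, so that ABCR" is not treated as a raw string.
bool isRawStringPrefix(const std::string &text, std::size_t quotePos) {
    assert(quotePos < text.size() && text[quotePos] == '"');
    std::size_t start = quotePos;
    while (start > 0 && isWordChar(text[start - 1]))
        --start; // walk back to the beginning of the word
    const std::string word = text.substr(start, quotePos - start);
    return word == "R" || word == "uR" || word == "u8R" ||
           word == "UR" || word == "LR";
}
```

For the `ABCR"world!\n"` example the word before the quote is `ABCR`, so the quote starts an ordinary string literal, while `u8R"(raw)"` is recognised as a raw string.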

     
  • Neil Hodgson
    2013-03-18

    Is there an actual problem with the ABCR example with the updated Scintilla?

    There could be a C-only mode but that could hide potential bugs when a file is reused as C++. Particularly a problem for headers.

     
  • froschbrocken
    2013-03-18

    I cannot try since I haven't built Scintilla from source. Have you tried it?

     
    • Neil Hodgson
      2013-03-18

      Looks fine to me. I haven't been able to work out what the difference is to the current behaviour that you are trying to convey with the regex.

       
  • froschbrocken
    2013-03-18

    Oh, you are right! Didn't realize that the ABCR example already worked in the released version. Was confused, because in Notepad++ it did not work. However, thanks for the fix!

     
  • Neil Hodgson
    2013-04-01

    • status: open --> closed-fixed