#1454 R" stops syntax highlighting for C, C++

Bug
closed-fixed
5
2014-08-25
2013-03-15
No

The combination of characters R" (i.e. a capital R followed by a quote) prevents the rest of the file from being syntax highlighted in C and C++ language modes, as demonstrated in this short C program:

#include <stdio.h>

#define R "buggy"

int main()
{
printf("Hello, "R" world!\n");
// this comment is no longer syntax highlighted!

return 0;
}

This has been observed on SciTE v3.2.5.

Discussion

  • Neil Hodgson

    Neil Hodgson - 2013-03-15
    • labels: C, syntax --> C, syntax, lexer
    • assigned_to: Neil Hodgson
     
  • froschbrocken

    froschbrocken - 2013-03-15

    Ah, I see. Didn't know this construct, thanks for pointing it out. In this case, the parsing rule should definitely be adapted so that this works correctly. In a regular expression, this might be something like this:

    [^"\w](u|u8|U|L)?R"\(.*?\)"
    

    However, it only matches the first example on the Wikipedia page, not the one with an additional delimiter string, but that's not the point here.
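As a rough illustration of the delimiter case mentioned above, the pattern can be extended with a back-reference so that the closing delimiter has to repeat the opening one. This is only a sketch using `std::regex`; the function name `rawStringLength` is made up for this example and has nothing to do with how the Scintilla lexer is written.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper, not part of Scintilla: returns the length of the
// C++11 raw string literal that starts at the beginning of `text`,
// or 0 if `text` does not start with one.
std::size_t rawStringLength(const std::string &text) {
    // Group 1: optional encoding prefix (u8, u, U or L).
    // Group 2: delimiter of up to 16 characters; space, parentheses,
    //          backslash, double quote, tab, vertical tab, form feed
    //          and newline are excluded.
    // [\s\S]*? : the body, matched lazily so it stops at the first )delim".
    // \2       : back-reference, so the closing delimiter equals the opening one.
    static const std::regex pattern(
        R"re((u8|u|U|L)?R"([^ ()\\"\t\v\f\n]{0,16})\([\s\S]*?\)\2")re");
    std::smatch m;
    // match_continuous anchors the match at the first character of `text`.
    if (std::regex_search(text, m, pattern,
                          std::regex_constants::match_continuous)) {
        return static_cast<std::size_t>(m.length(0));
    }
    return 0;
}
```

With this sketch, `R"(abc)"` and `R"xy(a)"b)xy"` are both recognised, while `R"world!\n"` is not, because no `(` follows a valid delimiter.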

     
  • Neil Hodgson

    Neil Hodgson - 2013-03-18

    Fixed by rejecting raw strings when character after " is in " )\\\t\v\f\n".
    The C++11 standard documents this in the section "String literals".

    Committed as [71d931].

     

    Related

    Commit: [71d931]
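The check described in this comment might look roughly like the following. This is a minimal sketch of the idea only, not the code from the commit; the function name `canStartRawString` is an assumption made for illustration.

```cpp
#include <cstring>

// Hypothetical helper, not the code from commit [71d931]: decides whether
// the character that follows R" allows the text to be treated as a raw string.
bool canStartRawString(char afterQuote) {
    // The set quoted in the comment above: space, ')', backslash, tab,
    // vertical tab, form feed and newline.
    static const char invalid[] = " )\\\t\v\f\n";
    // strchr would also match the terminating '\0', so rule that out first.
    return afterQuote != '\0' && std::strchr(invalid, afterQuote) == nullptr;
}
```

Under this rule the original example `"Hello, "R" world!\n"` is rejected because a space follows `R"`, whereas a letter such as `w` is still accepted.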

  • froschbrocken

    froschbrocken - 2013-03-18

    Are you sure that this fixes the problem completely? If I modify the above example by moving the whitespace into the define, I would say it does not work either:

    #include <stdio.h>
    
    #define R "buggy "
    
    int main()
    {
    printf("Hello, "R"world!\n");
    // this comment is no longer syntax highlighted!
    
    return 0;
    }
    
     
    • Neil Hodgson

      Neil Hodgson - 2013-03-18

      The fix was only for the case where the character immediately following the " is invalid. I do not know how the standard would interpret R"world!\n" since it looks like a raw string up until the \.

       
  • Neil Hodgson

    Neil Hodgson - 2013-03-18

    clang thinks it is a quite bogus raw string:

    >clang++ --std=c++11 raw.cpp
    raw.cpp:7:17: error: invalid suffix on literal; C++11 requires a space between literal and identifier [-Wreserved-user-defined-literal]
    printf("Hello, "R"world!\n");
                    ^
    raw.cpp:7:25: error: invalid character '\' character in raw string delimiter; use PREFIX( )PREFIX to delimit raw string
    printf("Hello, "R"world!\n");
                            ^
    raw.cpp:7:17: error: expected ')'
    printf("Hello, "R"world!\n");
                    ^
    raw.cpp:7:7: note: to match this '('
    printf("Hello, "R"world!\n");
          ^
    3 errors generated.
    
     
  • froschbrocken

    froschbrocken - 2013-03-18

    You are totally right. I just tried compiling this example with g++ using the -std=c++0x parameter in order to compile it with raw string support, and it reports an error:

    cpp11test.cpp:7:25: error: invalid character '\' in raw string delimiter
    cpp11test.cpp:7:1: error: stray ‘R’ in program
    

    So, what we could have expected is that you cannot define a macro called "R" in C++11, since the symbol already exists. However, it is a valid example in pure C, so is there a way to separate the C parsing from the C++ parsing?

    Furthermore, if we extend the macro name to any string ending in "R", it compiles in C as well as in C++11:

    #include <stdio.h>
    
    #define ABCR "buggy "
    
    int main()
    {
    printf("Hello, "ABCR"world!\n");
    // this comment is no longer syntax highlighted!
    
    return 0;
    }
    

    So, the parser should evaluate the characters BEFORE the R, that's what I tried to express above with the regexp ^"\w?R"(.*?)".

     
  • froschbrocken

    froschbrocken - 2013-03-18

    Oh, forgot to indent the regexp:

    [^"\w](u|u8|U|L)?R"\(.*?\)"
    
     
  • froschbrocken

    froschbrocken - 2013-03-18

    The quote character has to be removed from the beginning, and I removed the end for clarity:

    [^\w](u|u8|U|L)?R"
    

    In plain words: The only allowed character sequences before the "R" are "u", "u8", "U" or "L". Everything else before it must not be an alphanumeric character, otherwise it may not be treated as a raw string.
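The rule stated in plain words above could be checked without a regular expression by looking backwards from the `R`. The sketch below is an assumption about how one might write it; `hasValidRawPrefix` is a made-up name and not a Scintilla function.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: `rPos` is the index of an 'R' in `line` that is
// immediately followed by a double quote. Returns true if the identifier
// characters before that R are nothing, "u", "u8", "U" or "L".
bool hasValidRawPrefix(const std::string &line, std::size_t rPos) {
    auto isWordChar = [](char c) {
        return std::isalnum(static_cast<unsigned char>(c)) != 0 || c == '_';
    };
    // Walk back to the start of the identifier that ends with this R.
    std::size_t start = rPos;
    while (start > 0 && isWordChar(line[start - 1])) {
        --start;
    }
    // Whatever lies between the identifier start and the R is the prefix.
    const std::string prefix = line.substr(start, rPos - start);
    return prefix.empty() || prefix == "u" || prefix == "u8" ||
           prefix == "U" || prefix == "L";
}
```

For `printf("Hello, "ABCR"world!\n");` the prefix is `ABC`, so the `R"` is not treated as the start of a raw string, while `u8R"(a)"` and a bare `R"(a)"` are accepted.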

     
  • Neil Hodgson

    Neil Hodgson - 2013-03-18

    Is there an actual problem with the ABCR example with the updated Scintilla?

    There could be a C-only mode but that could hide potential bugs when a file is reused as C++. Particularly a problem for headers.

     
  • froschbrocken

    froschbrocken - 2013-03-18

    I cannot try since I haven't built Scintilla from source. Have you tried it?

     
    • Neil Hodgson

      Neil Hodgson - 2013-03-18

      Looks fine to me. I haven't been able to work out what the difference is to the current behaviour that you are trying to convey with the regex.

       
  • froschbrocken

    froschbrocken - 2013-03-18

    Oh, you are right! Didn't realize that the ABCR example already worked in the released version. Was confused, because in Notepad++ it did not work. However, thanks for the fix!

     
  • Neil Hodgson

    Neil Hodgson - 2013-04-01
    • status: open --> closed-fixed
     
